Probabilistic and Semantic Descriptions of Image Manifolds and Their Applications

This paper begins with a description of methods for estimating image probability density functions, reflecting the observation that such data are usually constrained to lie in restricted regions of the high-dimensional image space: not every pattern of pixels is an image. It is common to say that images lie on a lower-dimensional manifold in the high-dimensional space. However, it is not the case that all points on the manifold have an equal probability of being images. Images are unevenly distributed on the manifold, and our task is to devise ways to model this distribution as a probability distribution. We therefore consider popular generative models. For our purposes, generative/probabilistic models should have two properties: 1) sample generation: the possibility to sample from the distribution given by the modelled density function, and 2) probability computation: given a previously unseen sample from the dataset of interest, one should be able to compute its probability, at least up to a normalising constant. To this end, we investigate the use of methods such as normalising flows and diffusion models. We then show how semantic interpretations are used to describe points on the manifold. To achieve this, we consider an emergent language framework that uses variational autoencoders for a disentangled representation of points that reside on a given manifold. Trajectories between points on a manifold can then be described as evolving semantic descriptions. We also show that such (bounded) probabilistic descriptions can be used to improve semantic consistency by constructing defences against adversarial attacks. We evaluate our methods in terms of improved semantic robustness and out-of-distribution (OoD) detection capability, explainable and editable semantic interpolation, and improved classification accuracy under patch attacks. We also discuss the limitations of diffusion models.


INTRODUCTION
Understanding the complex probability distribution of the data is essential for image authenticity and quality analysis, but is challenging due to its high dimensionality and intricate domain variations (Pope et al., 2021; Gomtsyan et al., 2019). Seen images usually have high probabilities on a low-dimensional manifold embedded in the higher-dimensional space of the image encoder. Nevertheless, image embeddings encoded with methods such as a pretrained CLIP encoder (Ramesh et al., 2020) lie within a narrow cone of the unit sphere rather than covering the entire sphere (Tyshchuk et al., 2023; Gao et al., 2019), which complicates the aforementioned picture of the probability distribution. Hence, on such a manifold, it is unlikely that every point can be decoded into a realistic image, because the probability mass is unevenly distributed. It is therefore important to compute the probability in the latent space to indicate whether the corresponding image lies in a high-density region of the space (Hajri et al., 2017; Klein et al., 2022; Grover et al., 2018; Papamakarios et al., 2021; Chang et al., 2017; Lobato et al., 2016; Coeurdoux et al., 2022). This helps to distinguish seen images from unseen images, or synthetic images from real images. Some works train a discriminator with positive (real) and negative (synthetic) examples in the manner of contrastive learning (Liu et al., 2022) or analyse their frequency differences (Wang et al., 2020). However, they do not address this problem using the probabilistic framework afforded by modern generative models.
In this work, we calculate the exact log-probability of an image by utilising generative models that assign high probabilities to seen images and low probabilities to unseen images. Since the confidence of such probabilities is usually related to image fidelity, we also introduce efficient and effective (with improved semantic robustness) generation strategies using a hierarchical structure and large sampling steps, stabilised with the fourth-order Runge-Kutta method (RK4) (Runge, 1895; Kutta, 1901). Specifically, we use normalising flow (NF) (Rezende and Mohamed, 2016; Papamakarios et al., 2021) and diffusion models (DMs) (Ho et al., 2020; Luo, 2022; Song et al., 2021) as image generators. NF models learn an image embedding space that conforms to a predefined distribution, usually a Gaussian. In contrast, DMs diffuse images with Gaussian noise in each forward step and learn denoising gradients for the backward steps. A random sample from the Gaussian distribution can then be analytically represented on an image manifold and visualised through an image decoder (for NF models) or denoiser (for diffusion models). In prior work, NF models for exact likelihood estimation (Rezende and Mohamed, 2016; Kobyzev et al., 2019; Zhang and Chen, 2021) and with hierarchical structures (Hu et al., 2023; Liang et al., 2021; Voleti et al., 2023) have been explored in model training. To the best of our knowledge, however, the likelihood distributions of seen and unseen images have not been studied with a hierarchical structure (without losing image quality) from the manifold perspective. The same applies to diffusion models, noting the difficulty of combining such exact likelihoods with the mean squared error loss in diffusion training.
Samples from these image generators can be thought of as having several meaningful semantic attributes. It is often desirable that these attributes be orthogonal to each other in the sample latent space so as to achieve a controllable and interpretable representation. In this work, we disentangle semantics in the latent space by using a variational autoencoder (VAE) (Kingma and Welling, 2013) in the framework of emergent languages (EL) (Havrylov and Titov, 2017; Mu et al., 2023; Tucker et al., 2021; Kubricht et al., 2020; Pang et al., 2020). This allows the latent representation on the manifold to be more robust, interpretable, compositional, controllable, and transferable. Although VAE variants such as β-TCVAE (Chen et al., 2018), GuidedVAE (Ding et al., 2020), and DCVAE (Parmar et al., 2021) achieve competitive semantic disentanglement results, we mainly focus on understanding the effectiveness of the emergent language framework for VAE-based disentanglement, inspired by (Xu et al., 2022), and on demonstrating the feasibility of our GridVAE (with a mixture-of-Gaussians prior) under such an EL framework to study semantic distributions on the image manifold. We also evaluate semantic robustness on such a manifold against adversarial and patch attacks (Brown et al., 2017; Liu et al., 2020; Hwang et al., 2023; Xiang et al., 2021; Chou et al., 2019; Madry et al., 2018; Tramer et al., 2017; Carlini and Wagner, 2016) and defend against the same attacks by enforcing semantic consistency with a purification loss.
We organise this paper into three sections, each with its own experiments: log-likelihood estimation for a given image under normalising flows and diffusion models (see Section 2); semantic disentanglement in emergent languages for a latent representation of object attributes, using the proposed GridVAE model (see Section 3); and adversarial attacks and defences in image space to preserve semantics (see Section 4).

LIKELIHOOD ESTIMATION WITH IMAGE GENERATORS
We evaluate the log-probability of a given image using 1) a hierarchical normalising flow model, 2) a diffusion model adapted to taking large sampling steps, and 3) a diffusion model that uses a higher-order solver to increase generation robustness.

Hierarchical Normalising Flow Models
Figure 1. A 4-level hierarchical normalising flow model, where each level involves the functions (f_i, g_i, g′_i, f′_i). The normalising flow (NF) model is based on Glow (Kingma and Dhariwal, 2018); the downsampling block decreases image resolution by a factor of two; and the output of each higher (i > 1) level is conditioned on the output of the lower level. We first train all autoencoders {f_i, f′_i} jointly, then train all flows {g_i, g′_i} jointly, to obtain the generated image x′_1. The latent variable z_i conforms to the standard Gaussian distribution N(0, 1) during training; at test time, z_i is sampled from N(0, 1) for image generation.
Normalising flow (NF) refers to a sequence of invertible functions that may be used to transform a high-dimensional image space into a low-dimensional embedding space corresponding to a probability distribution, usually Gaussian. Dimensionality reduction is achieved via an autoencoding framework. For the hierarchical model, the latent vector corresponding to the image x_i at each level i is computed as

y_i = f_i(x_i), z_i = g_i(y_i),

and the inversion of this process reconstructs the latent z′_i to x′_i as

y′_i = g′_i(z′_i), x′_i = f′_i(y′_i),

where the decoder f′_i and flow inverse function g′_i are inversions of the encoder f_i and flow function g_i respectively, and z′_i can be z_i or randomly sampled from N(0, 1). We illustrate hierarchical autoencoders and flows for rich and high-level spatial information with conditioning variables in either image space or latent space. In Fig. 1, we show a 4-level hierarchical normalising flow model, where each set of functions (f_i, g_i, g′_i, f′_i) corresponds to one level and where g′_i and f′_i are conditioned on the higher-level reconstruction, that is

x′_i = f′_i(g′_i(z′_i; x′_{i+1}); x′_{i+1}).

The model is learned in two phases: joint learning of all autoencoders {f_i, f′_i} and then joint learning of all flows {g_i, g′_i} with the pretrained autoencoders, for all i ∈ {1, 2, 3, 4}. The loss function for autoencoder learning, denoted L_ae, is the mean squared error (MSE) between the reconstructed data and the processed data, and for the learning of flows the objective is to minimise the negative log-probability of y_i, denoted L_flow, such that the distribution of the latent variable is modelled as the standard Gaussian distribution, from which a random latent variable can be sampled for data generation. Given N pixels and C channels (C = 3 for an RGB image and C = 1 for a greyscale image), x_i at level i can be represented as x_i = {x_ij} for all j ∈ {1, ..., N}; the autoencoder loss is then given by

L_ae(x_i) = (1/(NC)) Σ_{j=1}^{N} ||x′_ij − x_ij||²,

and the flow loss for the latent at level i is the negative log-probability of y_i, that is L_flow(y_i) = −log p_Y(y_i), computed using the change of variables as

−log p_Y(y_i) = (d_i/2) log(2π) + (1/2) ||g_i(y_i)||² − log |det J_{y_i}(g_i)|,

where d_i is the dimension of the ith latent and J_X(·) computes the Jacobian matrix with respect to X.
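The change-of-variables computation above can be sketched numerically; this is a minimal illustration in which the flow output and a scalar accumulated log-determinant are assumed to be given (in a real flow, the layers return the log-determinant themselves):

```python
import numpy as np

def flow_log_prob(z, log_det_jacobian, d):
    # Change-of-variables log-probability: if z = g(y) is the flow output,
    #   log p_Y(y) = log N(z; 0, I) + log |det J_y(g)|.
    # `log_det_jacobian` is the accumulated log |det J| over all flow layers.
    log_gauss = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(z ** 2))
    return log_gauss + log_det_jacobian
```

The negative of this value is the per-sample flow loss L_flow.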
Similarly, the log-probability of x_i at level i is obtained by applying the same change of variables through the level-i autoencoder and flow. Then, the log-probability of an image at level i under hierarchical autoencoders and flows with multiple downsampling layers, x_{i+1} = d(x_i), can be accumulated across levels with the chain rule, where [·] is a binary indicator.

Diffusion Models
Different from normalising flow models, which sample in a low-dimensional embedding space due to the otherwise large computational complexity, diffusion models diffuse every image pixel in the image space independently, enabling pixelwise sampling from the Gaussian distribution. We outline below a strategy and formulas that allow uneven or extended-step diffusion in the backward diffusion process.

Multi-step Diffusion Sampling
Forward process. The standard description of a denoising diffusion model (Ho et al., 2020) defines a sequence of random variables {x_0, x_1, ..., x_T} according to a forward diffusion process

x_t = √α_t x_{t−1} + √β_t ϵ,

where β_t = 1 − α_t, x_t is a sample from a random variable X_t, and ϵ is a sample from the standard (multidimensional) Gaussian. The index t takes integer values between 0 and T, and the set of random variables forms a Markov chain.
The idea can be extended to define a continuous family of random variables according to the rule

x_t = √ᾱ_t x_0 + √β̄_t ϵ,

where β̄_t = 1 − ᾱ_t, and for simplicity we can assume that x_t is defined for t taking continuous values in the interval [0, 1]. Here, the values ᾱ_t are a decreasing function of t with ᾱ_0 = 1 and ᾱ_1 = 0. It is convenient to refer to t as time.
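The closed-form forward rule can be sketched directly; here `alpha_bar_t` stands for the schedule value ᾱ_t (argument names are illustrative):

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, rng):
    # Closed-form forward diffusion:
    #   x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    # with eps drawn from a standard Gaussian of the same shape as x_0.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

At ᾱ_t = 1 this returns x_0 unchanged; at ᾱ_t = 0 the sample is pure Gaussian noise, matching the boundary conditions ᾱ_0 = 1 and ᾱ_1 = 0.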
It is easily seen that if {0 = t_0, t_1, ..., t_T = 1} is an increasing set of time instants between 0 and 1, then the sequence of random variables {X_{t_0}, ..., X_{t_T}} forms a Markov chain. Indeed, it can be computed that for 0 ≤ s < t ≤ 1, the conditional probabilities p(x_t | x_s) are Gaussian:

p(x_t | x_s) = N(x_t; √ᾱ_st x_s, β̄_st I),

where ᾱ_st = ᾱ_t / ᾱ_s and β̄_st = 1 − ᾱ_st. This is the isotropic normal distribution with mean √ᾱ_st x_s and variance β̄_st. Similarly to Eq. (9), one has

x_t = √ᾱ_st x_s + √β̄_st ϵ.

This applies in particular when s and t refer to consecutive time instants t_i and t_{i+1}. In this case, the joint probability of {X_{t_0}, ..., X_{t_T}} is given by p(x_{t_0}, x_{t_1}, ..., x_{t_T}) = p(x_{t_0}) ∏_{i=0}^{T−1} p(x_{t_{i+1}} | x_{t_i}). One also observes from Eq. (10) that p(x_1) is a standard Gaussian distribution. A special case is where the time steps are chosen evenly spaced between 0 and 1; thus, if h = 1/T, the time instants can be written as t_i = ih. Backward process. The joint probability distribution is also a Markov chain when written in the reverse order, as p(x_{t_0}, x_{t_1}, ..., x_{t_T}) = p(x_{t_T}) ∏_{i=1}^{T} p(x_{t_{i−1}} | x_{t_i}). This allows us to generate samples from X_0 by choosing a sample from X_1 = X_{t_T} (a standard Gaussian distribution) and then successively sampling from the conditional probability distributions p(x_{t_{i−1}} | x_{t_i}).
Unfortunately, although the forward conditional distributions p(x_{t_i} | x_{t_{i−1}}) are known Gaussian distributions, the backward distributions are not known and are not Gaussian. In general, for s < t, the conditional distribution p(x_t | x_s) is Gaussian, but the inverse p(x_s | x_t) is not.
However, if (t − s) is small, or more exactly, if the variance of the added noise, β̄_st = 1 − ᾱ_st, is small, then the distributions can be accurately approximated by Gaussians with the same variance β̄_st as the forward conditionals. With this assumption, the form of the backward conditional p(x_s | x_t) is specified just by determining its mean, denoted by µ(x_s | x_t). The training process of the diffusion model consists of learning (using a neural network) the function µ(x_s | x_t) as a function of x_t. As explained in Ho et al. (2020), it is not necessary to learn this function for all pairs (s, t), as will be elaborated below.
We follow and generalise the formulation in Ho et al. (2020). The training process learns a function ϵ_θ(x_t, t) that minimises the expected least-squares loss

L(θ) = E_{x_0, ϵ, t} ||ϵ − ϵ_θ(x_t, t)||²,

where x_t = √ᾱ_t x_0 + √β̄_t ϵ. As such, it estimates (exactly, if the optimum function ϵ_θ is found) the expected value of the added noise given x_t (note that it estimates the expected value of the added noise, not the actual noise, which cannot be predicted). In this case, following Ho et al. (2020),

µ(x_{t−1} | x_t) = (1/√α_t) (x_t − (β_t / √(1 − ᾱ_t)) ϵ_θ(x_t, t)).

In this form, the formula is easily generalised to

µ(x_s | x_t) = (1/√ᾱ_st) (x_t − (β̄_st / √(1 − ᾱ_t)) ϵ_θ(x_t, t)).

As for the variance of p(x_s | x_t), Ho et al. (2020) assume that p(x_{t−1} | x_t) is an isotropic Gaussian (although in reality it is not exactly a Gaussian, nor exactly isotropic). The covariance matrix of this Gaussian is denoted by σ²_st I, and two possible choices are given, which generalise naturally to

σ²_st = β̄_st  or  σ²_st = β̄_st (1 − ᾱ_s) / (1 − ᾱ_t).

As pointed out in Ho et al. (2020), both of these are compromises. The first choice expresses the approximation that the variance of the noise added in the backward process is equal to the variance added in the forward process. As mentioned, this is true for small time steps.
Thus, in our work, we choose to model the reverse conditional as

p(x_s | x_t) ≈ N(x_s; µ(x_s | x_t), σ²_st I),

where µ(x_s | x_t) is given by Eq. (18) and σ²_st is given by Eq. (19). This is an approximation of the true conditional probability p(x_s | x_t).
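Under these approximations, one backward sampling step from time t to s < t can be sketched as follows. Here `eps_pred` stands in for the network output ϵ_θ(x_t, t), and the first variance choice σ²_st = β̄_st is used; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def reverse_step(x_t, eps_pred, abar_s, abar_t, rng):
    # One backward step x_t -> x_s (s < t) under the Gaussian approximation
    # of p(x_s | x_t), with mean from the generalised Eq. (18) and
    # variance sigma^2_st = beta_bar_st (first choice in Eq. (19)).
    abar_st = abar_t / abar_s            # cumulative alpha over (s, t]
    beta_st = 1.0 - abar_st              # noise variance added over (s, t]
    mean = (x_t - beta_st / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_st)
    sigma = np.sqrt(beta_st)
    return mean + sigma * rng.standard_normal(x_t.shape)
```

When s = t (ᾱ_s = ᾱ_t), the step degenerates to the identity, as expected of a zero-length interval.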

Probability Estimation
In the following, we choose a finite set of T time instants (usually equally spaced) {0 = τ_0, τ_1, ..., τ_T = 1} and consider the Markov chain consisting of the variables X_{τ_t}, for t ∈ {0, ..., T}, at these time instants. For simplicity, we write X_t instead of X_{τ_t}, and x_t for a sample from the corresponding random variable. This notation then corresponds to the common notation in the literature, but also applies in the case of unevenly or widely sampled time instants.
To distinguish between the true probabilities of the variables X_t and the modelled conditional probabilities, the true probabilities will be denoted by q (instead of p, which was used previously). The modelled probabilities will be denoted by p_θ(x_{t−1} | x_t), and the probability distribution of X_T, which is Gaussian, will be denoted by p(x_T).
The image probability can be calculated by using the forward and backward processes for each step of a pretrained diffusion model. The joint probability p(x_{0:T}) and the probability of the clean input x_0 can be computed using the forward and backward conditional probabilities, q(x_{t+1} | x_t) and p_θ(x_t | x_{t+1}) respectively. Each sampling pair (x_t, x_{t+1}), where t ∈ S = {0, 1, 2, ..., T − 1}, follows the Markov chain rule, resulting in the joint probability

p(x_{0:T}) = p(x_T) ∏_{t∈S} p_θ(x_t | x_{t+1}).

The negative log-probability of the input image x_0 is then

−log p(x_0) = −log p(x_T) − Σ_{t∈S} log p_θ(x_t | x_{t+1}) + Σ_{t∈S} log q(x_{t+1} | x_t).   (23)

Computing Eq. (23) can be decomposed into three steps: 1) Calculating log p(x_T). Since x_0 is fully diffused after T forward steps, x_T follows the standard Gaussian distribution N(0, 1), and its log-likelihood depends only on the Gaussian density. 2) Calculating log q(x_{t+1} | x_t). Each forward conditional is the known Gaussian of Eq. (9), so these terms are computed directly.
3) Calculating log p_θ(x_t | x_{t+1}). Similarly, the probability p_θ(x_t | x_{t+1}) is modelled as a Gaussian, with mean and variance given by Eq. (18) and Eq. (19) (where s = t and the conditioning time is t + 1), so the backward conditional probabilities are easily computed.
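The three-step decomposition of Eq. (23) can be sketched as a single-trajectory estimate. All means and variances are assumed to have been precomputed from Eqs. (9), (18), and (19); the array arguments and function names are illustrative:

```python
import numpy as np

def gaussian_log_density(x, mean, var):
    # Diagonal-Gaussian log-density, summed over all pixels.
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def image_nll(xs, fwd_means, fwd_vars, bwd_means, bwd_vars):
    # -log p(x_0) per Eq. (23):
    #   -[ log p(x_T) + sum_t log p_theta(x_t | x_{t+1})
    #                 - sum_t log q(x_{t+1} | x_t) ]
    # xs[t] is the sample at step t; the per-step means/vars are precomputed.
    T = len(xs) - 1
    log_p = gaussian_log_density(xs[T], np.zeros_like(xs[T]), np.ones_like(xs[T]))
    for t in range(T):
        log_p += gaussian_log_density(xs[t], bwd_means[t], bwd_vars[t])      # p_theta
        log_p -= gaussian_log_density(xs[t + 1], fwd_means[t], fwd_vars[t])  # q
    return -log_p
```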

Higher-order Solution
Under the hypothesis that high-fidelity image generation is capable of maintaining image semantics, in each of the diffusion inversion steps the estimate of x_0 and the log-likelihood calculation should be stable and reliable with a small distribution variance. The diffusion inversion, however, usually requires a sufficiently small sampling step h: DDPM (Ho et al., 2020) only supports h = 1, and DDIM (Song et al., 2021) is sensitive to large h, as evidenced in Fig. 8. It is therefore important to alleviate the effect of h on the generation step by stabilising the backward process in diffusion models.
Without loss of generality, the fourth-order Runge-Kutta method (RK4) (Runge, 1895; Kutta, 1901) can achieve a stable inversion process by constructing a higher-order update rule for solving an initial value problem. Different from the traditional RK4, the diffusion inversion requires inverse-temporal updates because the denoising gradient runs from the initial noisy image at t = T to the clean image at t = 0. We provide the formulation of the traditional RK4 and our inverse-temporal version in the appendix.
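As a reference point, an inverse-temporal RK4 update for a generic dynamics dx/dt = f(x, t) can be sketched as below; this is a minimal illustration in which f stands in for the (unspecified here) denoising drift, not the appendix formulation itself:

```python
import numpy as np

def rk4_reverse_step(x, t, h, f):
    # One inverse-temporal RK4 update x(t) -> x(t - h) for dx/dt = f(x, t).
    # The classical RK4 stages are evaluated at decreasing times.
    k1 = f(x, t)
    k2 = f(x - 0.5 * h * k1, t - 0.5 * h)
    k3 = f(x - 0.5 * h * k2, t - 0.5 * h)
    k4 = f(x - h * k3, t - h)
    return x - (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

For the test case dx/dt = x with x(1) = 1, a single backward step of size h = 1 already approximates the true value x(0) = e^{-1} to within about 2%, illustrating the fourth-order accuracy that makes large sampling steps viable.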

Experiments
For both the hierarchical normalising flows (NFs) and the diffusion models (DMs), we first show the effectiveness of likelihood estimation for analysing the image distribution (on 2D images for NFs and point samples for DMs). For likelihood estimation with image fidelity, we then illustrate the quality of images generated by our models (sampling on the manifold from a Gaussian distribution, as well as resolution enhancement in NFs and sampling-step exploration with RK4 stabilisation in DMs).

Experiments on Hierarchical Normalising Flow Models
Probability Estimation. Fig. 2 illustrates the probability density estimation at level 3 for an in-distribution dataset, CelebA (Liu et al., 2015), and an out-of-distribution dataset, CIFAR10 (Krizhevsky, 2009). The distribution of the latent variable z_i for CelebA is concentrated at a higher mean value than that of CIFAR10, since z_i is trained to follow the standard Gaussian distribution. This tendency is preserved in the image space, as illustrated by log p(x_i). In this way, outlier samples from the in-distribution dataset can be detected by their small estimated probability. Random Image Generation. Image reconstructions with encoded latent variables and conditional images, as well as random samples, are provided in Fig. 3. For the low-level autoencoder and flow, say at level 1, conditioned on the sequence of decoded x_i for i = {2, 3, 4}, the reconstruction of x_1 is close to the processed images, although some facial details are lost due to the downsampling mechanism; see Fig. 3(A). When randomly sampling {z_i} from the normal distribution at each level, the generated human faces are smooth but have blurry details in regions such as hair and chin, and lack a realistic background.
Image Super-resolution. With the jointly trained autoencoders and flows on CelebA, images with low resolution, 3 × 8 × 8 (channel × height × width) and 3 × 16 × 16, are decoded to 3 × 64 × 64 with smooth human faces; see Fig. 4(A) and Fig. 4(B) respectively. The low-resolution image x_i is used as a conditioning image for 1) the NF inverses {g′_i}, to generate an embedding code combined with the randomly sampled z_i ∼ N(0, 1), and 2) the decoders {f′_i}, concatenated with all upsampling layers in each decoder. This preserves facial details from both high and low levels for realistic image generation. As the resolution of the low-resolution images increases, the embedding code contains richer details.

Experiments on Diffusion Models
Log-likelihood Estimation on Point Samples. We evaluate the log-probability of each point of the point samples shown in Fig. 5. DDPM Sampling with Large Steps. While Fig. 5 uses h = 1 as the standard DDPM sampling process, it is feasible to sample with a fairly large step without losing sample quality. This enables sampling from the Gaussian distribution for log-likelihood estimation with less running time. To visualise the image quality, we evaluate samples on the CelebA dataset using a pretrained diffusion model with 1,000 forward diffusion steps. In Fig. 6, the sampling step h increases over {2, 10, 100}; the samples have high quality for h = {2, 10} and fair quality for h = 100.
Higher-order Solution Stabilises Sampling. While sampling with a large step h can introduce bias relative to sampling with a small h, RK4 effectively alleviates this bias. We evaluate both the point samples and human face images from CelebA. In Fig. 7, compared with the samples obtained using DDPM, RK4 with DDPM inference achieves less noise at h = {2, 5, 10}. For h = 20, RK4 expectedly performs worse because it applies only 5 sampling steps while the training used T = 100 diffusion steps. In Fig. 8, we apply DDIM as the inference method for RK4 to compare the samples with DDIM deterministically. As h increases from 1 to 100, many of the samples using DDIM lose image consistency with the samples at h = 1; however, most of the samples using RK4 retain it. This indicates the robustness of applying RK4 with a large sampling step: RK4 is more robust than DDIM, especially at h = 100, retaining higher image consistency with the samples at h = 1.
Semantics of object attributes are crucial for characterising the image distribution and its spatial presentation. For instance, different shapes in Fig. 5 represent different objects, and those closer to the seen samples have high likelihood; in Fig. 8, semantics such as gender (see the image in the 2nd row and 3rd column with DDIM and RK4) are fundamental for controllable generation by sampling in high-density regions of specific semantic clusters on the manifold. These semantics, however, are usually entangled, without distributions independent of each other, which hinders deterministic embedding sampling on the image manifold (Ling et al., 2022; Liu et al., 2018; Pastrana, 2022). Hence, independently of the image generation model, we exploit the popular and efficient variational autoencoder and introduce our GridVAE model for effective semantic disentanglement on the image manifold.

Formulation
A variational autoencoder (VAE) (Kingma and Welling, 2013) is a neural network that maps inputs to a distribution instead of a fixed vector. Given an input x, the encoder with neural network parameters ϕ maps it to a hidden representation z. The decoder, taking the latent representation z as its input, with neural network parameters θ, reconstructs an output as similar as possible to the input x. We denote the encoder q_ϕ(z|x) and the decoder p_θ(x|z). The hidden representation follows a prior distribution p(z).
With the goal of making the posterior q_ϕ(z|x) close to the actual distribution p_θ(z|x), we minimise the Kullback-Leibler divergence between these two distributions. Specifically, we aim to maximise the log-likelihood of generating real data while minimising the difference between the real and estimated posterior distributions by using the evidence lower bound (ELBO) as the VAE loss function

L(θ, ϕ) = −log p_θ(x) + D_KL(q_ϕ(z|x) || p_θ(z|x)) = −E_{z∼q_ϕ(z|x)}[log p_θ(x|z)] + D_KL(q_ϕ(z|x) || p_θ(z)),   (24)

where the first term is the reconstruction loss and the second term is the regularisation for q_ϕ(z|x) to be close to p_θ(z). The prior distribution of z is often chosen to be a standard unit isotropic Gaussian, which implies that the components of z should be uncorrelated and hence disentangled. If each variable in the latent space is representative of only a single element, we assume that this representation is disentangled and can be well interpreted.
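With a diagonal-Gaussian posterior q_ϕ(z|x) = N(µ, diag(σ²)) and a standard-normal prior, the KL term of Eq. (24) has a closed form. A minimal sketch, in which an MSE reconstruction stands in for −log p_θ(x|z) up to constants:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    # Negative ELBO, Eq. (24): reconstruction term plus the closed-form KL
    #   KL(N(mu, s^2) || N(0, I)) = 0.5 * sum(s^2 + mu^2 - 1 - log s^2).
    recon = np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl
```

The loss vanishes exactly when the reconstruction is perfect and the posterior matches the prior (µ = 0, σ = 1).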
Emergent language (EL) (Havrylov and Titov, 2017) refers to a language that arises spontaneously in a multi-agent system without any pre-defined vocabulary or grammar. EL has been studied in the context of artificial intelligence and cognitive science to understand how language can emerge from interactions between agents. EL has the potential to be compositional, in that it allows novel composite concepts to be referred to by combining individual representations of their components according to systematic rules. However, for EL to be compositional, the latent space needs to be disentangled (Chaabouni et al., 2020). Hence, we integrate a VAE into the EL framework by replacing the sender LSTM with the encoder of the VAE, noting that the default LSTM encoder will entangle the symbols due to its sequential structure, where the previous output is given as the input for the next symbol. In contrast, the symbols can be disentangled with a VAE encoder.
To achieve disentangled representations in EL, the VAE encoder must be able to cluster similar concepts into discrete symbols that are capable of representing attributes or concepts. Standard VAEs are powerful, but their prior distribution, typically the standard Gaussian, is ill-suited to clustering tasks, particularly regarding the location and number of cluster centres. In the EL setting, we desire a posterior distribution with multiple clusters, which naturally leads to a mixture-of-Gaussians (MoG) prior distribution with K components,

p(z) = (1/K) Σ_{k=1}^{K} N(z | µ_k, σ²_k).

We choose the µ_k to be located on a grid in a Cartesian coordinate system so that the posterior distribution clusters can be easily determined based on a sample's distance to a cluster centre. We refer to this new formulation as GridVAE, which is a VAE with a predefined MoG prior on a grid. The KL-divergence term in Eq. (24) can then be re-written as

D_KL(q_ϕ(z|x) || p(z)) = E_{z∼q_ϕ(z|x)}[log q_ϕ(z|x) − log p(z)].

The log-probability of the prior is easily calculated with the MoG distribution, and we only need to estimate the log-probability of the posterior, using a large batch size during training. By using a GridVAE, we can obtain a posterior distribution with multiple clusters that correspond to the same discrete attribute, while allowing for variations within the same cluster to generate different variations of the attribute.
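The prior term log p(z) under an equal-weight grid MoG can be evaluated stably with a log-mean-exp over components; a small sketch with illustrative argument names (`centres` holds the grid points µ_k, `sigma` the shared component standard deviation):

```python
import numpy as np

def mog_log_prob(z, centres, sigma):
    # log p(z) for an equal-weight mixture of K isotropic Gaussians:
    #   p(z) = (1/K) sum_k N(z | centres[k], sigma^2 I).
    d = z.shape[-1]
    sq = np.sum((z[None, :] - centres) ** 2, axis=-1)            # [K]
    log_comp = -0.5 * sq / sigma ** 2 - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    m = np.max(log_comp)                                         # log-mean-exp
    return m + np.log(np.mean(np.exp(log_comp - m)))
```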

Experiments
We evaluate the clustering and disentanglement capabilities of the proposed GridVAE model using a two-digit MNIST dataset (LeCun et al., 1998) consisting of digits 0 to 5. Each digit is from the original MNIST dataset, resulting in a total of 36 classes [00, 01, 02, ..., 55].
To extract features for the encoder, we use a 4-layer ResNet (He et al., 2016), with its mirror as the decoder. The VAE latent space is 2-dimensional (2D), and if the VAE learns a disentangled representation, each dimension of the 2D latent space should represent one of the digits. We use a 2D mixture of Gaussians (MoG) as the prior distribution, with 6 components in each dimension centred at the integer grid points [-2, -1, 0, 1, 2, 3]; that is, the coordinates of the cluster centres are [(-2, -2), (-2, -1), ..., (3, 3)]. The standard deviation of each mixture component is 1/3. After training the model, we generate a scatter plot of the test-set latent space, as shown in Fig. 9. Since the prior is a mixture of Gaussians on the grid points, if the posterior matches the prior, we can simply draw a boundary in the middle of two grid points, illustrated by the red lines in Fig. 9.
With the trained model, one can sample in the latent space for image generation. In Figs. 10(A)-(B), we decode from the cluster centres (i, j): in (A) we keep j = 0 and change i from −2 to 3, while in (B) we keep i = 0 and change j from −2 to 3. The latent space is disentangled with respect to the two digits: the first dimension of the latent space controls the first digit, while the second dimension controls the second digit. Each of the cluster centres corresponds to a different number.
Figs. 10(C)-(D) show images generated within the cluster centred at (1, 1), that is, the digit pair "44". If we slightly modify one of the dimensions, it produces different variations of the number "4" along this dimension, while keeping the other digit unchanged.
Overall, these results demonstrate the effectiveness of the proposed GridVAE model in clustering and disentangling the latent space on the two-digit MNIST dataset.

Scaling Up GridVAE
In Sec. 3.1, the two-digit MNIST dataset lies in a 2-dimensional latent space. However, many real-world datasets require a much higher-dimensional space.

Addressing Higher Dimensional Latent Space
Discretising a continuous space, as in GridVAE, is challenging due to the curse of dimensionality (Bellman, 1957). This refers to the exponential growth in the number of clusters as the number of dimensions increases, which leads to a computational challenge when dealing with a high-dimensional latent space. For example, when applying GridVAE to reconstruct images of the CelebA (Liu et al., 2015) dataset to learn its 40 attributes, we need a 40-dimensional latent space with two clusters in each dimension to represent the presence or absence of a given attribute. Firstly, parametrising the mixture-of-Gaussians prior p(z) = (1/K) Σ_{k=1}^{K} N(z | µ_k, σ²_k) over 40 dimensions is prohibitively expensive, as K = 2^40 ≈ 1.1 × 10^12. Secondly, the assumption of equal probability for the components, which was appropriate for the simple two-digit MNIST dataset, is no longer valid. This is because the attributes in the CelebA dataset are not uniformly distributed, and some combinations may not exist. For instance, the combination of "black hair" + "blonde hair" + "brown hair" + "bald" is impossible due to attribute conflicts. To address this issue, we use the loss function in Eq. (24) with a relaxation. To avoid pre-parametrising p(z) over 40 dimensions, we implement a dynamic calculation of the KL-divergence between q_ϕ and p_θ, whereby only the cluster closest to the latent representation is considered, as illustrated in Fig. 11. This means that clusters to which the data point does not belong do not affect its distribution, and the MoG prior is simplified to a single multivariate Gaussian

p(z) = N(z | µ_p, Σ²), where µ_p = R(µ) and Σ² = diag(σ²_0, ..., σ²_0),

with the round function R(·) returning the closest integer in each dimension.
The key step here is that the round function dynamically selects the cluster centre closest to the posterior mean µ, and σ_0 is a pre-defined standard deviation for the prior distribution. It should be chosen so that two neighbouring clusters have a reasonable degree of overlap; for example, σ_0 = 1/16 in some of our following experiments. The KL-divergence term becomes

D_KL(q_ϕ(z|x) || N(z | R(µ), σ²_0 I)),

which has a closed form for a diagonal-Gaussian posterior. By adopting Eq. (28), we can significantly reduce the computational complexity of the model, even for a high-dimensional latent space, bringing it to a level comparable to that of a standard VAE. It is worth noting that global disentanglement may no longer be guaranteed. Rather, the model only provides local disentanglement within the proximity of each cluster.
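The dynamic nearest-cluster relaxation amounts to a standard diagonal-Gaussian KL against the rounded centre; a minimal sketch with illustrative names:

```python
import numpy as np

def nearest_cluster_kl(mu, log_var, sigma0):
    # KL(q || p) where p is the single component whose integer-grid centre
    # is closest to the posterior mean: centre = round(mu), prior std sigma0.
    # Closed-form diagonal-Gaussian KL, summed over dimensions.
    mu_p = np.round(mu)                    # dynamic nearest grid centre R(mu)
    var_q = np.exp(log_var)
    var_p = sigma0 ** 2
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu - mu_p) ** 2) / var_p - 1.0)
```

The term is zero exactly when the posterior already sits on a grid centre with variance σ₀², and grows as the posterior drifts between centres.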
Upon training the GridVAE with a 40-dimensional latent space using Eq. (28) on the CelebA dataset, we observe some intriguing disentanglement phenomena. Fig. 12 showcases the disentanglement of two latent-space dimensions, where the first dimension governs one attribute and the second dimension determines another. Combining these two dimensions leads to simultaneous attribute changes in the generated images.
An inherent limitation of this unsupervised approach is that, while the latent space appears to be locally disentangled for each image, the same dimension may have different semantic interpretations across different images. To address this issue, we introduce all 40 attributes of the dataset during training. This should establish an upper bound on the disentanglement.

From Unsupervised to Guided and Partially Guided GridVAE
Thus far, we have described an unsupervised approach to learning the latent representation of images. However, for datasets like CelebA with ground-truth attributes, we can incorporate them into the latent space to guide the learning. Specifically, for each image in a batch we extract the 40-dimensional attribute vector indicating the presence or absence of each feature and treat it as the ground-truth cluster centre μ_i^gt. Hence, instead of rounding the latent representation μ_i in Eq. (28), we replace it with μ_i^gt. One limitation of this approach is the requirement of ground-truth attributes for all images, which may not always be available or feasible. Additionally, it is important to note that while we refer to this approach as "guided", the given attribute information serves only as the cluster-assignment prior in the latent space, and the VAE reconstruction task remains unsupervised. This differs from classical supervised learning, where the label information is the output. Furthermore, in our approach, no specific coordinate in the latent space is designated for the input. Instead, we provide guidance that the sample belongs to a cluster centred at a certain point in the latent space.
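The switch from unguided rounding to a guided cluster centre amounts to a one-line substitution in the prior. A hedged sketch (the function and argument names are ours, not the paper's):

```python
def prior_centre(mu, attributes=None):
    """Select the prior cluster centre for one latent vector.

    Unguided GridVAE snaps each latent mean to the nearest integer grid
    point; the guided variant substitutes the ground-truth binary
    attribute vector (one 0/1 value per dimension) when it is available.
    """
    if attributes is not None:                 # guided: mu_gt from the labels
        return [float(a) for a in attributes]
    return [float(round(m)) for m in mu]       # unguided: R(mu)
```

The rest of the KL computation is unchanged; only the centre fed into Eq. (28) differs between the two regimes.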
This guided learning framework can be extended to a subset of the 40 attributes or to a latent space with more dimensions. For clarity, we will refer to the latter as "partially guided" to distinguish it from the commonly used "semi-supervised", which typically refers to training on a partially labelled dataset.
We conduct experiments using attribute information as latent space priors and obtain the following findings for the guided approach: (a) GridVAE is able to cluster images accurately based on their attributes, and the same dimension has the same semantic meaning across different images. For instance, dimension 31 represents "smile". (b) GridVAE could not generate images for clusters that have little or no representation in the training set. For example, attempting to generate an image of a bald female by constraining GridVAE to the "female" and "bald" clusters does not yield an accurate representation. (c) Some attributes are more universal across different images, such as the ability to add a smile to almost any face. However, other attributes, such as gender, are not always modifiable. This could be because the attributes are not independent and can be influenced by one another. Universal attributes, such as "smile", seem to be located primarily in a local image region without interference from the other attributes, see Fig. 13.
To further illustrate the incompleteness of and correlation among the attributes in the CelebA dataset, we use a subset of the given attributes. We choose 38 of the 40 attributes, excluding attributes 20 (female/male) and 31 (not smiling/smiling). Fig. 14 shows that the GridVAE cannot learn the omitted attributes. This highlights the interdependence of different attributes in the latent space.

Combining Manifolds of GridVAE Disentangled Attribute and Facial Recognition
After achieving a disentangled latent space, one may still wonder about the usefulness of a semantic description of a manifold. Consider the scenario where another manifold, such as a facial recognition manifold, is learned. By studying these two manifolds jointly, we can gain insights that make the models more explainable and useful. One potential application is to better understand the relationship between facial attributes and facial recognition. By analysing the disentangled latent space of facial attributes together with the manifold learned for facial recognition, we can potentially identify which attributes are the most important for recognising different faces. This understanding can then be used to improve the performance of facial recognition models as well as to explain the model decisions.
For instance, FaceNet (Schroff et al., 2015) directly learns a mapping from face images to a compact Euclidean space where distances correspond to a measure of face similarity. To discover the semantic structure of this manifold with x as binary attributes, we can follow these steps:
1. Build a face recognition manifold using contrastive learning.
2. Use the CelebA dataset with ground truth attribute labels (40 binary values).
4. Find the nearest neighbour for each CelebA sample using the face recognition manifold coordinates.
5. For each attribute in x, compute p(x) over the entire CelebA dataset.
6. For each attribute in x, compute p(x | x of nearest neighbour = 0).
7. For each attribute in x, compute the KL divergence between p(x) and p(x | x of nearest neighbour = 0).
8. Identify the attributes with the largest KL divergence.
Fig. 15 demonstrates that the KL divergence between p(x) and p(x | x of nearest neighbour = 0) is significantly larger for certain attributes, such as "male", "wearing lipstick", "young" and "no beard", than for the others. This indicates that the neighbourhood structure of the facial recognition manifold is markedly different from the distribution of these attributes in the entire dataset. These findings highlight the importance of jointly studying different manifolds to gain a more profound understanding of the relationship between attributes and recognition tasks. By incorporating this into the models, we can potentially improve the performance of facial recognition models and also enhance their interpretability.
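Steps 5-7 above can be sketched as follows, assuming each sample's attributes are given as a list of 0/1 values and each sample is paired with its nearest neighbour's attributes. The helper names, the smoothing constant, and the KL direction (conditional against marginal) are our own choices:

```python
import math

def bernoulli_kl(p, q, eps=1e-6):
    """KL( Bernoulli(p) || Bernoulli(q) ), with clamping for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def attribute_divergences(attrs, neighbour_attrs):
    """For each binary attribute a, KL between the conditional
    p(x_a | x_a of nearest neighbour = 0) and the marginal p(x_a)."""
    n_attr = len(attrs[0])
    out = []
    for a in range(n_attr):
        marginal = sum(row[a] for row in attrs) / len(attrs)
        cond_rows = [row for row, nb in zip(attrs, neighbour_attrs) if nb[a] == 0]
        conditional = sum(row[a] for row in cond_rows) / max(len(cond_rows), 1)
        out.append(bernoulli_kl(conditional, marginal))
    return out
```

Attributes whose divergence is large are those for which the neighbourhood structure of the recognition manifold departs most from the dataset-wide distribution, which is the signal Fig. 15 visualises.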

APPLICATION TO DEFEND PATCH ATTACKS
Thus far, interpretable and controllable sampling from each semantic distribution on the manifold can be achieved by combining the semantic disentanglement of Section 3 with the high-fidelity, diverse image generation and probability-distribution analysis of Section 2. It is also of strong interest to enhance the robustness of such semantic sampling under certain attacks. In this section, we present an adversarial robustness framework that enforces semantic consistency between the classifier and the decoder for reliable density estimation on the manifold.

Adversarial Defence with Variational Inference
In (Yang et al., 2022), adversarial robustness is achieved by enforcing semantic consistency between a decoder and a classifier (adversarial robustness does not hold for a non-semantically-consistent classifier-decoder pair). We briefly review the adversarial purification framework below. We define the real-world high-dimensional data as x ∈ R^n, which lies on a low-dimensional manifold M diffeomorphic to R^m with m ≪ n. We define an encoder function f : R^n → R^m and a decoder function f† : R^m → R^n to form an autoencoder. For a point x ∈ M, f† and f are approximate inverses. We define a discrete label set L of c elements as L = {1, ..., c} and a classifier in the latent space as h : R^m → L. The encoder maps the image x to a lower-dimensional vector z = f(x) ∈ R^m, and the functions f and h together form a classifier in the image space, h(z) = (h ∘ f)(x) ∈ L.
A classifier (on the manifold) is semantically consistent if its predictions agree with the semantic interpretations of the images reconstructed by the decoder. Although the classifier and the decoder (on the manifold) have a low input dimension, it is still difficult to achieve high semantic consistency between them. We therefore assume that predictions and reconstructions from high-density regions of p(z|x) are more likely to be semantically consistent, and we estimate the probability density in the latent space with variational inference.
We define three sets of parameters: 1) ϕ parametrises the encoder distribution, denoted q_ϕ(z|x); 2) θ parametrises the decoder distribution, denoted p_θ(x|z); and 3) ψ parametrises the classification head, h_ψ(z). These parameters are jointly optimised with respect to the ELBO loss and the cross-entropy loss, L(ϕ, θ, ψ) = −ELBO(x; ϕ, θ) + λ · L_CE(h_ψ(z), y) with z ∼ q_ϕ(z|x) (Eq. 29), where λ is the trade-off term between the ELBO and the classification loss. We provide the framework in Figs. 16(A)-(B) for the two-stage procedure and show the trajectory of the cluster-centre change after introducing our purification against attacks in Fig. 16(C). By adopting this formulation, we observe a remarkable semantic consistency between the decoder and the classifier. Specifically, on Fashion-MNIST (Xiao et al., 2017), when making predictions on adversarial examples, if the predicted label is "bag", we observe that the reconstructed image tends to resemble a "bag" as well. This phenomenon is illustrated in Fig. 16(D) and Fig. 17.
To defend against image-level attacks, a purification vector can be obtained through test-time optimisation of the ELBO loss. For example, given an adversarial example x_adv, a purified sample is obtained as x_pfy = x_adv + ϵ_pfy with ϵ_pfy = argmax_{ϵ ∈ C_pfy} ELBO(x_adv + ϵ; ϕ, θ) (Eq. 30), where C_pfy = {ϵ ∈ R^n | x_adv + ϵ ∈ [0, 1]^n and ∥ϵ∥_p ≤ ϵ_th} is the feasible set for purification and ϵ_th is the purification budget. Since the classifier and the decoder are semantically consistent, once the reconstructions are restored to normal, the predictions of the classifier also return to normal, defending against the attack.
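The constraint set C_pfy can be enforced by projection after every gradient step of the ELBO ascent. A minimal sketch for the ℓ∞ case (assuming p = ∞; the function and variable names are illustrative, not from the paper):

```python
def project_purification(x_adv, eps, eps_th):
    """Project a purification vector eps onto
    C_pfy = { e : x_adv + e in [0, 1]^n and ||e||_inf <= eps_th }.

    A PGD-style purification loop would ascend the ELBO and apply this
    projection after every step.
    """
    out = []
    for x, e in zip(x_adv, eps):
        e = min(max(e, -eps_th), eps_th)   # enforce the l_inf budget
        e = min(max(e, -x), 1.0 - x)       # keep x_adv + e inside [0, 1]
        out.append(e)
    return out
```

Because both constraints are axis-aligned boxes, applying the two clamps per coordinate gives the exact Euclidean projection onto their intersection.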

Bounded Patch Attack
In this work, we focus on ℓ0-bounded attacks (Brown et al., 2017; Papernot et al., 2016) from the manifold perspective, which was not investigated in the prior work. In contrast to full image-level attacks such as ℓ2- and ℓ∞-bounded attacks (Madry et al., 2018), patch attacks, which are ℓ0-bounded, restrict the number of perturbed pixels. These attacks are more feasible to implement in real-world settings, resulting in broader impacts. Below, we conduct an initial investigation into the defence against patch attacks by leveraging knowledge of the data manifold.
Compared with ℓ∞ attacks, ℓ0 attacks, such as adversarial patch attacks, introduce larger perturbations to the perturbed pixels. Therefore, we remove the purification bound for patch-attack purification. Without this constraint, the purified examples can take any values within the image space. A purification vector can then be obtained through test-time optimisation of the ELBO loss as shown in Eq. (30). We use the gender classification model of (Yang et al., 2022) to demonstrate the adversarial purification of ℓ0-bounded attacks. To ensure that the adversarial examples do not alter the semantic content of the images, we restrict the perturbation region to the forehead of a human face. The patch for perturbation is a rectangle measuring 16 × 32, see Fig. 17. For the patch attacks, we conduct 2,048 iterations with step size 1/255 using PGD (Madry et al., 2018) and PGD-NAG (Nesterov Accelerated Gradient) (Lin et al., 2020). In Table 1, the purification is carried out through 256 iterations with the same step size.
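Restricting an ℓ0-bounded perturbation to a 16 × 32 forehead rectangle can be implemented as a binary mask multiplied into the attack (and purification) update. A sketch; the anchor coordinates are illustrative:

```python
def patch_mask(height, width, top, left, ph=16, pw=32):
    """Binary mask for an l0-bounded patch attack: only the ph x pw
    rectangle anchored at (top, left) may be perturbed."""
    return [[1 if top <= r < top + ph and left <= c < left + pw else 0
             for c in range(width)]
            for r in range(height)]
```

Multiplying each gradient step elementwise by this mask guarantees that the number of perturbed pixels never exceeds ph × pw, i.e., the ℓ0 budget of the patch.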

LIMITATION
The current version of log-probability estimation in diffusion models has limitations when evaluating high-dimensional images. Specifically, at early denoising steps (when t is small) the diffusion model serves as a denoiser, so x_t and x_{t+h} are similar, while at large steps (as t approaches T) their difference is also small due to the high proportion of Gaussian noise in x_t. Consequently, the contribution of the difference between x_t and x_{t+h}, which is what makes out-of-distribution detection effective, is small compared with the log p accumulated over the process. We leave this as an open problem for future work.

CONCLUSION
This work studies geometric image representations, from the high-dimensional image space to a low-dimensional latent space on the image manifold. To explore the image probability distribution, under the assumption that real images usually lie in a high-density region while not all samples from the distribution can be represented as realistic images, we incorporate log-likelihood estimation into the procedures of normalising flows and diffusion models. Meanwhile, we explore a hierarchical normalising-flow structure and a higher-order solver for diffusion models for high-quality and high-fidelity image generation. For interpretable and controllable sampling from the semantic distribution on the manifold, we then propose the GridVAE model under an emergent-language (EL) framework to disentangle the elements of the latent variable on the image manifold. To test semantic and reconstruction robustness on the manifold, we first apply patch attacks and defences in the image space and then effectively recover the semantics under such attacks with our purification loss. Experiments show the effectiveness of probability estimation in distinguishing seen examples from unseen ones; the quality and efficiency of image generation with large sampling steps; meaningful representations in which varying specific element(s) of the latent variable controls the corresponding object attribute(s) in the image space; and well-preserved semantic consistency under patch attacks.

CONFLICT OF INTEREST STATEMENT
Authors Peter Tu, Zhaoyuan Yang, and Yiwei Fu are employed by General Electric. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

APPENDIX 1 THE RUNGE-KUTTA METHOD FOR DIFFUSION MODELS
We first revisit the Runge-Kutta method (RK4), which solves initial value problems (Runge, 1895; Kutta, 1901; Wikipedia, 2023). Given an initial value problem f(x_t, t) = dx_t/dt, where x_t is associated with time t, the estimate of x at time (t + h) with step size h is computed by x_{t+h} = x_t + (h/6)(k_1 + 2k_2 + 2k_3 + k_4) (Eq. 31), where k_1 = f(x_t, t), k_2 = f(x_t + (h/2)k_1, t + h/2), k_3 = f(x_t + (h/2)k_2, t + h/2), and k_4 = f(x_t + h·k_3, t + h). For an initial value x_0 at time t = 0, one can estimate x_T by applying Eq. (31) iteratively from t = 0 to the terminal time t = T.
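As a concrete reference, the classical RK4 update of Eq. (31) can be written as follows. This is the generic solver, not the diffusion-specific variant of Eq. (37):

```python
def rk4_step(f, x, t, h):
    """One classical Runge-Kutta (RK4) step for dx/dt = f(x, t)."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4_solve(f, x0, t0, t1, n):
    """Integrate dx/dt = f(x, t) from t0 to t1 in n RK4 steps."""
    x, t, h = x0, t0, (t1 - t0) / n
    for _ in range(n):
        x = rk4_step(f, x, t, h)
        t += h
    return x
```

For example, integrating dx/dt = x from x(0) = 1 to t = 1 recovers e to high accuracy, reflecting the method's fourth-order global error.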
For the backward process of a diffusion model, the reverse step can be written as x_{t−h} = g(x_t, ϵ_{t−h}), where g(·) refers to the backward step in DDPM (Ho et al., 2020) with Gaussian noise or DDIM (Song et al., 2021) without Gaussian noise, and ϵ_{t−h} is the reversing gradient from x_t to x_{t−h}, denoted the "model prediction". To apply RK4 to the reverse step, we follow the same rule as Eq. (31) but change the moving step to x_{t−h} = g(x_t, (k_1 + 2k_2 + 2k_3 + k_4)/6) (Eq. 37), where k_1 = ϵ(x_t, t), k_2 = ϵ(g(x_t, k_1), t − ⌊h/2⌋), k_3 = ϵ(g(x_t, k_2), t − ⌊h/2⌋), k_4 = ϵ(g(x_t, k_3), t − h), and ⌊·⌋ rounds down to the nearest integer due to the discretisation of the sampling time space. The difference between the moving steps of Eq. (37) and Eq. (31) is the multiplication of k_i by h or (h/2), for i ∈ {1, 2, 3, 4}.
Empirically, applying these multiplications in diffusion sampling leads to highly unrealistic image generation even for small h. We thus hypothesise that the moving steps are already accounted for in g(·), owing to the joint effect of the sampling coefficients in g(·), ϵ_t, ϵ_{t−⌊h/2⌋}, and ϵ_{t−h}.

Figure 2. Log-likelihood estimation using hierarchical autoencoders and flows: (a) log p(z), (b) log p(y), (c) log p(x). The encoder and flow are trained on CelebA and evaluated on CelebA and CIFAR10. The x-axis is log p(·) and the y-axis is the histogram density. In each subfigure, the first row is the in-distribution dataset CelebA, the second row is the out-of-distribution CIFAR10, and the last row shows both. In (A), log p(z) can detect outlier samples, and adding log |det(·)| from the NF and the autoencoder does not significantly affect the distribution tendency, see (B) and (C). For better visualisation, samples with log p(·) less than -10,000 are filtered out.

Figure 3a. Reconstruction at level 1 with {z_i} from encoders {g_i} and conditioned on {f′_i}.

Figure 3. Image reconstruction and generation with end-to-end training of 4-level autoencoders and flows. For each pair of columns from left to right in (A), the left is the real image and the right is the reconstructed image.

Figure 4. Image super-resolution on the CelebA dataset. The first column is low-resolution images, the second column is real images, and the rest are high-resolution images with latent variables {Z_i} ∼ N(0, 1) conditioned on the low-resolution images, with temperature 1.0.

Figure 5. Evaluation of the log-probability of x_0 on point samples, with 10,000 points each. (A) The training is on a Swiss roll sample with a diffusion model with forward (noising) and backward (denoising) processes. (B) At the evaluation phase, unseen samples, that is, circle, moon, and S, have lower log p(x) values than the seen Swiss roll sample. In (B), the first row is the sampled points, and the middle and last rows are the mean value and the standard deviation of log p(x) for each point over 100 random rounds, represented as "mean ± std". The randomness lies in the random noise of the forward and backward processes. A lighter colour indicates a higher density. (C) Statistics indicate the higher density of a seen sample (Swiss roll) than an unseen one (circle, moon, or S) through the diffusion model, using the negative of Eq. (23) on a log10 scale.

Figure 7. Sampling robustness of DDPM and RK4 at step size h. With h being 5 or 10, RK4 still achieves clear sampling compared with DDPM. If h is too large, for instance 20, RK4 fails as expected.

Figure 9. Scatter plot of the test-set latent space with an MoG prior.

Figure 10. Generated images from sampling the latent space. (A) The second dimension is fixed at 0 while the first dimension changes from -2 to +3. (B) The first dimension is fixed at 0 while the second dimension changes from -2 to +3. (C) Around the cluster centre (1, 1), the second dimension is kept fixed and the first dimension is changed. (D) Around the cluster centre (1, 1), the first dimension is kept fixed and the second dimension is changed.

Figure 11. When calculating the KL-divergence, only the mixture component closest to the data (darker shade) is considered. Other components (lighter shade) are ignored. This can be generalised to multiple dimensions and multiple components in each dimension.

Figure 12. Two generated examples using linear sampling in the latent space, (A) left 3 rows and (B) right 3 rows. The top row fixes all other dimensions and changes the first one, collar in (A) and skin colour in (B), from -0.5 to +1.5. The middle row fixes all other dimensions and changes the second one, hair colour in (A) and hairstyle in (B), from -0.5 to +1.5. The bottom row changes both the first and second dimensions from -0.5 to +1.5.

Figure 14. Partially guided GridVAE generation from latent attributes not provided during training. The left and right subfigures (each with 11 images) correspond to dimensions 20 and 31, respectively.

Figure 15. Semantic structure of the face recognition manifold obtained by jointly studying the attribute manifold and the facial recognition manifold.

Figure 16. The framework of adversarial purification for image-level adversarial attacks. (A) Jointly train the classifier with the ELBO loss. (B) Test-time adversarial purification with the ELBO loss. (C) Trajectories of clean (green) - attacked (red) - purified (blue) images in a 2D latent space. (D) Input images and reconstructed images of the samples in (C). The top two rows are the input and reconstruction of clean images, the middle two rows are the input and reconstruction of adversarial images, and the bottom two rows are the input and reconstruction of purified images. The text shows the predicted classes, with green colour for correct predictions and red colour for incorrect predictions. The red box on the right corresponds to the failure case (the purification process fails).

Figure 17. Class predictions from the VAE-classifier models on clean, adversarial, and purified samples for the CelebA gender attribute. The top two rows are the input and reconstruction of clean images, the middle two rows are the input and reconstruction of patch-adversarial images, and the bottom two rows are the input and reconstruction of purified images. The text represents the predicted classes, with green colour for correct predictions and red colour for incorrect predictions. Since predictions and reconstructions from the VAE classifier are correlated, our test-time defences are effective against adversarial attacks.

Table 1. Classification accuracy of the model on clean and adversarial (patch) examples.