Leveraging diffusion models for unsupervised out-of-distribution detection on image manifold

Out-of-distribution (OOD) detection is crucial for enhancing the reliability of machine learning models when confronted with data that differ from their training distribution. In the image domain, we hypothesize that images inhabit manifolds defined by latent properties such as color, position, and shape. Leveraging this intuition, we propose a novel approach to OOD detection using a diffusion model to discern images that deviate from the in-domain distribution. Our method involves training a diffusion model using in-domain images. At inference time, we lift an image from its original manifold using a masking process, and then apply a diffusion model to map it towards the in-domain manifold. We measure the distance between the original and mapped images, and identify those with a large distance as OOD. Our experiments encompass comprehensive evaluation across various datasets characterized by differences in color, semantics, and resolution. Our method demonstrates strong and consistent performance in detecting OOD images across the tested datasets, highlighting its effectiveness in handling images with diverse characteristics. Additionally, ablation studies confirm the significant contribution of each component in our framework to the overall performance.


1 Introduction
The goal of out-of-distribution (OOD) detection is to ascertain whether a given data point comes from a specific domain. This task is crucial given that machine learning models generally require that the distribution of test data mirrors the distribution of the training data. In cases where the test data deviate from the training distribution, the models can generate meaningless or deceptive results. This could be especially harmful for tasks in high-stakes areas like healthcare (Hamet and Tremblay, 2017) and criminal justice (Rigano, 2019).
The OOD detection task has been examined under settings with access to varying amounts of information. These settings can be categorized as supervised and unsupervised. Among supervised settings, the most informed scenario assumes that exemplar out-of-domain data are available. One can then incorporate them in the training of neural networks to enhance their ability to recognize out-of-domain inputs (Hendrycks et al., 2018; Ruff et al., 2019). Various methods excel at identifying out-of-domain data when they resemble the training examples, but their performance deteriorates on out-of-domain

FIGURE 1
The intuition behind LMD. In essence, LMD leverages a diffusion model as a mapping toward the in-domain manifold. It applies a mask to the image to lift it from its original manifold, and uses the diffusion model to guide it toward the in-domain manifold. If an image is in-domain, it would generally have a smaller distance between the original and mapped locations than out-of-domain images.

2 Materials and methods

2.1 Preliminaries
Problem formulation. Formally, we define the unsupervised out-of-distribution (OOD) task as follows: we aim to build a detector to identify data points $x$ that deviate from a distribution of interest $D$. The detector should be built using only unlabeled data $x_1, \dots, x_n$ sampled from $D$. It should assign an OOD score $s(x)$ that positively correlates with the likelihood of $x$ not belonging to $D$.
Diffusion models. In this section, we present a brief summary of the concepts behind the diffusion model (DM). It is a class of generative models that can learn complex distributions. It involves a forward process of diffusion and a backward process of denoising. Diffusion corrupts the original data with noise, while denoising, performed by a learned neural network, progressively reduces noise from the corrupted image. There are various formulations of diffusion models, such as score-based generative models (Song and Ermon, 2019) and stochastic differential equations (Song et al., 2020). A comprehensive review can be found in Yang et al. (2022).
LMD is agnostic to the different DM variants. Here, we describe one prominent variant: the Denoising Diffusion Probabilistic Model (DDPM) (Sohl-Dickstein et al., 2015; Ho et al., 2020). DDPM's diffusion process begins with a data sample $x_0$, and injects Gaussian noise at every subsequent step $t = 1, 2, \dots, T$ following Equation (1), where $\beta_t$ adheres to a predetermined variance schedule. The denoising process has a prior distribution $x_T \sim \mathcal{N}(0, I)$, and formulates the process following Equation (2), where both $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are parametrized by a neural network $\theta$.
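Equations (1) and (2) are not reproduced in this text. For reference, the standard DDPM formulation of Ho et al. (2020), which the description above appears to follow, can be written as below. The exact notation of the original equations is an assumption.

```latex
\begin{align}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \tag{1} \\
p_\theta(x_{t-1} \mid x_t) &= \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{2}
\end{align}
```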
2.2 Lift, Map, Detect

Lift, Map, Detect (LMD) is inspired by the observation that a diffusion model maps images toward the manifold it is trained on. Concretely, it leverages a diffusion model trained over unlabeled in-domain data. Given a test image, LMD applies corruption techniques to lift it from its original manifold, and utilizes the diffusion model to map it toward the in-domain manifold on which the model is trained. As depicted in Figure 1, if the image is indeed in-domain, the model can map it back to its manifold close to its original location. Conversely, if the image belongs to a different manifold, then the diffusion model would redirect it toward the in-domain manifold, moving it further away from its original location. Hence, out-of-domain images often have a larger distance between the original and mapped images than in-domain images, and LMD identifies images with a large distance as OOD. Figure 2 presents the general framework of LMD, and Algorithm 1 provides a succinct representation of the LMD algorithm. Subsequent sections explain each component of LMD in detail.

2.2.1 Lifting and mapping images
LMD lifts an image by masking parts of it, and maps it by inpainting over the masked area. For convenience, we also refer to the lifted and mapped images as masked and reconstructed images, respectively. Masking provides a straightforward way of controlling the extent to which an image is lifted, as a larger masked area generally corresponds to a larger deviation from the manifold. Furthermore, recent studies have shown that vanilla diffusion models can perform inpainting without the need for retraining, regardless of the size or shape of the masked regions. This highlights masking and inpainting as an intuitive strategy. Algorithm 2 describes the high-level process of inpainting with diffusion models. Additionally, we observe that an alternative way of lifting and mapping an image is to just add noise to it and then denoise with the diffusion model. We compare this instantiation with masking and inpainting in Table 4.
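The steps of Algorithm 2 are not reproduced in this text. As a rough illustration only, the sketch below follows a common mask-conditioned sampling scheme: at each reverse step, the masked region comes from the model and the known region comes from the original image diffused to the matching noise level. The linear schedule, the short `T`, and `toy_reverse_step` (a stand-in for a trained denoiser) are all assumptions, not the authors' implementation.

```python
import numpy as np

def make_schedule(T=50, beta_min=1e-4, beta_max=0.02):
    """Linear variance schedule (assumed); returns betas and cumulative alphas."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def toy_reverse_step(x_t, t, betas, rng):
    """Hypothetical placeholder for a learned reverse step p_theta(x_{t-1} | x_t).
    It only rescales x_t and adds noise; a trained diffusion model goes here."""
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return np.sqrt(1.0 - betas[t]) * x_t + np.sqrt(betas[t]) * noise

def inpaint(x_orig, mask, reverse_step, T=50, seed=0):
    """mask == 1 marks known pixels; mask == 0 marks the region to be inpainted."""
    rng = np.random.default_rng(seed)
    betas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(x_orig.shape)  # start from the Gaussian prior
    for t in reversed(range(T)):
        # Masked region: one reverse (denoising) step of the model.
        x_unknown = reverse_step(x, t, betas, rng)
        # Known region: the original image diffused to the same noise level.
        if t > 0:
            eps = rng.standard_normal(x_orig.shape)
            a = alpha_bars[t - 1]
            x_known = np.sqrt(a) * x_orig + np.sqrt(1.0 - a) * eps
        else:
            x_known = x_orig  # final step: known pixels are kept exactly
        x = mask * x_known + (1 - mask) * x_unknown
    return x
```

With a real model, `reverse_step` would wrap one sampling step of the trained network; the rest of the loop stays the same.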
Algorithm 2 input: original image $x_{orig}$, binary mask $M$ where 0 indicates the region to be inpainted, diffusion model $\theta$.

LMD operates based on the assumption that in-domain images have smaller reconstruction distance than out-of-domain images. In practice, the validity of this assumption depends on two factors. First of all, inpainting with a diffusion model is stochastic. This occasionally leads to unfaithful in-domain reconstructions or faithful out-of-domain reconstructions. Consequently, a single reconstruction distance provides a noisy signal for identifying OOD images. To mitigate the randomness, we perform multiple reconstructions for each image, and use the median reconstruction distance as the OOD score. Our experiments in Section 3.4.3 show that this can significantly improve the detection performance.
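The median aggregation described above can be sketched as follows. This is a minimal illustration, assuming the inpainting model and the distance metric are supplied as callables; `toy_inpaint` and `mse` are hypothetical stand-ins for the diffusion model and LPIPS.

```python
import numpy as np

def lmd_score(x, masks, inpaint_fn, distance_fn):
    """OOD score: median reconstruction distance over several masked reconstructions."""
    distances = []
    for mask in masks:
        x_rec = inpaint_fn(x, mask)              # lift by masking, map by inpainting
        distances.append(distance_fn(x, x_rec))  # distance to the original image
    return float(np.median(distances))

def toy_inpaint(x, mask):
    """Hypothetical inpainter: fills masked pixels (mask == 0) with the mean of known pixels."""
    fill = x[mask == 1].mean()
    return mask * x + (1 - mask) * fill

def mse(a, b):
    """Mean squared error, used here only as a simple stand-in metric."""
    return float(np.mean((a - b) ** 2))
```

Using the median rather than the mean makes the score less sensitive to an occasional unfaithful reconstruction.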
Another factor to consider is the amount of information removed from an image. In the extreme case where the whole image is masked out, the reconstruction would be a random image from the in-domain manifold. This could lead to large reconstruction distance for both in-domain and out-of-domain images, especially when the in-domain distribution is diverse. Conversely, if only one pixel is removed from an image, then both in-domain and out-of-domain reconstructions would be highly faithful. Therefore, a mask should ideally provide sufficient clues for the diffusion model to map a lifted in-domain image close to its original location, while creating enough space to produce dissimilar out-of-domain reconstructions.
In this regard, we propose to use the alternating checkerboard $N \times N$ mask (Figure 3). For simplicity, we assume that images are square-shaped with size $L \times L$; extension to rectangular-shaped images is straightforward. The checkerboard mask divides the image into an $N \times N$ grid of patches, where each patch has size $\frac{L}{N} \times \frac{L}{N}$. It masks out every other patch in a checkerboard-like fashion, covering 50% of an image in total. During multiple reconstructions, the masked and unmasked patches are flipped at each reconstruction attempt. This ensures that salient characteristics of an out-of-domain image are covered in some attempts. We default to $N = 8$. Experiments with different values of $N$ can be found in Table 2.
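A minimal sketch of the alternating checkerboard mask, assuming a square image whose side `L` is divisible by `N` and the convention that 0 marks patches to be inpainted. The function names are illustrative.

```python
import numpy as np

def checkerboard_mask(L, N, flip=False):
    """Binary L x L mask over an N x N grid of patches; 0 marks patches to inpaint."""
    if L % N != 0:
        raise ValueError("L must be divisible by N")
    patch = L // N
    rows, cols = np.indices((N, N))
    grid = (rows + cols) % 2  # alternate 0/1 across neighboring patches
    if flip:
        grid = 1 - grid       # swap masked and unmasked patches
    # Expand each grid cell into a patch x patch block of pixels.
    return np.kron(grid, np.ones((patch, patch), dtype=int))

def alternating_masks(L, N, num_reconstructions):
    """Flip the mask at every reconstruction attempt."""
    return [checkerboard_mask(L, N, flip=(i % 2 == 1))
            for i in range(num_reconstructions)]
```

For example, `alternating_masks(32, 8, 10)` yields ten 32 × 32 masks with 4 × 4 patches, each covering half of the image, and any two consecutive masks together cover every pixel.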

2.2.2 Measuring reconstruction distance
We use the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) metric to measure the distance between the original and reconstructed images. LPIPS utilizes calibrated intermediate activations of a pretrained neural network as features, and measures the normalized $\ell_2$ distance between the features of two images. This yields a value between 0 and 1, where a lower value indicates higher similarity. We employ the version with AlexNet (Krizhevsky et al., 2012) backbone pretrained on ImageNet. LPIPS has been observed to align with human perception of image similarity (Zhang et al., 2018), and has been applied in research on a wide range of tasks (Karras et al., 2019; Alaluf et al., 2021; Meng et al., 2021) and image modalities (Gong et al., 2021; Lugmayr et al., 2022; Toda et al., 2022). Experiments with alternative metric choices are reported in Table 3.

FIGURE 2
Overview of the LMD process. LMD utilizes a diffusion model trained over the in-domain manifold. It repeatedly lifts an image from its manifold by masking, and maps it toward the diffusion model's training manifold by inpainting. It measures the median distance between the original and the mapped images, and considers images with larger distance as out-of-domain.

FIGURE 3
The alternating checkerboard mask. We flip the masked and unmasked regions at each reconstruction attempt. The example in the figure is × .

3.1 Experiment settings
We benchmark LMD against existing unsupervised OOD detection methods on widely used datasets. We provide fine-grained analysis and visualizations of the reconstructed images to better understand LMD's performance. Additionally, we perform ablation studies to analyze the individual components of LMD.

3.1.1 Baselines
We compare LMD with seven existing baselines, covering three mainstream classes of methods: likelihood-based, reconstruction-based and feature-based. For likelihood-based methods, we consider Likelihood (Likelihood) (Bishop, 1994), Input Complexity (IC) (Serrà et al., 2019) and Likelihood Regret (LR) (Xiao et al., 2020). We obtain the likelihood from the diffusion model using Song et al. (2020)'s approach. We adapt the official GitHub repository of Likelihood Regret for both Likelihood Regret and Input Complexity. For Input Complexity, we leverage the likelihood from the diffusion model to ensure fairness in comparison; we have experimented with both the PNG compressor and the JPEG compressor, and we report the results from the PNG compressor due to its superior performance. (Pretrained) (Xiao et al., 2021). We use our own implementation, as we did not find any existing implementation despite our best efforts.

3.1.2 Evaluation
We evaluate the performance of LMD and the baselines using the area under the Receiver Operating Characteristic curve (ROC-AUC), following the practice of existing works (Hendrycks and Gimpel, 2016; Ren et al., 2019; Xiao et al., 2021). OOD detection methods commonly produce numeric OOD scores, and apply a decision threshold to classify data as in-domain or out-of-domain. The ROC curve plots the true positive rate against the false positive rate at various decision thresholds, and ROC-AUC measures the area under the curve. ROC-AUC ranges between 0 and 1, with higher values indicating better performance. A detector achieves ROC-AUC > 0.5 when it in general assigns higher OOD scores to out-of-domain images than to in-domain images. Conversely, it yields ROC-AUC < 0.5 when it in general assigns higher OOD scores to in-domain images.
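To make the interpretation above concrete, ROC-AUC equals the probability that a randomly chosen out-of-domain image receives a higher OOD score than a randomly chosen in-domain image, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name is illustrative, and a library routine would normally be used for large datasets.

```python
import numpy as np

def roc_auc(in_scores, out_scores):
    """ROC-AUC via pairwise comparison of out-of-domain vs. in-domain OOD scores."""
    in_scores = np.asarray(in_scores, dtype=float)
    out_scores = np.asarray(out_scores, dtype=float)
    # diff[i, j] > 0 when out-of-domain sample i scores above in-domain sample j.
    diff = out_scores[:, None] - in_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```

A detector that always ranks out-of-domain images above in-domain ones gives 1.0, a detector with the ranking reversed gives 0.0, and identical score distributions give about 0.5.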

3.1.3 Implementation details of LMD
We build LMD on top of Song et al. (2020)'s implementation. For datasets in Table 3, we use DDPM++ models with SubVP SDE. We take Song et al. (2020)'s pretrained CIFAR10 checkpoint, and train from scratch for the other datasets. We use the alternating checkerboard 8 × 8 mask (Figure 3), the LPIPS reconstruction distance metric and 10 reconstructions per image for LMD.
For the higher resolution datasets, we use NCSN++ models with VE SDE. We take Song et al. (2020)'s pretrained FFHQ (Karras et al., 2019) checkpoint for CelebA-HQ vs. ImageNet. This is to avoid model memorization concerns given that the CelebA-HQ checkpoint is pretrained over the whole dataset. We use Song et al. (2020)'s pretrained LSUN bedroom checkpoint for LSUN bedroom vs. LSUN classroom. For these datasets, we consider a checkerboard 4 × 4 mask, a checkerboard 8 × 8 mask and a square-centered mask, with one reconstruction per image. We additionally report the ROC-AUC from our default configuration of alternating 8 × 8 checkerboard and 10 reconstructions per image as a reference. We use LPIPS as the distance metric.

3.2 Quantitative results and analysis
We present the OOD detection performance of LMD and the baselines on 12 dataset pairs in Table 1. LMD attains the highest ROC-AUC on five pairs, while demonstrating consistent and strong performance on others. Specifically, on CIFAR100 vs. SVHN, it attains 10% higher ROC-AUC than the best baseline performance. LMD also attains the highest average ROC-AUC of 0.907, which is 9% higher than the best average performance among the baselines. We visualize examples of the in-domain and out-of-domain reconstructions of LMD in Figure 4. In general, in-domain reconstructions resemble their original images, while out-of-domain reconstructions are fragmented and noisy.
We further conduct fine-grained analysis to understand LMD's performance. We observe that each dataset in Table 1 consists of images from multiple distinct semantic categories, forming a diverse data distribution. For example, CIFAR10 comprises 10 different objects or animals, and SVHN comprises 10 digits. We seek to understand whether LMD performs similarly across different semantic categories, or if certain categories are more challenging for LMD than the others. Specifically, we group the images by their ground truth classes, and examine the distinguishability of the OOD scores for each pair of classes of the in-domain vs. out-of-domain datasets. We present the results for CIFAR10 vs. SVHN and SVHN vs. CIFAR10 in Figure 5. On CIFAR10 vs. SVHN, all pairs of classes are highly distinguishable, with ROC-AUC ranging from 0.97 to 1. This is unsurprising given that LMD attains strong performance of ROC-AUC 0.992 on this pair. On SVHN vs. CIFAR10, pairwise performance shows visible variation, with ROC-AUC ranging from 0.84 to 0.97. Specifically, the ROC-AUC is relatively low when the in-domain class is "3" or "5," and when the out-of-domain class is "deer" or "frog." This suggests that LMD's satisfactory but suboptimal performance on SVHN vs. CIFAR10 is primarily attributable to the relative difficulty in distinguishing between some of the semantic categories.

3.3 Qualitative studies on higher resolution images
We show qualitative results on images with resolution 256 × 256 for two in-domain/out-of-domain pairs: CelebA-HQ vs. ImageNet (Figure 6) and LSUN bedroom vs. LSUN classroom (Figure 7). The ROC-AUCs in the images correspond to LMD's performance with only one reconstruction attempt. As a reference, under our default configuration of alternating checkerboard 8 × 8 mask and 10 reconstruction attempts, CelebA-HQ vs. ImageNet has a ROC-AUC of 0.993, and LSUN bedroom vs. LSUN classroom has a ROC-AUC of 0.927.
For CelebA-HQ vs. ImageNet, LMD performs competitively under all three mask choices, and achieves ROC-AUC ranging from 0.991 to 1 even without multiple reconstructions. Given the highly structured nature of human faces, the in-domain reconstructions under all three masks are accurate. For the out-of-domain images, reconstructions under the checkerboard masks contain local distortions, while reconstructions under the center mask tend to hallucinate faces. As a result, in this case, the in-domain and out-of-domain reconstructions become more discernible when employing larger patches in masking.
For LSUN bedroom vs. LSUN classroom, the checkerboard 8 × 8 mask attains strong results, while the checkerboard 4 × 4 mask and the center-squared mask demonstrate suboptimal performance. This is because bedroom images exhibit greater variation and contain more intricate details. Consequently, when large patches are masked, the diffusion model may fill in plausible yet different content, resulting in significant reconstruction discrepancies for in-domain images. In fact, even with the checkerboard 8 × 8 mask, the diffusion model may hallucinate or alter elements in the bedroom inpaintings. Moreover, the complex and diverse nature

Results from these two dataset pairs collectively demonstrate that LMD could scale to higher resolution images with richer details. They also highlight the checkerboard 8 × 8 mask as a versatile default choice, as it is effective for both structured and diverse in-domain distributions. For further discussions on mask choices, please refer to Section 3.4.1.

3.4 Ablation studies

3.4.1 Mask choice
Table 2 presents the performance of LMD under alternative mask choices. Besides our default mask, we consider alternating checkerboard 4 × 4, alternating checkerboard 16 × 16, a fixed 8 × 8 checkerboard for which we do not perform the flipping operation, a square-centered mask, and a random patch mask following Xie et al. (2022) (https://github.com/microsoft/SimMIM). Figure 8 visualizes the mask patterns. We experiment on three dataset pairs: CIFAR10 vs. CIFAR100, CIFAR10 vs. SVHN and MNIST vs. KMNIST. For all the mask choices, we perform 10 reconstructions per image and use LPIPS as the reconstruction distance metric.
Our default mask choice of alternating checkerboard 8 × 8 shows consistent and strong performance. The alternating checkerboard 16 × 16 mask, the fixed checkerboard 8 × 8 mask and the random patch mask are competitive but underperform the default choice. Nevertheless, the alternating checkerboard mask is recommended over the fixed checkerboard mask or the random patch mask, as it ensures that all parts of the image are covered in some of the reconstruction attempts. Alternating checkerboard 4 × 4 and square-centered masks show suboptimal performance on MNIST vs. KMNIST. This is because they mask out too much information from the images, and therefore lead to unfaithful reconstructions for both in-domain and out-of-domain images.

3.4.2 Reconstruction distance metric
We study the effect of using alternative metrics for measuring the reconstruction distance. We consider two popular metrics, Mean Squared Error (MSE) and Structural Similarity Index Measure (SSIM) (Wang et al., 2003), both of which have been widely used for image comparison (Zhang et al., 2019; Bhat et al., 2021; Saharia et al., 2022). We further observe that Xiao et al. (2021) uses features from a ResNet-50 pretrained with SimCLRv2 (Chen et al., 2020) on ImageNet, and achieves superior performance on CIFAR10 vs. CIFAR100. Thus, we also consider a SimCLRv2-based metric, in which we calculate the cosine distance between the SimCLRv2 features of the original and reconstructed images.
We present the performance of LMD under different distance metrics in Table 3. MSE and SSIM demonstrate poor performance when SVHN is the out-of-domain dataset. Our default choice LPIPS demonstrates strong and consistent performance, and attains the highest average ROC-AUC. SimCLRv2 is competitive but underperforms LPIPS. This suggests that deep feature based metrics are in general effective, and LPIPS is suitable as a default choice.
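Two of the alternative distances are simple enough to sketch directly. The example below assumes images are arrays of equal shape and that the SimCLRv2 features have already been extracted into vectors; the feature extractor itself is not shown, and the function names are illustrative.

```python
import numpy as np

def mse_distance(x, x_rec):
    """Mean squared error between the original and reconstructed images."""
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return float(np.mean((x - x_rec) ** 2))

def cosine_distance(f, f_rec, eps=1e-12):
    """One minus cosine similarity between two precomputed feature vectors."""
    f = np.ravel(np.asarray(f, dtype=float))
    f_rec = np.ravel(np.asarray(f_rec, dtype=float))
    # eps guards against division by zero for all-zero feature vectors.
    cos = np.dot(f, f_rec) / (np.linalg.norm(f) * np.linalg.norm(f_rec) + eps)
    return float(1.0 - cos)
```

MSE compares raw pixels, so it is sensitive to small shifts and textures, whereas the cosine distance compares directions in a learned feature space.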

FIGURE 6
Examples of image reconstruction from CelebA-HQ (in-domain) and ImageNet (out-of-domain). For out-of-domain reconstructions, the checkerboard masks result in local inconsistencies, while the center mask hallucinates faces. In this case, employing larger masked patches slightly improves the performance.

3.4.3 Number of reconstructions per image
We examine LMD's performance under different numbers of reconstructions per image. Figure 9 plots the ROC-AUC against the number of reconstructions per image for MNIST vs. KMNIST and KMNIST vs. MNIST. LMD's performance improves as the number of reconstructions increases, regardless of the choice of distance metric. The improvement is especially obvious for the first 5 attempts, and gradually plateaus as the number of attempts approaches 10. This suggests that it is generally sufficient to perform 10 attempts per image.

3.4.4 Alternative instantiation of lifting and mapping
We observe that another intuitive way of lifting and mapping images with a diffusion model is to lift by diffusing to an intermediate step t in the noise schedule, and then map by denoising back to the image distribution. We refer to this alternative instantiation as diffusion/denoising, and compare it with our default instantiation of masking/inpainting. Given that the image distribution is at t = 0 and the noise distribution is at t = T, the larger the t we diffuse to, the further away we lift an image from the manifold. We consider different lifting distances with t = 250, t = 500, and t = 750, where the full schedule has T = 1000. We use our default alternating checkerboard 8 × 8 mask for masking/inpainting. We use 10 reconstructions per image and the LPIPS metric for both diffusion/denoising and masking/inpainting.
We present the performance in Table 4. Diffusion/denoising with t = 250 and t = 750 demonstrates suboptimal performance on several pairs, indicating that the lifting distance is too small or too large for the in-domain and out-of-domain images to be distinguishable. Diffusion/denoising with t = 500 is competitive but underperforms masking/inpainting. This suggests that while LMD is robust to alternative choices of lifting and mapping, masking/inpainting is the recommended instantiation.
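The lifting half of diffusion/denoising can be illustrated with the DDPM closed form, which jumps directly from a clean image to step t. This is a sketch under assumptions: the linear variance schedule and its endpoints are illustrative, and the models in the experiments use SDE formulations rather than this discrete form.

```python
import numpy as np

def lift_by_diffusion(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Diffuse x0 to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    Returns the noised image and abar_t, the fraction of signal variance kept."""
    if not 1 <= t <= T:
        raise ValueError("t must be in [1, T]")
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, T)  # assumed linear schedule
    alpha_bar = float(np.cumprod(1.0 - betas)[t - 1])
    x0 = np.asarray(x0, dtype=float)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, alpha_bar
```

Because `alpha_bar` shrinks monotonically with t, a larger t keeps less of the original image, which corresponds to lifting it further from its manifold.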

3.4.5 Alternative choices for the inpainting model
We perform qualitative evaluation on using other classes of inpainting models in the LMD framework. We consider Masked Autoencoder (MAE) (He et al., 2022) trained on CIFAR10, and LaMa (Suvorov et al., 2022), a GAN-based inpainting model, trained on CelebA-HQ. We perform one reconstruction per image, as both MAE and LaMa are deterministic. Both models demonstrate lower performance than the diffusion model in various scenarios. Figure 10 shows LaMa's performance on CelebA-HQ vs. ImageNet. LaMa attains reasonable results, but it underperforms diffusion models. LaMa hallucinates faces with the center mask, but unlike the diffusion model, the color and texture of the hallucinated faces are very consistent with the surroundings.

FIGURE 8
Masks used in the mask ablation. The random patch mask in the figure is just one example; a different pattern is sampled each time.

TABLE 2 (caption, continued)
The alternating checkerboard 8 × 8 shows strong and consistent results. The bold values mean the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.
Figure 11 shows MAE's performance on CIFAR10 vs. SVHN. Both in-domain and out-of-domain reconstructions are accurate when the individual masked patch sizes are small, while both deviate from the originals when the patch sizes are large. Performance-wise, inpainting with MAE only attains ROC-AUC 0.065 for the checkerboard 8 × 8 mask, 0.178 for the checkerboard 4 × 4 mask and 0.403 for the center mask.
The suboptimal performance of alternative inpainting models can be attributed to their ability to leverage various sources of information: not only their understanding of the training distribution, but also the color or texture of unmasked parts of an image. Models like LaMa and MAE employ specialized loss functions and large masked ratios during training, and thus excel at inferring missing regions from known ones regardless of semantics. Consequently, these models are more prone to producing reasonable out-of-domain inpaintings, especially with simpler out-of-domain images. In contrast, a vanilla diffusion model is not specifically trained for inferring missing regions from the surroundings. It primarily relies on its understanding of the training distribution to perform inpainting, and thus attains robust performance.

4 Discussion

4.1 LMD's relationship with existing works
In the unsupervised setting, existing works generally follow one of the three paradigms: likelihood-based, reconstruction-based and feature-based. LMD is a reconstruction-based approach. Typically, reconstruction-based methods involve training a model using in-domain samples, and assessing the reconstruction quality of a test data point under the model. Prior works commonly use autoencoders (Sakurada and Yairi, 2014; Xia et al., 2015; Zhou and Paffenroth, 2017; Zong et al., 2018) or GANs (Schlegl et al., 2017; Li et al., 2018). One concurrent work (Graham et al., 2022) utilizes diffusion models, and considers image reconstructions under varying numbers of diffusion and denoising steps. This contrasts with LMD, which repeatedly performs masking and inpainting with a fixed number of steps. These two approaches are orthogonal and complementary.
The likelihood-based paradigm has been extensively explored, with early contributions dating back to Bishop (1994). The core idea is to approximate the in-domain distribution with a generative model that has likelihood computation capability (Salimans et al., 2017; Kingma and Dhariwal, 2018). Intuitively, the model should assign higher likelihood to in-domain data than out-of-domain data, but various studies have observed that such an assumption often does not hold (Choi et al., 2018; Nalisnick et al., 2018; Kirichenko et al., 2020). One line of work addresses this issue under a typicality test framework (Ren et al., 2019; Serrà et al., 2019; Xiao et al., 2020). Essentially, they view likelihood as a model statistic rather than a literal measure of how likely a data point is in-domain. They examine the extent to which the model statistic of a test data point deviates from the typical distribution of model statistics for in-domain data. Notably, this is complementary to LMD, as the reconstruction distance can also be viewed as a model statistic. Other likelihood-based approaches include adjusting the likelihood by background likelihood (Ren et al., 2019), image complexity (Serrà et al., 2019) or the likelihood under optimal model configurations (Xiao et al., 2020), or improving the generative model architectures (Maaløe et al., 2019; Kirichenko et al., 2020).

4.2 Limitation and future work
One limitation of LMD is the speed. Vanilla diffusion models have a time-consuming denoising process that involves a large number of sampling steps. Therefore, similar to other diffusion-based approaches for various tasks (Meng et al., 2021; Lugmayr et al., 2022; Saharia et al., 2022), LMD is currently not well-suited for real-time OOD detection. Several recent works have proposed methods to accelerate the sampling process of pre-trained diffusion models through noise rescaling (Nichol and Dhariwal, 2021), sampler optimization (Watson et al., 2022), or numerical methods (Liu et al., 2022; Wizadwongsa and Suwajanakorn, 2023). One future direction is to harness these methods to expedite LMD's detection.
Another potential extension is to utilize more advanced methods for aggregating reconstruction distances from multiple reconstructions, or even under different masks or distance metrics. As briefly discussed in Section 4.1, this can involve integrating typicality test approaches such as multiple hypothesis testing or learning density models (Nalisnick et al., 2019; Morningstar et al., 2021; Bergamin et al., 2022).

5 Conclusion
We propose a novel method, Lift, Map, Detect (LMD), for unsupervised out-of-distribution detection. LMD leverages the diffusion model's strong ability in mapping images onto its training manifold, and detects images with large distance between the original and mapped images as OOD. Our extensive experiments and analysis show that LMD achieves strong performance for various image distributions with different characteristics. Some future directions of improvement include accelerating LMD and leveraging advanced aggregation for reconstruction distance.

FIGURE 4
Example reconstructions from three pairs of datasets. "Orig." is the original image and "Inp." is the inpainted image. Generally, the in-domain reconstructions are faithful while the out-of-domain reconstructions are noisy and dissimilar.

FIGURE 7
Reconstruction examples from LSUN bedroom (in-domain) and LSUN classroom (out-of-domain). As bedroom images are diverse and contain richer details, a mask with smaller patches is preferable.

FIGURE 9
ROC-AUC vs. the number of reconstruction attempts. More reconstruction attempts enhance the OOD detection performance, irrespective of the distance metric. (A) MNIST vs. KMNIST. (B) KMNIST vs. MNIST.

FIGURE 10
Reconstruction examples from CelebA-HQ (in-domain) and ImageNet (out-of-domain) using LaMa, a GAN-based inpainting model. Unlike the diffusion model, LaMa produces fewer visible artifacts. Even though it also introduces face-like artifacts with the center mask, the faces have the colors and textures of the surrounding unmasked regions.

FIGURE 11
Reconstruction examples from CIFAR10 (in-domain) and SVHN (out-of-domain) using MAE. Differentiating between in-domain and out-of-domain inpaintings is hard, because reconstructing SVHN from only known regions is relatively simple, and because MAE is trained to have a strong capability of inference from known regions.

TABLE 1
ROC-AUC of LMD and the baselines. Higher value is better. We use the default configuration of alternating checkerboard 8 × 8, LPIPS metric and 10 reconstructions per image for all experiments. LMD consistently demonstrates strong performance and attains the highest average ROC-AUC. The bold values mean the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.

TABLE 2
Performance of ROC-AUC on three dataset pairs with different mask types.

TABLE 3
ROC-AUC performance under different reconstruction distance metrics. LPIPS demonstrates consistent and robust results, while other metrics exhibit performance fluctuations. The bold values mean the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.
Frontiers in Artificial Intelligence (frontiersin.org)

TABLE 4
ROC-AUC performance of using diffusion/denoising vs. masking/inpainting. Diffusion/denoising with t = 500 achieves reasonable performance but underperforms masking/inpainting. The bold values mean the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.