<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2024.1255566</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Leveraging diffusion models for unsupervised out-of-distribution detection on image manifold</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" equal-contrib="yes">
<name><surname>Liu</surname> <given-names>Zhenzhen</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2238103/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes" equal-contrib="yes">
<name><surname>Zhou</surname> <given-names>Jin Peng</given-names></name>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2373209/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Weinberger</surname> <given-names>Kilian Q.</given-names></name>
</contrib>
</contrib-group>
<aff><institution>Department of Computer Science, Cornell University</institution>, <addr-line>Ithaca, NY</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yunye Gong, SRI International, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Pavan Turaga, Arizona State University, United States</p>
<p>Ankita Shukla, University of Nevada, Reno, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Zhenzhen Liu <email>zl535&#x00040;cornell.edu</email></corresp>
<corresp id="c002">Jin Peng Zhou <email>jz563&#x00040;cornell.edu</email></corresp>
<fn fn-type="equal" id="fn001"><p>&#x02020;These authors have contributed equally to this work and share first authorship</p></fn></author-notes>
<pub-date pub-type="epub">
<day>09</day>
<month>05</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>7</volume>
<elocation-id>1255566</elocation-id>
<history>
<date date-type="received">
<day>09</day>
<month>07</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>03</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2024 Liu, Zhou and Weinberger.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Liu, Zhou and Weinberger</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Out-of-distribution (OOD) detection is crucial for enhancing the reliability of machine learning models when confronted with data that differ from their training distribution. In the image domain, we hypothesize that images inhabit manifolds defined by latent properties such as color, position, and shape. Leveraging this intuition, we propose a novel approach to OOD detection using a diffusion model to discern images that deviate from the in-domain distribution. Our method involves training a diffusion model using in-domain images. At inference time, we lift an image from its original manifold using a masking process, and then apply a diffusion model to map it towards the in-domain manifold. We measure the distance between the original and mapped images, and identify those with a large distance as OOD. Our experiments encompass comprehensive evaluation across various datasets characterized by differences in color, semantics, and resolution. Our method demonstrates strong and consistent performance in detecting OOD images across the tested datasets, highlighting its effectiveness in handling images with diverse characteristics. Additionally, ablation studies confirm the significant contribution of each component in our framework to the overall performance.</p></abstract>
<kwd-group>
<kwd>out-of-distribution detection</kwd>
<kwd>diffusion models</kwd>
<kwd>score-based models</kwd>
<kwd>generative modeling</kwd>
<kwd>manifold learning</kwd>
</kwd-group>
<counts>
<fig-count count="11"/>
<table-count count="4"/>
<equation-count count="2"/>
<ref-count count="69"/>
<page-count count="14"/>
<word-count count="8426"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Machine Learning and Artificial Intelligence</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>The goal of out-of-distribution (OOD) detection is to ascertain whether a given data point comes from a specific domain. This task is crucial given that machine learning models generally require that the distribution of test data mirrors the distribution of the training data. In cases where the test data deviates from the training distribution, the models can generate meaningless or deceptive results. This could be especially harmful for tasks in high-stakes areas like healthcare (Hamet and Tremblay, <xref ref-type="bibr" rid="B13">2017</xref>) and criminal justice (Rigano, <xref ref-type="bibr" rid="B40">2019</xref>).</p>
<p>The OOD detection task has been examined under settings with access to varying amounts of information. These settings can be categorized as supervised and unsupervised. Among supervised settings, the most informed scenario assumes that exemplar out-of-domain data are available. One can then incorporate them in the training of neural networks to enhance their ability to recognize out-of-domain inputs (Hendrycks et al., <xref ref-type="bibr" rid="B16">2018</xref>; Ruff et al., <xref ref-type="bibr" rid="B41">2019</xref>). Various methods excel at identifying out-of-domain data that resemble the training examples, but their performance deteriorates on out-of-domain inputs that are not represented in the training process. In practical applications, inputs are often highly diverse, and it is challenging to construct a truly representative set of out-of-domain examples. A more feasible setting is to only leverage in-domain classifiers or class labels (Hendrycks and Gimpel, <xref ref-type="bibr" rid="B15">2016</xref>; Liang et al., <xref ref-type="bibr" rid="B29">2017</xref>; Lee et al., <xref ref-type="bibr" rid="B27">2018</xref>; Huang et al., <xref ref-type="bibr" rid="B19">2021</xref>; Wang et al., <xref ref-type="bibr" rid="B55">2022</xref>). Although this setting is less restrictive, it still requires two essential conditions: a well-defined categorization of the in-domain data and an adequate amount of labeled data. These conditions do not hold for many tasks. In contrast, the fully unsupervised setting only requires access to unlabeled in-domain data, which can often be obtained at low cost and in abundant quantities. As a result, it is ideal to develop OOD detectors under the fully unsupervised setting.</p>
<p>Recently, diffusion models (DMs), a class of generative models, have received increasing attention in the machine learning community (Ho et al., <xref ref-type="bibr" rid="B18">2020</xref>; Song et al., <xref ref-type="bibr" rid="B51">2020</xref>). DMs operate via two procedures: the forward operation performs iterative noise addition to an image&#x00027;s pixels and transforms it into a sample drawn from a noise distribution. The backward operation&#x02014;performed by a dedicated neural network&#x02014;gradually removes noise from the image, guiding a noisy sample toward a specific image manifold.</p>
<p>In this paper, we show that we can leverage DMs as a mapping to a manifold, and use it for unsupervised OOD detection. Conceptually, if an image is lifted from its manifold, a diffusion model trained over the same manifold can guide it back to its original manifold. However, if the diffusion model has been trained on a different manifold, it would lead the lifted image toward its own training manifold, resulting in a substantial distance between the original and the mapped images. Therefore, we can identify out-of-domain images based on this distance.</p>
<p>To this end, we introduce an innovative unsupervised method for out-of-distribution detection, <bold>L</bold>ift, <bold>M</bold>ap, <bold>D</bold>etect (LMD), that embodies the aforementioned concept. <bold>Lifting</bold> is performed through image corruption. For instance, a face image that has been masked in the center will no longer fit into the face image category. Previous research by Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>) and Lugmayr et al. (<xref ref-type="bibr" rid="B31">2022</xref>) has demonstrated that the diffusion model can perform inpainting, i.e., restoring missing areas in an image with visually convincing content, without the need for additional training. This allows us to <bold>map</bold> the lifted image via inpainting with an in-domain diffusion model. We can employ a conventional image similarity metric to calculate the distance between the original and mapped images, and <bold>detect</bold> an out-of-domain image when there is a significant distance. In <xref ref-type="fig" rid="F1">Figure 1</xref>, we provide an example: A diffusion model trained with face images maps a lifted in-domain face image closer to its original location, while moving a lifted fire hydrant image, which is out-of-domain, further away.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The intuition behind LMD. In essence, LMD leverages a diffusion model as a mapping toward the in-domain manifold. It applies a mask to the image to lift it from its original manifold, and uses the diffusion model to guide it toward the in-domain manifold. If an image is in-domain, it would generally have smaller distance between the original and mapped locations than out-of-domain images.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0001.tif"/>
</fig>
<p>Our main contributions include: (1) We propose an innovative unsupervised OOD detection technique, <italic>Lift, Map, Detect</italic> (LMD), that utilizes the inherent manifold mapping capacity of diffusion models, and incorporates design choices that enhance the distinguishability between in-domain and out-of-domain data. (2) We conduct extensive experiments on various image datasets with different characteristics to illustrate the versatility of LMD. (3) We present in-depth analysis, visualizations and ablations to confirm LMD&#x00027;s underlying hypothesis and provide insights into LMD&#x00027;s behaviors.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2 Materials and methods</title>
<sec>
<title>2.1 Preliminaries</title>
<p><bold>Problem formulation</bold>. Formally, we define the unsupervised out-of-distribution (OOD) detection task as follows: We aim to build a detector to identify data points <bold>x</bold> that deviate from a distribution of interest <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>. The detector should be built using only unlabeled data <bold>x<sub>1</sub></bold>, &#x022EF;&#x000A0;, <bold>x<sub>n</sub></bold> sampled from <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>. It should assign an OOD score <italic>s</italic>(<bold>x</bold>) that positively correlates with the likelihood of <bold>x</bold> <italic>not</italic> belonging to <inline-formula><mml:math id="M3"><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:math></inline-formula>.</p>
<p><bold>Diffusion models</bold>. In this section, we present a brief summary of the concepts behind the diffusion model (DM). It is a class of generative models that can learn complex distributions. It involves a forward process of diffusion and a backward process of denoising. Diffusion corrupts the original data with noise, while denoising&#x02014;performed by a learned neural network&#x02014;progressively reduces noise from the corrupted image. There are various formulations of diffusion models, such as score-based generative models (Song and Ermon, <xref ref-type="bibr" rid="B50">2019</xref>) and stochastic differential equations (Song et al., <xref ref-type="bibr" rid="B51">2020</xref>). A comprehensive review can be found in Yang et al. (<xref ref-type="bibr" rid="B64">2022</xref>).</p>
<p>LMD is agnostic to the different DM variants. Here, we describe one prominent variant: the Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., <xref ref-type="bibr" rid="B49">2015</xref>; Ho et al., <xref ref-type="bibr" rid="B18">2020</xref>). DDPM&#x00027;s diffusion process begins with a data sample <italic>x</italic><sub>0</sub>, and injects Gaussian noise at every subsequent step <italic>t</italic> &#x0003D; 1, 2, &#x022EF;&#x000A0;, <italic>T</italic> following <xref ref-type="disp-formula" rid="E1">Equation (1)</xref></p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msqrt><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mstyle mathvariant="bold"><mml:mtext>I</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B2;<sub><italic>t</italic></sub> adheres to a predetermined variance schedule. The denoising process has a prior distribution <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="script">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>I</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and formulates the process following <xref ref-type="disp-formula" rid="E2">Equation (2)</xref></p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi mathvariant="script">N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where both &#x003BC;<sub>&#x003B8;</sub>(<italic>x</italic><sub><italic>t</italic></sub>, <italic>t</italic>) and &#x003A3;<sub>&#x003B8;</sub>(<italic>x</italic><sub><italic>t</italic></sub>, <italic>t</italic>) are parametrized by a neural network &#x003B8;.</p>
</sec>
<sec>
<title>2.2 Lift, Map, Detect</title>
<p>Lift, Map, Detect (LMD) is inspired by the observation that a diffusion model maps images toward the manifold it is trained on. Concretely, it leverages a diffusion model trained over unlabeled in-domain data. Given a test image, LMD applies corruption techniques to lift it from its original manifold, and utilizes the diffusion model to map it toward the in-domain manifold on which the model is trained. As depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>, if the image is indeed in-domain, the model can map it back to its manifold close to its original location. Conversely, if the image belongs to a different manifold, the diffusion model would redirect it toward the in-domain manifold, moving it further away from its original location. Hence, out-of-domain images often have a larger distance between the original and mapped images than in-domain images, and LMD identifies images with large distance as OOD. <xref ref-type="fig" rid="F2">Figure 2</xref> presents the general framework of LMD, and <xref ref-type="fig" rid="F12">Algorithm 1</xref> provides a succinct representation of the LMD algorithm. Subsequent sections explain each component of LMD in detail.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Overview of the LMD process. LMD utilizes a diffusion model trained over the in-domain manifold. It repeatedly lifts an image from its manifold by masking, and maps it toward the diffusion model&#x00027;s training manifold by inpainting. It measures the median distance between the original and the mapped images, and considers images with larger distance as out-of-domain.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0002.tif"/>
</fig>
<fig id="F12" position="float">
<label>Algorithm 1</label>
<caption><p>Lift, Map, Detect (LMD).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0012.tif"/>
</fig>
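The scoring loop of LMD can be sketched as below. Here <monospace>inpaint</monospace> and <monospace>distance</monospace> are placeholders standing in for the diffusion-model inpainter and the image similarity metric described in the following sections; they are not the paper's actual implementation.

```python
import numpy as np

def lmd_score(x, inpaint, distance, masks, n_reconstructions=10):
    """OOD score in the spirit of LMD: the median distance between an
    image and its reconstructions.

    inpaint(x, mask)  -- stand-in for diffusion-model inpainting
    distance(a, b)    -- stand-in for an image similarity metric
    masks             -- masks cycled across reconstruction attempts
    """
    dists = []
    for i in range(n_reconstructions):
        mask = masks[i % len(masks)]      # alternate masks across attempts
        x_rec = inpaint(x, mask)          # lift (mask) then map (inpaint)
        dists.append(distance(x, x_rec))
    # Median is robust to the stochasticity of individual reconstructions.
    return float(np.median(dists))
```

Images whose score exceeds a chosen decision threshold are flagged as out-of-domain.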
<sec>
<title>2.2.1 Lifting and mapping images</title>
<p>LMD lifts an image by masking parts of it, and maps it by inpainting over the masked area. For convenience, we also refer to the lifted and mapped images as masked and reconstructed images, respectively. Masking provides a straightforward way of controlling the extent to which an image is lifted, as a larger masked area generally corresponds to a larger deviation from the manifold. Furthermore, recent studies have shown that vanilla diffusion models can perform inpainting without the need for retraining, regardless of the size or shape of the masked regions. This highlights masking and inpainting as an intuitive strategy. <xref ref-type="fig" rid="F13">Algorithm 2</xref> describes the high-level process of inpainting with diffusion models. Additionally, we observe that an alternative way of lifting and mapping an image is to simply add noise to it and then denoise with the diffusion model. We compare this instantiation with masking and inpainting in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<fig id="F13" position="float">
<label>Algorithm 2</label>
<caption><p>Inpaint.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0013.tif"/>
</fig>
<p>LMD operates based on the assumption that in-domain images have smaller reconstruction distance than out-of-domain images. In practice, the validity of this assumption depends on two factors. First of all, inpainting with a diffusion model is stochastic. This occasionally leads to unfaithful in-domain reconstructions or faithful out-of-domain reconstructions. Consequently, a single reconstruction distance provides a noisy signal for identifying OOD images. To mitigate the randomness, we perform <bold>multiple reconstructions</bold> for each image, and use the median reconstruction distance as the OOD score. Our experiments in Section 3.4.3 show that this can significantly improve the detection performance.</p>
<p>Another factor to consider is the amount of information removed from an image. In the extreme case where the whole image is masked out, the reconstruction would be a random image from the in-domain manifold. This could lead to large reconstruction distance for both in-domain and out-of-domain images, especially when the in-domain distribution is diverse. Conversely, if only one pixel is removed from an image, then both in-domain and out-of-domain reconstructions would be highly faithful. Therefore, a mask should ideally provide sufficient clues for the diffusion model to map a lifted in-domain image close to its original location, while creating enough space to produce dissimilar out-of-domain reconstructions.</p>
<p>In this regard, we propose to use the <bold>alternating checkerboard</bold> <italic><bold>N</bold></italic>&#x000D7;<italic><bold>N</bold></italic> mask (<xref ref-type="fig" rid="F3">Figure 3</xref>). For simplicity, we assume that images are square-shaped with size <italic>L</italic>&#x000D7;<italic>L</italic>; extension to rectangular-shaped images is straightforward. The checkerboard mask divides the image into an <italic>N</italic>&#x000D7;<italic>N</italic> grid of patches, where each patch has size <inline-formula><mml:math id="M9"><mml:mfrac><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>. It masks out every other patch in a checkerboard-like fashion, covering 50% of an image in total. During multiple reconstructions, the masked and unmasked patches are flipped at each reconstruction attempt. This ensures that the salient characteristics of an out-of-domain image are covered in some attempts. We default to <italic>N</italic> &#x0003D; 8. Experiments with different values of <italic>N</italic> can be found in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The alternating checkerboard mask. We flip the masked and unmasked regions at each reconstruction attempt. The example in the figure is 8 &#x000D7; 8.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0003.tif"/>
</fig> 
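The alternating checkerboard mask described above is simple to construct; a minimal NumPy sketch (mask value 1 denoting a masked pixel) is:

```python
import numpy as np

def checkerboard_mask(L, N=8, flip=False):
    """Alternating checkerboard mask over an L x L image.

    Divides the image into an N x N grid of (L/N x L/N) patches and masks
    every other patch, covering 50% of the pixels. flip=True swaps the
    masked and unmasked patches, as done between reconstruction attempts.
    """
    assert L % N == 0, "patch size must divide the image size"
    patch = L // N
    rows, cols = np.indices((N, N))
    grid = (rows + cols + int(flip)) % 2          # checkerboard over patches
    return np.kron(grid, np.ones((patch, patch), dtype=int))
```

For a 32 &#x000D7; 32 image with N = 8, each attempt masks exactly half the pixels, and the flipped mask is the exact complement, so every pixel is masked in some attempt.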
</sec>
<sec>
<title>2.2.2 Measuring reconstruction distance</title>
<p>We use the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., <xref ref-type="bibr" rid="B67">2018</xref>) metric to measure the distance between the original and reconstructed images. LPIPS utilizes calibrated intermediate activations of a pretrained neural network as features, and measures the normalized &#x02113;<sub>2</sub> distance between the features of two images. This yields a value between 0 and 1, where a lower value indicates higher similarity. We employ the version with AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B25">2012</xref>) backbone pretrained on ImageNet.<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> LPIPS has been observed to align with human perception of image similarity (Zhang et al., <xref ref-type="bibr" rid="B67">2018</xref>), and has been applied in research on a wide range of tasks (Karras et al., <xref ref-type="bibr" rid="B21">2019</xref>; Alaluf et al., <xref ref-type="bibr" rid="B2">2021</xref>; Meng et al., <xref ref-type="bibr" rid="B33">2021</xref>) and image modalities (Gong et al., <xref ref-type="bibr" rid="B11">2021</xref>; Lugmayr et al., <xref ref-type="bibr" rid="B31">2022</xref>; Toda et al., <xref ref-type="bibr" rid="B54">2022</xref>). Experiments with alternative metric choices are reported in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
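The structure of the LPIPS computation can be sketched schematically as follows. The feature maps and per-channel weights here are stand-ins; the actual metric uses calibrated activations from a pretrained AlexNet and learned calibration weights, which this sketch does not reproduce.

```python
import numpy as np

def lpips_like(feats_a, feats_b, weights):
    """Schematic LPIPS-style distance (after Zhang et al., 2018).

    feats_a, feats_b: lists of (C, H, W) activation maps, one per layer,
    from some pretrained network (stand-ins here).
    weights: per-layer calibration vectors of shape (C,) (stand-ins here).
    """
    total = 0.0
    for fa, fb, w in zip(feats_a, feats_b, weights):
        # Unit-normalize the channel vector at each spatial position.
        fa = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + 1e-10)
        fb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-10)
        # Channel-weighted squared difference, averaged over space.
        diff = (w[:, None, None] * (fa - fb)) ** 2
        total += diff.sum(axis=0).mean()
    return float(total)
```

Identical inputs yield a distance of zero, and larger perceptual discrepancies yield larger values, which is the property LMD relies on.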
</sec>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3 Results</title>
<sec>
<title>3.1 Experiment settings</title>
<p>We benchmark LMD against existing unsupervised OOD detection methods on widely used datasets. We provide fine-grained analysis and visualizations of the reconstructed images to better understand LMD&#x00027;s performance. Additionally, we perform ablation studies to analyze the individual components of LMD.</p>
<sec>
<title>3.1.1 Baselines</title>
<p>We compare LMD with seven existing baselines, covering three mainstream classes of methods: likelihood-based, reconstruction-based and feature-based. For likelihood-based methods, we consider Likelihood <bold>(Likelihood)</bold> (Bishop, <xref ref-type="bibr" rid="B6">1994</xref>), Input Complexity <bold>(IC)</bold> (Serr&#x000E0; et al., <xref ref-type="bibr" rid="B48">2019</xref>) and Likelihood Regret <bold>(LR)</bold> (Xiao et al., <xref ref-type="bibr" rid="B61">2020</xref>). We obtain the likelihood from the diffusion model using Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>)&#x00027;s approach.<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> We adapt the official GitHub repository of Likelihood Regret<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> for both Likelihood Regret and Input Complexity. For Input Complexity, we leverage the likelihood from the diffusion model to ensure fairness in comparison; we have experimented with both the PNG compressor and the JPEG compressor, and we report the results from the PNG compressor due to its superior performance. For reconstruction-based methods, we consider Reconstruction with Autoencoder and Mean Squared Error loss <bold>(AE-MSE)</bold>, AutoMahalanobis <bold>(AE-MH)</bold> (Denouden et al., <xref ref-type="bibr" rid="B10">2018</xref>) and AnoGAN <bold>(AnoGAN)</bold> (Schlegl et al., <xref ref-type="bibr" rid="B46">2017</xref>). For feature-based methods, we consider Pretrained Feature Extractor &#x0002B; Mahalanobis Distance <bold>(Pretrained)</bold> (Xiao et al., <xref ref-type="bibr" rid="B62">2021</xref>). We use our own implementation, as our best efforts did not locate an existing one.</p>
</sec>
<sec>
<title>3.1.2 Evaluation</title>
<p>We evaluate the performance of LMD and the baselines using the area under the Receiver Operating Characteristic curve (ROC-AUC), following the practice of existing works (Hendrycks and Gimpel, <xref ref-type="bibr" rid="B15">2016</xref>; Ren et al., <xref ref-type="bibr" rid="B39">2019</xref>; Xiao et al., <xref ref-type="bibr" rid="B62">2021</xref>). OOD detection methods commonly produce numeric OOD scores, and apply a decision threshold to classify data as in-domain or out-of-domain. The ROC curve plots the true positive rate against the false positive rate at various decision thresholds, and ROC-AUC measures the area under the curve. ROC-AUC ranges between 0 and 1, with higher values indicating better performance. A detector achieves ROC-AUC &#x0003E;0.5 when it generally assigns higher OOD scores to out-of-domain images than to in-domain images. Conversely, it yields ROC-AUC &#x0003C; 0.5 when it generally assigns higher OOD scores to in-domain images.</p>
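ROC-AUC has an equivalent rank interpretation: it is the probability that a randomly chosen out-of-domain image receives a higher OOD score than a randomly chosen in-domain image, with ties counting half. A minimal sketch of this computation (the pairwise form, quadratic in sample count and intended only for illustration) is:

```python
import numpy as np

def roc_auc(ood_scores, id_scores):
    """ROC-AUC via its rank interpretation: P(score_OOD > score_ID),
    with ties contributing 1/2. Equivalent to the area under the ROC
    curve swept over all decision thresholds."""
    ood = np.asarray(ood_scores, dtype=float)[:, None]
    ind = np.asarray(id_scores, dtype=float)[None, :]
    wins = (ood > ind).mean()   # OOD correctly scored above in-domain
    ties = (ood == ind).mean()  # ties count half
    return float(wins + 0.5 * ties)
```

Perfect separation gives 1.0, a reversed detector gives 0.0, and chance-level scoring gives 0.5.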
</sec>
<sec>
<title>3.1.3 Datasets</title>
<p>For quantitative evaluations, we consider pairwise combinations of CIFAR10 (Krizhevsky, <xref ref-type="bibr" rid="B24">2009</xref>), CIFAR100 (Krizhevsky, <xref ref-type="bibr" rid="B24">2009</xref>) and SVHN (Netzer et al., <xref ref-type="bibr" rid="B37">2011</xref>), and pairwise combinations of MNIST (LeCun et al., <xref ref-type="bibr" rid="B26">2010</xref>), KMNIST (Clanuwat et al., <xref ref-type="bibr" rid="B9">2018</xref>), and FashionMNIST (Xiao et al., <xref ref-type="bibr" rid="B60">2017</xref>), as the in-domain and out-of-domain datasets. This yields 12 pairs in total. For qualitative evaluations, we further present visualizations on two pairs of in-domain vs. out-of-domain datasets with higher image resolutions: CelebA-HQ (Karras et al., <xref ref-type="bibr" rid="B20">2017</xref>) vs. ImageNet (Russakovsky et al., <xref ref-type="bibr" rid="B42">2015</xref>), and LSUN bedroom (Yu et al., <xref ref-type="bibr" rid="B65">2015</xref>) vs. LSUN classroom (Yu et al., <xref ref-type="bibr" rid="B65">2015</xref>). We standardize these images to 256 &#x000D7; 256.</p>
</sec>
<sec>
<title>3.1.4 Implementation details of LMD</title>
<p>We build LMD on top of Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>)&#x00027;s implementation. For datasets in <xref ref-type="table" rid="T3">Table 3</xref>, we use DDPM&#x0002B;&#x0002B; models with SubVP SDE. We take Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>)&#x00027;s pretrained CIFAR10 checkpoint, and train from scratch for the other datasets. We use alternating checkerboard 8 &#x000D7; 8 mask (<xref ref-type="fig" rid="F3">Figure 3</xref>), reconstruction distance metric LPIPS and 10 reconstructions per image for LMD.</p>
<p>For the higher resolution datasets, we use NCSN&#x0002B;&#x0002B; models with VE SDE. We take Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>)&#x00027;s pretrained FFHQ (Karras et al., <xref ref-type="bibr" rid="B21">2019</xref>) checkpoint for CelebA-HQ vs. ImageNet. This is to avoid model memorization concerns, given that the CelebA-HQ checkpoint is pretrained over the whole dataset. We use Song et al. (<xref ref-type="bibr" rid="B51">2020</xref>)&#x00027;s pretrained LSUN bedroom checkpoint for LSUN bedroom vs. LSUN classroom. For these datasets, we consider a checkerboard 4 &#x000D7; 4 mask, a checkerboard 8 &#x000D7; 8 mask and a centered square mask, with <italic>one</italic> reconstruction per image. We additionally report the ROC-AUC from our default configuration of alternating 8 &#x000D7; 8 checkerboard and 10 reconstructions per image as a reference. We use LPIPS as the distance metric.</p>
</sec>
</sec>
<sec>
<title>3.2 Quantitative results and analysis</title>
<p>We present the OOD detection performance of LMD and the baselines on 12 dataset pairs in <xref ref-type="table" rid="T1">Table 1</xref>. LMD attains the highest ROC-AUC on five pairs, while demonstrating consistent and strong performance on others. Specifically, on CIFAR100 vs. SVHN, it attains 10% higher ROC-AUC than the best baseline performance. LMD also attains the highest average ROC-AUC of 0.907, which is 9% higher than the best average performance among the baselines. We visualize examples of the in-domain and out-of-domain reconstructions of LMD in <xref ref-type="fig" rid="F4">Figure 4</xref>. In general, in-domain reconstructions resemble their original images, while out-of-domain reconstructions are fragmented and noisy.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>ROC-AUC of LMD and the baselines.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>ID</bold></th>
<th valign="top" align="center"><bold>OOD</bold></th>
<th valign="top" align="center"><bold>Likelihood</bold></th>
<th valign="top" align="center"><bold>IC</bold></th>
<th valign="top" align="center"><bold>LR</bold></th>
<th valign="top" align="center"><bold>Pretrained</bold></th>
<th valign="top" align="center"><bold>AE-MSE</bold></th>
<th valign="top" align="center"><bold>AE-MH</bold></th>
<th valign="top" align="center"><bold>AnoGAN</bold></th>
<th valign="top" align="center"><bold>LMD</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CIFAR10</td>
<td valign="top" align="center">CIFAR100</td>
<td valign="top" align="center">0.520</td>
<td valign="top" align="center">0.568</td>
<td valign="top" align="center">0.546</td>
<td valign="top" align="center"><bold>0.806</bold></td>
<td valign="top" align="center">0.510</td>
<td valign="top" align="center">0.488</td>
<td valign="top" align="center">0.518</td>
<td valign="top" align="center">0.607</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">SVHN</td>
<td valign="top" align="center">0.180</td>
<td valign="top" align="center">0.870</td>
<td valign="top" align="center">0.904</td>
<td valign="top" align="center">0.888</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.073</td>
<td valign="top" align="center">0.120</td>
<td valign="top" align="center"><bold>0.992</bold></td>
</tr>
<tr>
<td valign="top" align="left">CIFAR100</td>
<td valign="top" align="center">CIFAR10</td>
<td valign="top" align="center">0.495</td>
<td valign="top" align="center">0.468</td>
<td valign="top" align="center">0.484</td>
<td valign="top" align="center">0.543</td>
<td valign="top" align="center">0.509</td>
<td valign="top" align="center">0.486</td>
<td valign="top" align="center">0.510</td>
<td valign="top" align="center"><bold>0.568</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="center">SVHN</td>
<td valign="top" align="center">0.193</td>
<td valign="top" align="center">0.792</td>
<td valign="top" align="center">0.896</td>
<td valign="top" align="center">0.776</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.122</td>
<td valign="top" align="center">0.131</td>
<td valign="top" align="center"><bold>0.985</bold></td>
</tr>
<tr>
<td valign="top" align="left">SVHN</td>
<td valign="top" align="center">CIFAR10</td>
<td valign="top" align="center">0.974</td>
<td valign="top" align="center">0.973</td>
<td valign="top" align="center">0.805</td>
<td valign="top" align="center"><bold>0.999</bold></td>
<td valign="top" align="center">0.981</td>
<td valign="top" align="center">0.966</td>
<td valign="top" align="center">0.967</td>
<td valign="top" align="center">0.914</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">CIFAR100</td>
<td valign="top" align="center">0.970</td>
<td valign="top" align="center">0.976</td>
<td valign="top" align="center">0.821</td>
<td valign="top" align="center"><bold>0.999</bold></td>
<td valign="top" align="center">0.980</td>
<td valign="top" align="center">0.966</td>
<td valign="top" align="center">0.962</td>
<td valign="top" align="center">0.876</td>
</tr>
<tr>
<td valign="top" align="left">MNIST</td>
<td valign="top" align="center">KMNIST</td>
<td valign="top" align="center">0.948</td>
<td valign="top" align="center">0.903</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.887</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center"><bold>1.000</bold></td>
<td valign="top" align="center">0.933</td>
<td valign="top" align="center">0.984</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">FashionMNIST</td>
<td valign="top" align="center">0.997</td>
<td valign="top" align="center"><bold>1.000</bold></td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center"><bold>1.000</bold></td>
<td valign="top" align="center"><bold>1.000</bold></td>
<td valign="top" align="center">0.992</td>
<td valign="top" align="center">0.999</td>
</tr>
<tr>
<td valign="top" align="left">KMNIST</td>
<td valign="top" align="center">MNIST</td>
<td valign="top" align="center">0.152</td>
<td valign="top" align="center">0.951</td>
<td valign="top" align="center">0.431</td>
<td valign="top" align="center">0.582</td>
<td valign="top" align="center">0.102</td>
<td valign="top" align="center">0.217</td>
<td valign="top" align="center">0.317</td>
<td valign="top" align="center"><bold>0.978</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="center">FashionMNIST</td>
<td valign="top" align="center">0.833</td>
<td valign="top" align="center"><bold>0.999</bold></td>
<td valign="top" align="center">0.557</td>
<td valign="top" align="center">0.993</td>
<td valign="top" align="center">0.896</td>
<td valign="top" align="center">0.868</td>
<td valign="top" align="center">0.701</td>
<td valign="top" align="center">0.993</td>
</tr>
<tr>
<td valign="top" align="left">FashionMNIST</td>
<td valign="top" align="center">MNIST</td>
<td valign="top" align="center">0.172</td>
<td valign="top" align="center">0.912</td>
<td valign="top" align="center">0.971</td>
<td valign="top" align="center">0.647</td>
<td valign="top" align="center">0.804</td>
<td valign="top" align="center">0.969</td>
<td valign="top" align="center">0.835</td>
<td valign="top" align="center"><bold>0.992</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="center">KMNIST</td>
<td valign="top" align="center">0.542</td>
<td valign="top" align="center">0.584</td>
<td valign="top" align="center">0.994</td>
<td valign="top" align="center">0.730</td>
<td valign="top" align="center">0.976</td>
<td valign="top" align="center"><bold>0.996</bold></td>
<td valign="top" align="center">0.912</td>
<td valign="top" align="center">0.990</td>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Average</td>
<td valign="top" align="center">0.581</td>
<td valign="top" align="center">0.833</td>
<td valign="top" align="center">0.783</td>
<td valign="top" align="center">0.821</td>
<td valign="top" align="center">0.651</td>
<td valign="top" align="center">0.679</td>
<td valign="top" align="center">0.658</td>
<td valign="top" align="center"><bold>0.907</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Higher values are better. We use the default configuration of an alternating checkerboard 8 &#x000D7; 8 mask, the LPIPS metric, and 10 reconstructions per image for all experiments. LMD consistently demonstrates strong performance and attains the highest average ROC-AUC. Bold values indicate the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Example reconstructions from three dataset pairs. &#x0201C;Orig.&#x0201D; is the original image and &#x0201C;Inp.&#x0201D; is the inpainted image. Generally, the in-domain reconstructions are faithful, while the out-of-domain reconstructions are noisy and dissimilar.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0004.tif"/>
</fig>
<p>We further conduct fine-grained analysis to understand LMD&#x00027;s performance. We observe that each dataset in <xref ref-type="table" rid="T1">Table 1</xref> consists of images from multiple distinct semantic categories, forming a diverse data distribution. For example, CIFAR10 comprises 10 different objects or animals, and SVHN comprises 10 digits. We seek to understand whether LMD performs similarly across different semantic categories, or if certain categories are more challenging for LMD than others. Specifically, we group the images by their ground truth classes, and examine the distinguishability of the OOD scores for each pair of classes of the in-domain vs. out-of-domain datasets. We present the results for CIFAR10 vs. SVHN and SVHN vs. CIFAR10 in <xref ref-type="fig" rid="F5">Figure 5</xref>. On CIFAR10 vs. SVHN, all pairs of classes are highly distinguishable, with ROC-AUC ranging from 0.97 to 1. This is unsurprising given that LMD attains strong performance of ROC-AUC 0.992 on this pair. On SVHN vs. CIFAR10, pairwise performance shows visible variation, with ROC-AUC ranging from 0.84 to 0.97. Specifically, the ROC-AUC is relatively low when the in-domain class is &#x0201C;3&#x0201D; or &#x0201C;5,&#x0201D; and when the out-of-domain class is &#x0201C;deer&#x0201D; or &#x0201C;frog.&#x0201D; This suggests that LMD&#x00027;s satisfactory but suboptimal performance on SVHN vs. CIFAR10 is primarily attributable to the difficulty of distinguishing between certain semantic categories.</p>
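<p>The per-class grid can be computed from per-image OOD scores with a rank-based ROC-AUC. This is a sketch assuming out-of-domain images receive higher scores; the function names are ours, not the authors'.</p>

```python
import numpy as np

def roc_auc(id_scores, ood_scores):
    """ROC-AUC via the Mann-Whitney statistic: P(OOD score > ID score), ties count half."""
    id_s = np.asarray(id_scores, dtype=float)
    ood_s = np.asarray(ood_scores, dtype=float)
    greater = (ood_s[:, None] > id_s[None, :]).mean()
    ties = (ood_s[:, None] == id_s[None, :]).mean()
    return float(greater + 0.5 * ties)

def per_class_auc(id_scores_by_class, ood_scores_by_class):
    """ROC-AUC for every (in-domain class, out-of-domain class) pair."""
    return {
        (ic, oc): roc_auc(id_s, ood_s)
        for ic, id_s in id_scores_by_class.items()
        for oc, ood_s in ood_scores_by_class.items()
    }
```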
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Per-Class ROC-AUC for CIFAR10 vs. SVHN and SVHN vs. CIFAR10. The classes for CIFAR10 are: 1, airplane; 2, automobile; 3, bird; 4, cat; 5, deer; 6, dog; 7, frog; 8, horse; 9, ship; and 10, truck. <bold>(A)</bold> CIFAR10 vs. SVHN. <bold>(B)</bold> SVHN vs. CIFAR10.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0005.tif"/>
</fig>
</sec>
<sec>
<title>3.3 Qualitative studies on higher resolution images</title>
<p>We show qualitative results on images with resolution 256 &#x000D7; 256 for two in-domain/out-of-domain pairs: CelebA-HQ vs. ImageNet (<xref ref-type="fig" rid="F6">Figure 6</xref>) and LSUN bedroom vs. LSUN classroom (<xref ref-type="fig" rid="F7">Figure 7</xref>). The ROC-AUCs in the images correspond to LMD&#x00027;s performance with only <italic>one</italic> reconstruction attempt. As a reference, under our default configuration of alternating checkerboard 8 &#x000D7; 8 mask and 10 reconstruction attempts, CelebA-HQ vs. ImageNet has a ROC-AUC of <bold>0.993</bold>, and LSUN bedroom vs. LSUN classroom has a ROC-AUC of <bold>0.927</bold>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Examples of image reconstruction from CelebA-HQ (in-domain) and ImageNet (out-of-domain). For out-of-domain reconstructions, the checkerboard masks result in local inconsistencies, while the center mask hallucinates faces. In this case, employing larger masked patches slightly improves the performance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0006.tif"/>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Reconstruction examples from LSUN bedroom (in-domain) and LSUN classroom (out-of-domain). As bedroom images are diverse and contain richer details, a mask with smaller patches is preferable.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0007.tif"/>
</fig>
<p>For CelebA-HQ vs. ImageNet, LMD performs competitively under all three mask choices, and achieves ROC-AUC ranging from 0.991 to 1 even without multiple reconstructions. Given the highly structured nature of human faces, the in-domain reconstructions under all three masks are accurate. For the out-of-domain images, reconstructions under the checkerboard masks contain local distortions, while reconstructions under the center mask tend to hallucinate faces. As a result, in this case, the in-domain and out-of-domain reconstructions become more discernible when employing larger patches in masking.</p>
<p>For LSUN bedroom vs. LSUN classroom, the checkerboard 8 &#x000D7; 8 mask attains strong results, while the checkerboard 4 &#x000D7; 4 mask and the square-centered mask demonstrate suboptimal performance. This is because bedroom images exhibit greater variation and contain more intricate details. Consequently, when large patches are masked, the diffusion model may fill in plausible yet different content, resulting in significant reconstruction discrepancies for in-domain images. In fact, even with the checkerboard 8 &#x000D7; 8 mask, the diffusion model may hallucinate or alter elements in the bedroom inpaintings. Moreover, the complex and diverse nature of bedroom images poses substantial challenges for the diffusion model to accurately learn the in-domain distribution; samples and inpaintings from the LSUN bedroom model generally have lower quality than those from the CelebA-HQ model.</p>
<p>Results from these two dataset pairs collectively demonstrate that LMD can scale to higher resolution images with richer details. They also highlight the checkerboard 8 &#x000D7; 8 mask as a versatile default choice, as it is effective for both structured and diverse in-domain distributions. For further discussion of mask choices, please refer to Section 3.4.1.</p>
</sec>
<sec>
<title>3.4 Ablation studies</title>
<sec>
<title>3.4.1 Mask choice</title>
<p><xref ref-type="table" rid="T2">Table 2</xref> presents the performance of LMD under alternative mask choices. Besides our default mask, we consider an alternating checkerboard 4 &#x000D7; 4 mask, an alternating checkerboard 16 &#x000D7; 16 mask, a fixed checkerboard 8 &#x000D7; 8 mask for which we do not perform the flipping operation, a square-centered mask, and a random patch mask following Xie et al. (<xref ref-type="bibr" rid="B63">2022</xref>).<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> <xref ref-type="fig" rid="F8">Figure 8</xref> visualizes the mask patterns. We experiment on three dataset pairs: CIFAR10 vs. CIFAR100, CIFAR10 vs. SVHN, and MNIST vs. KMNIST. For all mask choices, we perform 10 reconstructions per image and use LPIPS as the reconstruction distance metric.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Performance of ROC-AUC on three dataset pairs with different mask types.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Mask type</bold></th>
<th valign="top" align="center"><bold>CIFAR10 vs. CIFAR100</bold></th>
<th valign="top" align="center"><bold>CIFAR10 vs. SVHN</bold></th>
<th valign="top" align="center"><bold>MNIST vs. KMNIST</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Alternating checkerboard 4 &#x000D7; 4</td>
<td valign="top" align="center">0.594</td>
<td valign="top" align="center">0.987</td>
<td valign="top" align="center">0.923</td>
</tr>
<tr>
<td valign="top" align="left">Alternating checkerboard 8 &#x000D7; 8</td>
<td valign="top" align="center"><bold>0.607</bold></td>
<td valign="top" align="center"><bold>0.992</bold></td>
<td valign="top" align="center">0.984</td>
</tr>
<tr>
<td valign="top" align="left">Alternating checkerboard 16 &#x000D7; 16</td>
<td valign="top" align="center">0.597</td>
<td valign="top" align="center">0.981</td>
<td valign="top" align="center"><bold>0.997</bold></td>
</tr>
<tr>
<td valign="top" align="left">Fixed checkerboard 8 &#x000D7; 8</td>
<td valign="top" align="center">0.601</td>
<td valign="top" align="center">0.990</td>
<td valign="top" align="center">0.974</td>
</tr>
<tr>
<td valign="top" align="left">Center</td>
<td valign="top" align="center">0.570</td>
<td valign="top" align="center">0.978</td>
<td valign="top" align="center">0.479</td>
</tr>
<tr>
<td valign="top" align="left">Random patch</td>
<td valign="top" align="center">0.591</td>
<td valign="top" align="center">0.990</td>
<td valign="top" align="center">0.912</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The alternating checkerboard 8 &#x000D7; 8 mask shows strong and consistent results. Bold values indicate the best performance, i.e., the highest ROC-AUC, among the evaluated mask types for each dataset pair.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Masks used in the mask ablation. The random patch mask in the figure is just one example; a different pattern is sampled each time.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0008.tif"/>
</fig>
<p>Our default mask choice of alternating checkerboard 8 &#x000D7; 8 shows consistent and strong performance. The alternating checkerboard 16 &#x000D7; 16, fixed checkerboard 8 &#x000D7; 8, and random patch masks are competitive but underperform the default choice. Nevertheless, an alternating checkerboard mask is recommended over a fixed checkerboard or random patch mask, as it ensures that all parts of the image are covered in some of the reconstruction attempts. The alternating checkerboard 4 &#x000D7; 4 and square-centered masks show suboptimal performance on MNIST vs. KMNIST. This is because they mask out too much information from the images, and therefore lead to unfaithful reconstructions for both in-domain and out-of-domain images.</p>
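<p>A checkerboard mask pair can be constructed as below. This sketch assumes that &#x0201C;N &#x000D7; N&#x0201D; denotes the grid of checkerboard cells (so a larger N means smaller patches), which is consistent with the 4 &#x000D7; 4 checkerboard masking larger contiguous regions than the 8 &#x000D7; 8 one; the exact convention is a detail of the original implementation.</p>

```python
import numpy as np

def checkerboard_masks(image_size, grid_size):
    """Masks for an N x N checkerboard grid (larger N -> smaller patches).
    Returns a mask and its flipped complement; alternating between them
    ensures every pixel is covered in some reconstruction attempt."""
    patch = image_size // grid_size
    cell = np.arange(image_size) // patch
    parity = (cell[:, None] + cell[None, :]) % 2
    mask_a = parity.astype(bool)
    return mask_a, ~mask_a
```

<p>Using both masks across attempts is what distinguishes the alternating variant from the fixed one, which always masks the same cells.</p>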
</sec>
<sec>
<title>3.4.2 Reconstruction distance metric</title>
<p>We study the effect of using alternative metrics for measuring the reconstruction distance. We consider two popular metrics, Mean Squared Error (MSE) and Structural Similarity Index Measure (SSIM) (Wang et al., <xref ref-type="bibr" rid="B56">2003</xref>), both of which have been widely used for image comparison (Zhang et al., <xref ref-type="bibr" rid="B66">2019</xref>; Bhat et al., <xref ref-type="bibr" rid="B5">2021</xref>; Saharia et al., <xref ref-type="bibr" rid="B43">2022</xref>). We further observe that Xiao et al. (<xref ref-type="bibr" rid="B62">2021</xref>) uses features from a ResNet-50 pretrained with SimCLRv2 (Chen et al., <xref ref-type="bibr" rid="B7">2020</xref>) on ImageNet, and achieves superior performance on CIFAR10 vs. CIFAR100. Thus, we also consider a SimCLRv2-based metric, in which we calculate the cosine distance between the SimCLRv2 features of the original and reconstructed images.</p>
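<p>As an illustration of the SimCLRv2-based metric, the cosine distance between two feature vectors can be computed as follows. Extracting the features themselves requires the pretrained SimCLRv2 encoder, which is omitted here; only the distance computation is shown.</p>

```python
import numpy as np

def cosine_distance(feat_a, feat_b, eps=1e-12):
    """1 - cosine similarity between two feature vectors (e.g., encoder embeddings)."""
    a = np.asarray(feat_a, dtype=float).ravel()
    b = np.asarray(feat_b, dtype=float).ravel()
    sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - sim)
```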
<p>We present the performance of LMD under different distance metrics in <xref ref-type="table" rid="T3">Table 3</xref>. MSE and SSIM demonstrate poor performance when SVHN is the out-of-domain dataset. Our default choice, LPIPS, demonstrates strong and consistent performance, and attains the highest average ROC-AUC. SimCLRv2 is competitive but underperforms LPIPS. This suggests that deep-feature-based metrics are in general effective, and that LPIPS is suitable as a default choice.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>ROC-AUC performance under different reconstruction distance metrics.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>ID</bold></th>
<th valign="top" align="center"><bold>OOD</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>SSIM</bold></th>
<th valign="top" align="center"><bold>LPIPS</bold></th>
<th valign="top" align="center"><bold>SimCLRv2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CIFAR10</td>
<td valign="top" align="center">CIFAR100</td>
<td valign="top" align="center">0.548</td>
<td valign="top" align="center">0.624</td>
<td valign="top" align="center">0.607</td>
<td valign="top" align="center">0.713</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">SVHN</td>
<td valign="top" align="center">0.155</td>
<td valign="top" align="center">0.329</td>
<td valign="top" align="center">0.992</td>
<td valign="top" align="center">0.970</td>
</tr>
<tr>
<td valign="top" align="left">CIFAR100</td>
<td valign="top" align="center">CIFAR10</td>
<td valign="top" align="center">0.549</td>
<td valign="top" align="center">0.551</td>
<td valign="top" align="center">0.568</td>
<td valign="top" align="center">0.523</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">SVHN</td>
<td valign="top" align="center">0.157</td>
<td valign="top" align="center">0.258</td>
<td valign="top" align="center">0.985</td>
<td valign="top" align="center">0.924</td>
</tr>
<tr>
<td valign="top" align="left">SVHN</td>
<td valign="top" align="center">CIFAR10</td>
<td valign="top" align="center">0.987</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">0.914</td>
<td valign="top" align="center">0.933</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">CIFAR100</td>
<td valign="top" align="center">0.979</td>
<td valign="top" align="center">0.995</td>
<td valign="top" align="center">0.876</td>
<td valign="top" align="center">0.928</td>
</tr>
<tr>
<td valign="top" align="left">MNIST</td>
<td valign="top" align="center">KMNIST</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">0.997</td>
<td valign="top" align="center">0.984</td>
<td valign="top" align="center">0.983</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">FashionMNIST</td>
<td valign="top" align="center">0.995</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.999</td>
<td valign="top" align="center">0.999</td>
</tr>
<tr>
<td valign="top" align="left">KMNIST</td>
<td valign="top" align="center">MNIST</td>
<td valign="top" align="center">0.835</td>
<td valign="top" align="center">0.922</td>
<td valign="top" align="center">0.978</td>
<td valign="top" align="center">0.920</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">FashionMNIST</td>
<td valign="top" align="center">0.802</td>
<td valign="top" align="center">0.979</td>
<td valign="top" align="center">0.993</td>
<td valign="top" align="center">0.995</td>
</tr>
<tr>
<td valign="top" align="left">FashionMNIST</td>
<td valign="top" align="center">MNIST</td>
<td valign="top" align="center">0.993</td>
<td valign="top" align="center">0.960</td>
<td valign="top" align="center">0.992</td>
<td valign="top" align="center">0.961</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">KMNIST</td>
<td valign="top" align="center">0.998</td>
<td valign="top" align="center">0.988</td>
<td valign="top" align="center">0.990</td>
<td valign="top" align="center">0.977</td>
</tr>
<tr>
<td valign="top" align="left">Average</td>
<td/>
<td valign="top" align="center">0.750</td>
<td valign="top" align="center">0.800</td>
<td valign="top" align="center"><bold>0.907</bold></td>
<td valign="top" align="center">0.902</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>LPIPS demonstrates consistent and robust results, while the other metrics exhibit performance fluctuations. Bold values indicate the best performance, i.e., the highest ROC-AUC, among the evaluated distance metrics in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>3.4.3 Number of reconstructions per image</title>
<p>We examine LMD&#x00027;s performance under different numbers of reconstructions per image. <xref ref-type="fig" rid="F9">Figure 9</xref> plots the ROC-AUC against the number of reconstructions per image for MNIST vs. KMNIST and KMNIST vs. MNIST. LMD&#x00027;s performance improves as the number of reconstructions increases, regardless of the choice of distance metric. The improvement is especially pronounced for the first five attempts, and gradually plateaus as the number of attempts approaches 10. This suggests that 10 attempts per image are generally sufficient.</p>
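<p>Such curves can be traced by truncating the per-image distance matrix to the first <italic>k</italic> attempts before aggregating. The array layout below is our assumption, not the authors' data format.</p>

```python
import numpy as np

def scores_with_k_attempts(distances, k):
    """Median over the first k reconstruction distances per image.
    `distances` has shape (n_images, n_attempts)."""
    return np.median(np.asarray(distances, dtype=float)[:, :k], axis=1)
```

<p>Evaluating the resulting scores for k = 1, 2, ..., 10 then yields one ROC-AUC point per attempt count.</p>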
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>ROC-AUC vs. the number of reconstruction attempts. More reconstruction attempts enhance the OOD detection performance, irrespective of the distance metric. <bold>(A)</bold> MNIST vs. KMNIST. <bold>(B)</bold> KMNIST vs. MNIST.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0009.tif"/>
</fig>
</sec>
<sec>
<title>3.4.4 Alternative instantiation of lifting and mapping</title>
<p>We observe that another intuitive way of lifting and mapping images with a diffusion model is to lift by diffusing to an intermediate step <italic>t</italic> in the noise schedule, and to map by denoising back to the image distribution. We refer to this alternative instantiation as diffusion/denoising, and compare it with our default instantiation of masking/inpainting. Given that the image distribution is at <italic>t</italic> &#x0003D; 0 and the noise distribution is at <italic>t</italic> &#x0003D; <italic>T</italic>, the larger the <italic>t</italic> we diffuse to, the further away we lift an image from the manifold. We consider different lifting distances with <italic>t</italic> &#x0003D; 250, <italic>t</italic> &#x0003D; 500, and <italic>t</italic> &#x0003D; 750, where the full schedule has <italic>T</italic> &#x0003D; 1000. We use our default alternating checkerboard 8 &#x000D7; 8 mask for masking/inpainting. We use 10 reconstructions per image and the LPIPS metric for both diffusion/denoising and masking/inpainting.</p>
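<p>The lifting step of diffusion/denoising can be sketched with the standard DDPM forward process, which samples <italic>x<sub>t</sub></italic> from <italic>x</italic><sub>0</sub> in closed form. The linear beta schedule below is an illustrative assumption, not necessarily the paper's exact configuration.</p>

```python
import numpy as np

def diffuse_to_t(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    """Lift x0 to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise,
    where abar_t is the cumulative product of (1 - beta_s) for s <= t.
    Larger t -> smaller abar_t -> further from the image manifold."""
    if rng is None:
        rng = np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, alpha_bar
```

<p>Running the learned reverse process from <italic>x<sub>t</sub></italic> back to <italic>t</italic> &#x0003D; 0 completes the mapping; at large <italic>t</italic> the lifted point is nearly pure noise, while small <italic>t</italic> barely perturbs the image.</p>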
<p>We present the performance in <xref ref-type="table" rid="T4">Table 4</xref>. Diffusion/denoising with <italic>t</italic> &#x0003D; 250 and <italic>t</italic> &#x0003D; 750 demonstrates suboptimal performance on several pairs, indicating that these lifting distances are, respectively, too small or too large for the in-domain and out-of-domain images to be distinguishable. <italic>t</italic> &#x0003D; 500 is competitive but underperforms masking/inpainting. This suggests that while LMD is robust to alternative choices of lifting and mapping, masking/inpainting is the recommended instantiation.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>ROC-AUC performance of using diffusion/denoising vs. masking/inpainting.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>ID</bold></th>
<th valign="top" align="left"><bold>OOD</bold></th>
<th valign="top" align="left"><bold>Denoising (<italic>t</italic> = 250)</bold></th>
<th valign="top" align="left"><bold>Denoising (<italic>t</italic> = 500)</bold></th>
<th valign="top" align="left"><bold>Denoising (<italic>t</italic> = 750)</bold></th>
<th valign="top" align="left"><bold>Inpainting</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CIFAR10</td>
<td valign="top" align="left">CIFAR100</td>
<td valign="top" align="left">0.583</td>
<td valign="top" align="left">0.600</td>
<td valign="top" align="left">0.589</td>
<td valign="top" align="left">0.607</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">SVHN</td>
<td valign="top" align="left">0.967</td>
<td valign="top" align="left">0.976</td>
<td valign="top" align="left">0.954</td>
<td valign="top" align="left">0.992</td>
</tr>
<tr>
<td valign="top" align="left">CIFAR100</td>
<td valign="top" align="left">CIFAR10</td>
<td valign="top" align="left"><bold>0.568</bold></td>
<td valign="top" align="left">0.524</td>
<td valign="top" align="left">0.436</td>
<td valign="top" align="left"><bold>0.568</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">SVHN</td>
<td valign="top" align="left">0.949</td>
<td valign="top" align="left">0.957</td>
<td valign="top" align="left">0.904</td>
<td valign="top" align="left"><bold>0.985</bold></td>
</tr>
<tr>
<td valign="top" align="left">SVHN</td>
<td valign="top" align="left">CIFAR10</td>
<td valign="top" align="left">0.861</td>
<td valign="top" align="left"><bold>0.966</bold></td>
<td valign="top" align="left">0.957</td>
<td valign="top" align="left">0.914</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">CIFAR100</td>
<td valign="top" align="left">0.847</td>
<td valign="top" align="left">0.949</td>
<td valign="top" align="left"><bold>0.957</bold></td>
<td valign="top" align="left">0.876</td>
</tr>
<tr>
<td valign="top" align="left">MNIST</td>
<td valign="top" align="left">KMNIST</td>
<td valign="top" align="left">0.956</td>
<td valign="top" align="left"><bold>0.993</bold></td>
<td valign="top" align="left">0.715</td>
<td valign="top" align="left">0.984</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">FashionMNIST</td>
<td valign="top" align="left">0.998</td>
<td valign="top" align="left">0.998</td>
<td valign="top" align="left">0.927</td>
<td valign="top" align="left"><bold>0.999</bold></td>
</tr>
<tr>
<td valign="top" align="left">KMNIST</td>
<td valign="top" align="left">MNIST</td>
<td valign="top" align="left">0.645</td>
<td valign="top" align="left">0.972</td>
<td valign="top" align="left">0.721</td>
<td valign="top" align="left"><bold>0.978</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">FashionMNIST</td>
<td valign="top" align="left"><bold>0.998</bold></td>
<td valign="top" align="left">0.994</td>
<td valign="top" align="left">0.943</td>
<td valign="top" align="left">0.993</td>
</tr>
<tr>
<td valign="top" align="left">FashionMNIST</td>
<td valign="top" align="left">MNIST</td>
<td valign="top" align="left">0.428</td>
<td valign="top" align="left">0.941</td>
<td valign="top" align="left">0.876</td>
<td valign="top" align="left"><bold>0.992</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">KMNIST</td>
<td valign="top" align="left">0.567</td>
<td valign="top" align="left">0.943</td>
<td valign="top" align="left">0.862</td>
<td valign="top" align="left"><bold>0.990</bold></td>
</tr>
<tr>
<td valign="top" align="left">Average</td>
<td/>
<td valign="top" align="left">0.781</td>
<td valign="top" align="left">0.901</td>
<td valign="top" align="left">0.820</td>
<td valign="top" align="left"><bold>0.907</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Diffusion/denoising with <italic>t</italic> &#x0003D; 500 achieves reasonable performance but underperforms masking/inpainting. Bold values indicate the best performance, i.e., the highest ROC-AUC, among the evaluated methods in each setting, where a setting is either an in-domain vs. out-of-domain dataset pair or the average across all dataset pairs.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>3.4.5 Alternative choices for the inpainting model</title>
<p>We perform a qualitative evaluation of other classes of inpainting models in the LMD framework. We consider a <bold>Masked Autoencoder (MAE)</bold> (He et al., <xref ref-type="bibr" rid="B14">2022</xref>) trained on CIFAR10,<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref> and <bold>LaMa</bold> (Suvorov et al., <xref ref-type="bibr" rid="B52">2022</xref>),<xref ref-type="fn" rid="fn0006"><sup>6</sup></xref> a GAN-based inpainting model trained on CelebA-HQ. We perform <italic>one</italic> reconstruction per image, as both MAE and LaMa are deterministic.</p>
<p>Both models demonstrate lower performance than the diffusion model in various scenarios. <xref ref-type="fig" rid="F10">Figure 10</xref> shows LaMa&#x00027;s performance on CelebA-HQ vs. ImageNet. LaMa attains reasonable results but underperforms the diffusion model. Like the diffusion model, LaMa hallucinates faces with the center mask; unlike the diffusion model, however, the color and texture of its hallucinated faces are highly consistent with the surroundings.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Reconstruction examples from CelebA-HQ (in-domain) and ImageNet (out-of-domain) using LaMa, a GAN-based inpainting model. Unlike the diffusion model, LaMa produces less visible artifacts. Even though it also introduces face-like artifacts with the center mask, the faces have the colors and textures of the surrounding unmasked regions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0010.tif"/>
</fig>
<p><xref ref-type="fig" rid="F11">Figure 11</xref> shows MAE&#x00027;s performance on CIFAR10 vs. SVHN. Both in-domain and out-of-domain reconstructions are accurate when the individual masked patch sizes are small, while both deviate from the originals when the patch sizes are large. Performance-wise, inpainting with MAE attains an ROC-AUC of only 0.065 for the checkerboard 8 &#x000D7; 8 mask, 0.178 for the checkerboard 4 &#x000D7; 4 mask, and 0.403 for the center mask.</p>
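The ROC-AUC values above treat the reconstruction distance as an OOD score. A minimal sketch of this evaluation, with hypothetical synthetic distances in place of real reconstruction distances, shows how such scores map to an ROC-AUC (this is an illustration, not the paper's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical reconstruction distances: in an LMD-style evaluation, each
# test image's score is its distance to the model's reconstruction.
rng = np.random.default_rng(0)
in_domain_dist = rng.normal(loc=0.2, scale=0.05, size=100)   # small distances
out_domain_dist = rng.normal(loc=0.5, scale=0.10, size=100)  # large distances

scores = np.concatenate([in_domain_dist, out_domain_dist])
labels = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = out-of-domain

# 0.5 is chance level; values well below 0.5 (as with MAE above) mean the
# score ranks OOD images as *more* in-domain than actual in-domain images.
auc = roc_auc_score(labels, scores)
print(round(auc, 3))
```

With well-separated distance distributions like these, the AUC is close to 1; MAE's sub-0.5 results indicate its OOD reconstruction errors are often smaller than its in-domain ones.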
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Reconstruction examples from CIFAR10 (in-domain) and SVHN (out-of-domain) using MAE. Differentiating between in-domain and out-of-domain inpaintings is hard, because reconstructing SVHN from the known regions alone is relatively simple, and because MAE is trained to infer missing content from known regions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-07-1255566-g0011.tif"/>
</fig>
<p>The suboptimal performance of alternative inpainting models can be attributed to their ability to leverage multiple sources of information: not only their understanding of the training distribution, but also the color or texture of the unmasked parts of an image. Models like LaMa and MAE employ specialized loss functions and large masked ratios during training, and thus excel at inferring missing regions from known ones regardless of semantics. Consequently, these models are more prone to producing reasonable out-of-domain inpaintings, especially for simpler out-of-domain images. In contrast, a vanilla diffusion model is not specifically trained to infer missing regions from their surroundings. It relies primarily on its understanding of the training distribution to perform inpainting, and thus attains robust performance.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4 Discussion</title>
<sec>
<title>4.1 LMD&#x00027;s relationship with existing works</title>
<p>In the unsupervised setting, existing works generally follow one of three paradigms: likelihood-based, reconstruction-based, and feature-based. LMD is a reconstruction-based approach. Typically, reconstruction-based methods involve training a model on in-domain samples and assessing the reconstruction quality of a test data point under the model. Prior works commonly use autoencoders (Sakurada and Yairi, <xref ref-type="bibr" rid="B44">2014</xref>; Xia et al., <xref ref-type="bibr" rid="B59">2015</xref>; Zhou and Paffenroth, <xref ref-type="bibr" rid="B68">2017</xref>; Zong et al., <xref ref-type="bibr" rid="B69">2018</xref>) or GANs (Schlegl et al., <xref ref-type="bibr" rid="B46">2017</xref>; Li et al., <xref ref-type="bibr" rid="B28">2018</xref>). One concurrent work (Graham et al., <xref ref-type="bibr" rid="B12">2022</xref>) utilizes diffusion models and considers image reconstructions under varying numbers of diffusion and denoising steps. This contrasts with LMD, which repeatedly performs masking and inpainting with a fixed number of steps. These two approaches are orthogonal and complementary.</p>
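The core of the reconstruction-based paradigm can be sketched in a few lines. Here PCA serves as a toy stand-in for a "linear autoencoder" fit only on in-domain data; the paper's detectors are deep autoencoders, GANs, or diffusion models, and the data here is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

# In-domain data lies on a low-dimensional subspace of the ambient space;
# out-of-domain data does not. (Synthetic placeholder for real images.)
rng = np.random.default_rng(1)
in_domain = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 64))  # rank 20
out_domain = rng.normal(size=(50, 64))                              # full rank

# "Train" the reconstruction model on in-domain samples only.
model = PCA(n_components=20).fit(in_domain)

def recon_error(x):
    # OOD score = distance between the input and its reconstruction.
    x_hat = model.inverse_transform(model.transform(x))
    return np.linalg.norm(x - x_hat, axis=1)

# In-domain samples lie near the learned manifold, so their errors are small.
print(recon_error(in_domain[:50]).mean() < recon_error(out_domain).mean())
```

Thresholding this error (or ranking by it, as in ROC-AUC evaluation) yields a detector; LMD replaces the stand-in model with diffusion-based inpainting.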
<p>The likelihood-based paradigm has been extensively explored, with early contributions dating back to Bishop (<xref ref-type="bibr" rid="B6">1994</xref>). The core idea is to approximate the in-domain distribution with a generative model that supports likelihood computation (Salimans et al., <xref ref-type="bibr" rid="B45">2017</xref>; Kingma and Dhariwal, <xref ref-type="bibr" rid="B22">2018</xref>). Intuitively, the model should assign higher likelihood to in-domain data than out-of-domain data, but various studies have observed that this assumption often does not hold (Choi et al., <xref ref-type="bibr" rid="B8">2018</xref>; Nalisnick et al., <xref ref-type="bibr" rid="B35">2018</xref>; Kirichenko et al., <xref ref-type="bibr" rid="B23">2020</xref>). One line of work addresses this issue under a typicality test framework (Ren et al., <xref ref-type="bibr" rid="B39">2019</xref>; Serr&#x000E0; et al., <xref ref-type="bibr" rid="B48">2019</xref>; Xiao et al., <xref ref-type="bibr" rid="B61">2020</xref>). Essentially, they view likelihood as a model statistic rather than a literal measure of how likely a data point is in-domain. They examine the extent to which the model statistic of a test data point deviates from the typical distribution of model statistics for in-domain data. Notably, this is complementary to LMD, as the reconstruction distance can also be viewed as a model statistic. Other likelihood-based approaches include adjusting the likelihood by background likelihood (Ren et al., <xref ref-type="bibr" rid="B39">2019</xref>), image complexity (Serr&#x000E0; et al., <xref ref-type="bibr" rid="B48">2019</xref>) or the likelihood under optimal model configurations (Xiao et al., <xref ref-type="bibr" rid="B61">2020</xref>), or improving the generative model architectures (Maal&#x000F8;e et al., <xref ref-type="bibr" rid="B32">2019</xref>; Kirichenko et al., <xref ref-type="bibr" rid="B23">2020</xref>).</p>
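The typicality-test idea described above can be sketched with a one-dimensional Gaussian standing in for a deep generative model (an assumption for illustration only): the raw log-likelihood is treated as a model statistic, and a point is flagged when that statistic deviates from its typical range on held-out in-domain data.

```python
import numpy as np

# Toy "generative model": a Gaussian fit to in-domain training data.
rng = np.random.default_rng(2)
train = rng.normal(size=5000)
mu, sigma = train.mean(), train.std()

def log_lik(x):
    # Log-density under the fitted Gaussian (the "model statistic").
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Typical range of the statistic, estimated on held-out in-domain data.
held_out = rng.normal(size=1000)
stat_mean, stat_std = log_lik(held_out).mean(), log_lik(held_out).std()

def typicality_score(x):
    # Large |z| = atypical statistic = flagged as out-of-distribution,
    # regardless of whether the raw likelihood itself is high or low.
    return np.abs(log_lik(x) - stat_mean) / stat_std

inlier_z = typicality_score(np.array([0.0]))[0]   # near the mode
outlier_z = typicality_score(np.array([6.0]))[0]  # far in the tail
print(inlier_z, outlier_z)
```

Because LMD's reconstruction distance is also a per-sample model statistic, the same deviation-from-typical machinery could be layered on top of it.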
<p>The feature-based paradigm usually involves extracting lower-dimensional features from the data via unsupervised means, such as autoencoders (Denouden et al., <xref ref-type="bibr" rid="B10">2018</xref>), generative models (Ahmadian and Lindsten, <xref ref-type="bibr" rid="B1">2021</xref>), self-supervised training (Hendrycks et al., <xref ref-type="bibr" rid="B17">2019</xref>; Bergman and Hoshen, <xref ref-type="bibr" rid="B4">2020</xref>; Tack et al., <xref ref-type="bibr" rid="B53">2020</xref>; Sehwag et al., <xref ref-type="bibr" rid="B47">2021</xref>) or pretrained feature extractors (Xiao et al., <xref ref-type="bibr" rid="B62">2021</xref>). They then perform detection in the lower-dimensional space, typically with simple techniques such as fitting one-class Support Vector Machines or Gaussian Mixture Models.</p>
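The second stage of the feature-based paradigm is simple enough to sketch directly. Here the "features" are synthetic placeholders; in practice they would come from one of the unsupervised extractors cited above, and the density model is a Gaussian Mixture as mentioned in the text:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder features: in practice, outputs of an autoencoder,
# self-supervised network, or pretrained extractor.
rng = np.random.default_rng(3)
in_features = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
ood_features = rng.normal(loc=4.0, scale=1.0, size=(100, 8))

# Fit a simple density model on in-domain features only.
gmm = GaussianMixture(n_components=3, random_state=0).fit(in_features)

# Score test points by log-density; higher means "more in-domain".
in_scores = gmm.score_samples(in_features[:100])
ood_scores = gmm.score_samples(ood_features)
print(in_scores.mean() > ood_scores.mean())
```

A one-class SVM (`sklearn.svm.OneClassSVM`) could be substituted for the mixture model with the same fit-then-score structure.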
</sec>
<sec>
<title>4.2 Limitation and future work</title>
<p>One limitation of LMD is its speed. Vanilla diffusion models have a time-consuming denoising process that involves a large number of sampling steps. Therefore, similar to other diffusion-based approaches for various tasks (Meng et al., <xref ref-type="bibr" rid="B33">2021</xref>; Lugmayr et al., <xref ref-type="bibr" rid="B31">2022</xref>; Saharia et al., <xref ref-type="bibr" rid="B43">2022</xref>), LMD is currently not well-suited for real-time OOD detection. Several recent works have proposed methods to accelerate the sampling process of pre-trained diffusion models through noise rescaling (Nichol and Dhariwal, <xref ref-type="bibr" rid="B38">2021</xref>), sampler optimization (Watson et al., <xref ref-type="bibr" rid="B57">2022</xref>), or numerical methods (Liu et al., <xref ref-type="bibr" rid="B30">2022</xref>; Wizadwongsa and Suwajanakorn, <xref ref-type="bibr" rid="B58">2023</xref>). One future direction is to harness these methods to expedite LMD&#x00027;s detection.</p>
<p>Another potential extension is to utilize more advanced methods for aggregating reconstruction distances from multiple reconstructions, or even under different masks or distance metrics. As briefly discussed in Section 4.1, this can involve integrating typicality test approaches such as multiple hypothesis testing or learning density models (Nalisnick et al., <xref ref-type="bibr" rid="B36">2019</xref>; Morningstar et al., <xref ref-type="bibr" rid="B34">2021</xref>; Bergamin et al., <xref ref-type="bibr" rid="B3">2022</xref>).</p>
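As a point of reference for the aggregation step, the following sketch contrasts two simple ways of reducing several per-reconstruction distances to one OOD score. The numbers are invented for illustration; the median's robustness to an occasional bad inpainting is one reason a robust summary is a natural default before moving to the more advanced aggregation schemes cited above:

```python
import numpy as np

# One image, several stochastic inpaintings, each yielding a distance.
# The last reconstruction is an outlier (e.g., a failed inpainting).
distances = np.array([0.21, 0.19, 0.22, 0.20, 0.95])

median_score = np.median(distances)  # robust to the outlier reconstruction
mean_score = distances.mean()        # pulled upward by it

assert median_score < mean_score
print(median_score, mean_score)
```

Multiple-hypothesis-testing or learned-density aggregators would instead consume the full vector of distances (possibly across masks and metrics) rather than a single summary statistic.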
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5 Conclusion</title>
<p>We propose a novel method, <italic>Lift, Map, Detect</italic> (LMD), for unsupervised out-of-distribution detection. LMD leverages the diffusion model&#x00027;s strong ability to map images onto its training manifold, and detects images with a large distance between the original and mapped versions as OOD. Our extensive experiments and analysis show that LMD achieves strong performance across image distributions with varied characteristics. Future directions include accelerating LMD and leveraging more advanced aggregation of reconstruction distances.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: CIFAR10, CIFAR100, SVHN, MNIST, KMNIST, and FashionMNIST can be accessed through <ext-link ext-link-type="uri" xlink:href="https://pytorch.org/vision/stable/datasets.html">https://pytorch.org/vision/stable/datasets.html</ext-link>. CelebA-HQ: <ext-link ext-link-type="uri" xlink:href="https://github.com/tkarras/progressive_growing_of_gans">https://github.com/tkarras/progressive_growing_of_gans</ext-link>. ImageNet: <ext-link ext-link-type="uri" xlink:href="https://www.image-net.org/">https://www.image-net.org/</ext-link>. LSUN bedroom and LSUN classroom: <ext-link ext-link-type="uri" xlink:href="https://github.com/fyu/lsun">https://github.com/fyu/lsun</ext-link>.</p>
</sec>
<sec sec-type="ethics-statement" id="s7">
<title>Ethics statement</title>
<p>Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article, because these human face images are either from the public datasets CelebA-HQ and FFHQ, which are widely used in the machine learning and computer vision communities, or synthetic faces created by generative models.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>ZL and JZ contributed to the design of the research, performed the experiments, and wrote the manuscript. KW is the PhD supervisor of ZL and JZ; he conceptualized and directed the research, and revised the manuscript. All authors approved the submitted version.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by grants from DARPA AIE program, Geometries of Learning (HR00112290078), the Natural Sciences and Engineering Research Council of Canada (NSERC) (567916), the National Science Foundation NSF (IIS-2107161, III1526012, IIS-1149882, and IIS-1724282), and the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875).</p>
</sec>
<ack><p>We would like to thank Yufan Wang for helping with literature search and initial setups of some of the baselines.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>We use the implementation of <ext-link ext-link-type="uri" xlink:href="https://github.com/richzhang/PerceptualSimilarity">https://github.com/richzhang/PerceptualSimilarity</ext-link>.</p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/yang-song/score_sde_pytorch">https://github.com/yang-song/score_sde_pytorch</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/XavierXiao/Likelihood-Regret">https://github.com/XavierXiao/Likelihood-Regret</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/microsoft/SimMIM">https://github.com/microsoft/SimMIM</ext-link></p></fn>
<fn id="fn0005"><p><sup>5</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/IcarusWizard/MAE">https://github.com/IcarusWizard/MAE</ext-link></p></fn>
<fn id="fn0006"><p><sup>6</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/advimman/lama">https://github.com/advimman/lama</ext-link></p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahmadian</surname> <given-names>A.</given-names></name> <name><surname>Lindsten</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Likelihood-free out-of-distribution detection with invertible generative models,&#x0201D;</article-title> in <source>IJCAI</source>, 2119&#x02013;2125. <pub-id pub-id-type="doi">10.24963/ijcai.2021/292</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alaluf</surname> <given-names>Y.</given-names></name> <name><surname>Patashnik</surname> <given-names>O.</given-names></name> <name><surname>Cohen-Or</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Restyle: a residual-based stylegan encoder via iterative refinement,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, 6711&#x02013;6720. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00664</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bergamin</surname> <given-names>F.</given-names></name> <name><surname>Mattei</surname> <given-names>P.-A.</given-names></name> <name><surname>Havtorn</surname> <given-names>J. D.</given-names></name> <name><surname>Senetaire</surname> <given-names>H.</given-names></name> <name><surname>Schmutz</surname> <given-names>H.</given-names></name> <name><surname>Maal&#x000F8;e</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Model-agnostic out-of-distribution detection using combined statistical tests,&#x0201D;</article-title> in <source>International Conference on Artificial Intelligence and Statistics</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>10753</fpage>&#x02013;<lpage>10776</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bergman</surname> <given-names>L.</given-names></name> <name><surname>Hoshen</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Classification-based anomaly detection for general data</article-title>. <source>arXiv preprint arXiv:2005.02359</source>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bhat</surname> <given-names>S. F.</given-names></name> <name><surname>Alhashim</surname> <given-names>I.</given-names></name> <name><surname>Wonka</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Adabins: depth estimation using adaptive bins,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>4009</fpage>&#x02013;<lpage>4018</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bishop</surname> <given-names>C. M.</given-names></name></person-group> (<year>1994</year>). <article-title>Novelty detection and neural network validation</article-title>. <source>IEE Proc. Vision Image Sig. Proc</source>. <volume>141</volume>, <fpage>217</fpage>&#x02013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1049/ip-vis:19941330</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Kornblith</surname> <given-names>S.</given-names></name> <name><surname>Swersky</surname> <given-names>K.</given-names></name> <name><surname>Norouzi</surname> <given-names>M.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Big self-supervised models are strong semi-supervised learners,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>22243</fpage>&#x02013;<lpage>22255</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Choi</surname> <given-names>H.</given-names></name> <name><surname>Jang</surname> <given-names>E.</given-names></name> <name><surname>Alemi</surname> <given-names>A. A.</given-names></name></person-group> (<year>2018</year>). <article-title>Waic, but why? Generative ensembles for robust anomaly detection</article-title>. <source>arXiv preprint arXiv:1810.01392</source>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clanuwat</surname> <given-names>T.</given-names></name> <name><surname>Bober-Irizar</surname> <given-names>M.</given-names></name> <name><surname>Kitamoto</surname> <given-names>A.</given-names></name> <name><surname>Lamb</surname> <given-names>A.</given-names></name> <name><surname>Yamamoto</surname> <given-names>K.</given-names></name> <name><surname>Ha</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep learning for classical Japanese literature</article-title>. <source>arXiv preprint arXiv:1812.01718</source>.<pub-id pub-id-type="pmid">32828440</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Denouden</surname> <given-names>T.</given-names></name> <name><surname>Salay</surname> <given-names>R.</given-names></name> <name><surname>Czarnecki</surname> <given-names>K.</given-names></name> <name><surname>Abdelzad</surname> <given-names>V.</given-names></name> <name><surname>Phan</surname> <given-names>B.</given-names></name> <name><surname>Vernekar</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance</article-title>. <source>arXiv preprint arXiv:1812.02765</source>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>Y.</given-names></name> <name><surname>Liao</surname> <given-names>P.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Zhu</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Enlighten-gan for super resolution reconstruction in mid-resolution remote sensing images</article-title>. <source>Rem. Sens</source>. <volume>13</volume>:<fpage>1104</fpage>. <pub-id pub-id-type="doi">10.3390/rs13061104</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graham</surname> <given-names>M. S.</given-names></name> <name><surname>Pinaya</surname> <given-names>W. H.</given-names></name> <name><surname>Tudosiu</surname> <given-names>P.-D.</given-names></name> <name><surname>Nachev</surname> <given-names>P.</given-names></name> <name><surname>Ourselin</surname> <given-names>S.</given-names></name> <name><surname>Cardoso</surname> <given-names>M. J.</given-names></name></person-group> (<year>2022</year>). <article-title>Denoising diffusion models for out-of-distribution detection</article-title>. <source>arXiv preprint arXiv:2211.07740</source>. <pub-id pub-id-type="doi">10.1109/CVPRW59228.2023.00296</pub-id><pub-id pub-id-type="pmid">38228075</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamet</surname> <given-names>P.</given-names></name> <name><surname>Tremblay</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Artificial intelligence in medicine</article-title>. <source>Metabolism</source> <volume>69</volume>, <fpage>S36</fpage>&#x02013;<lpage>S40</lpage>. <pub-id pub-id-type="doi">10.1016/j.metabol.2017.01.011</pub-id><pub-id pub-id-type="pmid">28126242</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Xie</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Masked autoencoders are scalable vision learners,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, 16000&#x02013;16009. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01553</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hendrycks</surname> <given-names>D.</given-names></name> <name><surname>Gimpel</surname> <given-names>K.</given-names></name></person-group> (<year>2016</year>). <article-title>A baseline for detecting misclassified and out-of-distribution examples in neural networks</article-title>. <source>arXiv preprint arXiv:1610.02136</source>.<pub-id pub-id-type="pmid">38090830</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hendrycks</surname> <given-names>D.</given-names></name> <name><surname>Mazeika</surname> <given-names>M.</given-names></name> <name><surname>Dietterich</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep anomaly detection with outlier exposure</article-title>. <source>arXiv preprint arXiv:1812.04606</source>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hendrycks</surname> <given-names>D.</given-names></name> <name><surname>Mazeika</surname> <given-names>M.</given-names></name> <name><surname>Kadavath</surname> <given-names>S.</given-names></name> <name><surname>Song</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Using self-supervised learning can improve model robustness and uncertainty,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 32.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>J.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Denoising diffusion probabilistic models,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>6840</fpage>&#x02013;<lpage>6851</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>R.</given-names></name> <name><surname>Geng</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;On the importance of gradients for detecting distributional shifts in the wild,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>677</fpage>&#x02013;<lpage>689</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karras</surname> <given-names>T.</given-names></name> <name><surname>Aila</surname> <given-names>T.</given-names></name> <name><surname>Laine</surname> <given-names>S.</given-names></name> <name><surname>Lehtinen</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Progressive growing of gans for improved quality, stability, and variation</article-title>. <source>CoRR, abs/1710.10196</source>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karras</surname> <given-names>T.</given-names></name> <name><surname>Laine</surname> <given-names>S.</given-names></name> <name><surname>Aila</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;A style-based generator architecture for generative adversarial networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, 4401&#x02013;4410. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00453</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Dhariwal</surname> <given-names>P.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Glow: generative flow with invertible 1x1 convolutions,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 31.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kirichenko</surname> <given-names>P.</given-names></name> <name><surname>Izmailov</surname> <given-names>P.</given-names></name> <name><surname>Wilson</surname> <given-names>A. G.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Why normalizing flows fail to detect out-of-distribution data,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>20578</fpage>&#x02013;<lpage>20589</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <source>Learning multiple layers of features from tiny images</source>. Technical report.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Imagenet classification with deep convolutional neural networks,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 25.</citation>
</ref>
<ref id="B26">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Cortes</surname> <given-names>C.</given-names></name> <name><surname>Burges</surname> <given-names>C.</given-names></name></person-group> (<year>2010</year>). <article-title>Mnist handwritten digit database</article-title>. <source>ATT Labs</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://yann.lecun.com/exdb/mnist">http://yann.lecun.com/exdb/mnist</ext-link> (accessed July 08, 2023).</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name> <name><surname>Shin</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;A simple unified framework for detecting out-of-distribution samples and adversarial attacks,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 31.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>D.</given-names></name> <name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Goh</surname> <given-names>J.</given-names></name> <name><surname>Ng</surname> <given-names>S.- K.</given-names></name></person-group> (<year>2018</year>). <article-title>Anomaly detection with generative adversarial networks for multivariate time series</article-title>. <source>arXiv preprint arXiv:1809.04758</source>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Srikant</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>Enhancing the reliability of out-of-distribution image detection in neural networks</article-title>. <source>arXiv preprint arXiv:1706.02690</source>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Ren</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>Z.</given-names></name> <name><surname>Zhao</surname> <given-names>Z.</given-names></name></person-group> (<year>2022</year>). <article-title>Pseudo numerical methods for diffusion models on manifolds</article-title>. <source>arXiv preprint arXiv:2202.09778</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lugmayr</surname> <given-names>A.</given-names></name> <name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Romero</surname> <given-names>A.</given-names></name> <name><surname>Yu</surname> <given-names>F.</given-names></name> <name><surname>Timofte</surname> <given-names>R.</given-names></name> <name><surname>Van Gool</surname> <given-names>L.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Repaint: inpainting using denoising diffusion probabilistic models,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, 11461&#x02013;11471. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01117</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maal&#x000F8;e</surname> <given-names>L.</given-names></name> <name><surname>Fraccaro</surname> <given-names>M.</given-names></name> <name><surname>Li&#x000E9;vin</surname> <given-names>V.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;BIVA: a very deep hierarchy of latent variables for generative modeling,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 32.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>C.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Song</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Zhu</surname> <given-names>J.-Y.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;SDEdit: guided image synthesis and editing with stochastic differential equations,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Morningstar</surname> <given-names>W.</given-names></name> <name><surname>Ham</surname> <given-names>C.</given-names></name> <name><surname>Gallagher</surname> <given-names>A.</given-names></name> <name><surname>Lakshminarayanan</surname> <given-names>B.</given-names></name> <name><surname>Alemi</surname> <given-names>A.</given-names></name> <name><surname>Dillon</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Density of states estimation for out of distribution detection,&#x0201D;</article-title> in <source>International Conference on Artificial Intelligence and Statistics</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>3232</fpage>&#x02013;<lpage>3240</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nalisnick</surname> <given-names>E.</given-names></name> <name><surname>Matsukawa</surname> <given-names>A.</given-names></name> <name><surname>Teh</surname> <given-names>Y. W.</given-names></name> <name><surname>Gorur</surname> <given-names>D.</given-names></name> <name><surname>Lakshminarayanan</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Do deep generative models know what they don&#x00027;t know?</article-title> <source>arXiv preprint arXiv:1810.09136</source>.</citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nalisnick</surname> <given-names>E. T.</given-names></name> <name><surname>Matsukawa</surname> <given-names>A.</given-names></name> <name><surname>Teh</surname> <given-names>Y. W.</given-names></name> <name><surname>Lakshminarayanan</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>Detecting out-of-distribution inputs to deep generative models using a test for typicality</article-title>. <source>arXiv preprint arXiv:1906.02994</source>.</citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Netzer</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Coates</surname> <given-names>A.</given-names></name> <name><surname>Bissacco</surname> <given-names>A.</given-names></name> <name><surname>Wu</surname> <given-names>B.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Reading digits in natural images with unsupervised feature learning,&#x0201D;</article-title> in <source>NIPS Workshop on Deep Learning and Unsupervised Feature Learning</source>, 7.</citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nichol</surname> <given-names>A. Q.</given-names></name> <name><surname>Dhariwal</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Improved denoising diffusion probabilistic models,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>8162</fpage>&#x02013;<lpage>8171</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>P. J.</given-names></name> <name><surname>Fertig</surname> <given-names>E.</given-names></name> <name><surname>Snoek</surname> <given-names>J.</given-names></name> <name><surname>Poplin</surname> <given-names>R.</given-names></name> <name><surname>Depristo</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Likelihood ratios for out-of-distribution detection,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 32.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rigano</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>Using artificial intelligence to address criminal justice needs</article-title>. <source>Natl. Inst. Justice J</source>. <volume>280</volume>, <fpage>1</fpage>&#x02013;<lpage>10</lpage>.<pub-id pub-id-type="pmid">36871544</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruff</surname> <given-names>L.</given-names></name> <name><surname>Vandermeulen</surname> <given-names>R. A.</given-names></name> <name><surname>G&#x000F6;rnitz</surname> <given-names>N.</given-names></name> <name><surname>Binder</surname> <given-names>A.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>E.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>K.-R.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Deep semi-supervised anomaly detection</article-title>. <source>arXiv preprint arXiv:1906.02694</source>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>ImageNet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis</source>. <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saharia</surname> <given-names>C.</given-names></name> <name><surname>Ho</surname> <given-names>J.</given-names></name> <name><surname>Chan</surname> <given-names>W.</given-names></name> <name><surname>Salimans</surname> <given-names>T.</given-names></name> <name><surname>Fleet</surname> <given-names>D. J.</given-names></name> <name><surname>Norouzi</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Image super-resolution via iterative refinement</article-title>. <source>IEEE Trans. Patt. Analy. Mach. Intell</source>. <volume>45</volume>, <fpage>4713</fpage>&#x02013;<lpage>4726</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2022.3204461</pub-id><pub-id pub-id-type="pmid">36094974</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakurada</surname> <given-names>M.</given-names></name> <name><surname>Yairi</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Anomaly detection using autoencoders with nonlinear dimensionality reduction,&#x0201D;</article-title> in <source>Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis</source>, 4&#x02013;11. <pub-id pub-id-type="doi">10.1145/2689746.2689747</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Salimans</surname> <given-names>T.</given-names></name> <name><surname>Karpathy</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Kingma</surname> <given-names>D. P.</given-names></name></person-group> (<year>2017</year>). <article-title>PixelCNN&#x0002B;&#x0002B;: improving the pixelcnn with discretized logistic mixture likelihood and other modifications</article-title>. <source>arXiv preprint arXiv:1701.05517</source>.</citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schlegl</surname> <given-names>T.</given-names></name> <name><surname>Seeb&#x000F6;ck</surname> <given-names>P.</given-names></name> <name><surname>Waldstein</surname> <given-names>S. M.</given-names></name> <name><surname>Schmidt-Erfurth</surname> <given-names>U.</given-names></name> <name><surname>Langs</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,&#x0201D;</article-title> in <source>International Conference on Information Processing in Medical Imaging</source> (<publisher-loc>Springer</publisher-loc>), <fpage>146</fpage>&#x02013;<lpage>157</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-59050-9_12</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sehwag</surname> <given-names>V.</given-names></name> <name><surname>Chiang</surname> <given-names>M.</given-names></name> <name><surname>Mittal</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>SSD: a unified framework for self-supervised outlier detection</article-title>. <source>arXiv preprint arXiv:2103.12051</source>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serr&#x000E0;</surname> <given-names>J.</given-names></name> <name><surname>&#x000C1;lvarez</surname> <given-names>D.</given-names></name> <name><surname>G&#x000F3;mez</surname> <given-names>V.</given-names></name> <name><surname>Slizovskaia</surname> <given-names>O.</given-names></name> <name><surname>N&#x000FA;&#x000F1;ez</surname> <given-names>J. F.</given-names></name> <name><surname>Luque</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Input complexity and out-of-distribution detection with likelihood-based generative models</article-title>. <source>arXiv preprint arXiv:1909.11480</source>.</citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sohl-Dickstein</surname> <given-names>J.</given-names></name> <name><surname>Weiss</surname> <given-names>E.</given-names></name> <name><surname>Maheswaranathan</surname> <given-names>N.</given-names></name> <name><surname>Ganguli</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Deep unsupervised learning using nonequilibrium thermodynamics,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>2256</fpage>&#x02013;<lpage>2265</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Ermon</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Generative modeling by estimating gradients of the data distribution,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> 32.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Sohl-Dickstein</surname> <given-names>J.</given-names></name> <name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Kumar</surname> <given-names>A.</given-names></name> <name><surname>Ermon</surname> <given-names>S.</given-names></name> <name><surname>Poole</surname> <given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>Score-based generative modeling through stochastic differential equations</article-title>. <source>arXiv preprint arXiv:2011.13456</source>.</citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suvorov</surname> <given-names>R.</given-names></name> <name><surname>Logacheva</surname> <given-names>E.</given-names></name> <name><surname>Mashikhin</surname> <given-names>A.</given-names></name> <name><surname>Remizova</surname> <given-names>A.</given-names></name> <name><surname>Ashukha</surname> <given-names>A.</given-names></name> <name><surname>Silvestrov</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Resolution-robust large mask inpainting with fourier convolutions,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>, 2149&#x02013;2159. <pub-id pub-id-type="doi">10.1109/WACV51458.2022.00323</pub-id><pub-id pub-id-type="pmid">37235458</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tack</surname> <given-names>J.</given-names></name> <name><surname>Mo</surname> <given-names>S.</given-names></name> <name><surname>Jeong</surname> <given-names>J.</given-names></name> <name><surname>Shin</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;CSI: novelty detection via contrastive learning on distributionally shifted instances,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> <fpage>11839</fpage>&#x02013;<lpage>11852</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Toda</surname> <given-names>R.</given-names></name> <name><surname>Teramoto</surname> <given-names>A.</given-names></name> <name><surname>Kondo</surname> <given-names>M.</given-names></name> <name><surname>Imaizumi</surname> <given-names>K.</given-names></name> <name><surname>Saito</surname> <given-names>K.</given-names></name> <name><surname>Fujita</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>Lung cancer CT image generation from a free-form sketch using style-based pix2pix for data augmentation</article-title>. <source>Sci. Rep</source>. <volume>12</volume>:<fpage>12867</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-022-16861-5</pub-id><pub-id pub-id-type="pmid">35896575</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Feng</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;ViM: out-of-distribution with virtual-logit matching,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, 4921&#x02013;4930. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00487</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Simoncelli</surname> <given-names>E. P.</given-names></name> <name><surname>Bovik</surname> <given-names>A. C.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;Multiscale structural similarity for image quality assessment,&#x0201D;</article-title> in <source>The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>1398</fpage>&#x02013;<lpage>1402</lpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watson</surname> <given-names>D.</given-names></name> <name><surname>Chan</surname> <given-names>W.</given-names></name> <name><surname>Ho</surname> <given-names>J.</given-names></name> <name><surname>Norouzi</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Learning fast samplers for diffusion models by differentiating through sample quality,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wizadwongsa</surname> <given-names>S.</given-names></name> <name><surname>Suwajanakorn</surname> <given-names>S.</given-names></name></person-group> (<year>2023</year>). <article-title>Accelerating guided diffusion sampling with splitting numerical methods</article-title>. <source>arXiv preprint arXiv:2301.11558</source>.</citation>
</ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xia</surname> <given-names>Y.</given-names></name> <name><surname>Cao</surname> <given-names>X.</given-names></name> <name><surname>Wen</surname> <given-names>F.</given-names></name> <name><surname>Hua</surname> <given-names>G.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Learning discriminative reconstructions for unsupervised outlier removal,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source>, 1511&#x02013;1519. <pub-id pub-id-type="doi">10.1109/ICCV.2015.177</pub-id></citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>H.</given-names></name> <name><surname>Rasul</surname> <given-names>K.</given-names></name> <name><surname>Vollgraf</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</article-title>. <source>CoRR, abs/1708.07747</source>.</citation>
</ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>Z.</given-names></name> <name><surname>Yan</surname> <given-names>Q.</given-names></name> <name><surname>Amit</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Likelihood regret: an out-of-distribution detection score for variational auto-encoder,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>20685</fpage>&#x02013;<lpage>20696</lpage>.</citation>
</ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>Z.</given-names></name> <name><surname>Yan</surname> <given-names>Q.</given-names></name> <name><surname>Amit</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Do we really need to learn representations from in-domain data for outlier detection?</article-title> <source>arXiv preprint arXiv:2105.09270</source>.</citation>
</ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Bao</surname> <given-names>J.</given-names></name> <name><surname>Yao</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;SimMIM: a simple framework for masked image modeling,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, 9653&#x02013;9663. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00943</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Hong</surname> <given-names>S.</given-names></name> <name><surname>Xu</surname> <given-names>R.</given-names></name> <name><surname>Zhao</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Diffusion models: a comprehensive survey of methods and applications</article-title>. <source>arXiv preprint arXiv:2209.00796</source>.</citation>
</ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>F.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Song</surname> <given-names>S.</given-names></name> <name><surname>Seff</surname> <given-names>A.</given-names></name> <name><surname>Xiao</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>LSUN: construction of a large-scale image dataset using deep learning with humans in the loop</article-title>. <source>arXiv preprint arXiv:1506.03365</source>.</citation>
</ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>K. A.</given-names></name> <name><surname>Cuesta-Infante</surname> <given-names>A.</given-names></name> <name><surname>Xu</surname> <given-names>L.</given-names></name> <name><surname>Veeramachaneni</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>SteganoGAN: high capacity image steganography with GANs</article-title>. <source>arXiv preprint arXiv:1901.03892</source>.</citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name> <name><surname>Shechtman</surname> <given-names>E.</given-names></name> <name><surname>Wang</surname> <given-names>O.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;The unreasonable effectiveness of deep features as a perceptual metric,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, 586&#x02013;595. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00068</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>C.</given-names></name> <name><surname>Paffenroth</surname> <given-names>R. C.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Anomaly detection with robust deep autoencoders,&#x0201D;</article-title> in <source>Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>, 665&#x02013;674. <pub-id pub-id-type="doi">10.1145/3097983.3098052</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zong</surname> <given-names>B.</given-names></name> <name><surname>Song</surname> <given-names>Q.</given-names></name> <name><surname>Min</surname> <given-names>M. R.</given-names></name> <name><surname>Cheng</surname> <given-names>W.</given-names></name> <name><surname>Lumezanu</surname> <given-names>C.</given-names></name> <name><surname>Cho</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Deep autoencoding gaussian mixture model for unsupervised anomaly detection,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation>
</ref>
</ref-list>
</back>
</article>