Probabilistic Noise2Void: Unsupervised Content-Aware Denoising

Today, Convolutional Neural Networks (CNNs) are the leading method for image denoising. They are traditionally trained on pairs of images, which are often hard to obtain for practical applications. This motivates self-supervised training methods such as Noise2Void (N2V) that operate on single noisy images. Self-supervised methods are, unfortunately, not competitive with models trained on image pairs. Here, we present 'Probabilistic Noise2Void' (PN2V), a method to train CNNs to predict per-pixel intensity distributions. Combining these with a suitable description of the noise, we obtain a complete probabilistic model for the noisy observations and true signal in every pixel. We evaluate PN2V on publicly available microscopy datasets, under a broad range of noise regimes, and achieve competitive results with respect to supervised state-of-the-art methods.


Introduction
Image restoration is the problem of reconstructing an image from a corrupted version of itself. Recent work shows how CNNs can be used to build powerful content-aware image restoration (CARE) pipelines [11,10,12,13,6,4,1,5]. However, for supervised CARE models, such as [10], pairs of clean and noisy images are required.
For many application areas, it is impractical or impossible to acquire clean ground-truth images [2]. In such cases, Noise2Noise (N2N) training [6] relaxes the problem, only requiring two noisy instances of the same data. Unfortunately, even the acquisition of two noisy realizations of the same image content is often difficult [2]. Self-supervised training methods, such as Noise2Void (N2V) [4], are a promising alternative, as they operate exclusively on single noisy images [4,1,5]. This is enabled by excluding/masking the center (blind-spot) of the network's receptive fields. Self-supervised training assumes that the noise is pixel-wise independent and that the true intensity of a pixel can be predicted from local image context, excluding the aforementioned blind-spots [4]. For many applications, especially in the context of microscopy images, the first assumption is fulfilled, but the second assumption offers room for improvement [5].
Hence, self-supervised models often cannot compete with supervised training [4]. In concurrent work by Laine et al. [5], this problem was elegantly addressed by assuming a Gaussian noise model and predicting Gaussian intensity distributions per pixel. The authors also showed that the same approach can be applied to other noise distributions, which can be approximated as Gaussian, or can be described analytically.
Here, we introduce a new training approach called Probabilistic Noise2Void (PN2V). Similar to [5], PN2V proposes a way to leverage information of the network's blind-spots. However, PN2V is not restricted to Gaussian noise models or Gaussian intensity predictions. More precisely, to compute the posterior distribution of a pixel, we combine (i) a general noise model that can be represented as a histogram (observation likelihood), and (ii) a distribution of possible true pixel intensities (prior), represented by a set of predicted samples.
Having this complete probabilistic model for each pixel, we are now free to choose which statistical estimator to employ. In this work we use MMSE estimates for our final predictions and show that MMSE-PN2V consistently outperforms other self-supervised methods and, in many cases, leads to results that are competitive even with supervised state-of-the-art CARE networks (see below).

Background
Image Formation and the Denoising Task: An image $x = (x_1, \dots, x_n)$ is the corrupted version of a clean image (signal) $s = (s_1, \dots, s_n)$. Our goal is to recover the original signal from $x$, thus implementing a function $f(x) = \hat{s} \approx s$.
In this paper, we assume that each observed pixel value $x_i$ is independently drawn from the conditional distribution $p(x_i|s_i)$ such that
$$p(x|s) = \prod_{i=1}^{n} p(x_i|s_i). \tag{1}$$
We will refer to $p(x_i|s_i)$ as observation likelihood. It is described by an arbitrary noise model.
Traditional Training and Noise2Noise: The function $f(x)$ can be implemented by a Fully Convolutional Network (FCN) [7] (see e.g. [11,10,12,6]), a type of CNN that takes an image as input and produces an entire (in this case denoised) image as output. However, in this setup every predicted output pixel $\hat{s}_i$ depends only on a limited receptive field $x_{\mathrm{RF}(i)}$, i.e. a patch of input pixels surrounding it. FCN based image denoising in fact implements $f(x)$ by producing independent predictions $\hat{s}_i = g(x_{\mathrm{RF}(i)}; \theta) \approx s_i$ for each pixel $i$, depending only on $x_{\mathrm{RF}(i)}$ instead of on the entire image. The prediction is parametrized by the weights $\theta$ of the network. In traditional training, $\theta$ are learned from pairs of noisy images $x^j$ and corresponding clean training images $s^j$, which provide training examples $(x^j_{\mathrm{RF}(i)}, s^j_i)$ consisting of noisy input patches $x^j_{\mathrm{RF}(i)}$ and their corresponding clean target values $s^j_i$.
The parameters $\theta$ are traditionally tuned to minimize an empirical risk function such as the average squared distance over all training images $j$ and pixels $i$,
$$\arg\min_{\theta} \sum_j \sum_i \left(\hat{s}^j_i - s^j_i\right)^2. \tag{2}$$
In Noise2Noise [6], Lehtinen et al. show that clean data is in fact not necessary for training and that the same training scheme can be used with noisy data alone. Noise2Noise uses pairs of corresponding noisy training images $x^j$ and $x'^j$, which are based on the same signal $s^j$, but are corrupted independently by noise (see Eq. 1). Such pairs can for example be acquired by imaging a static sample twice.
The training examples $(x^j_{\mathrm{RF}(i)}, x'^j_i)$ then consist of an input patch $x^j_{\mathrm{RF}(i)}$ cropped from the first image $x^j$ and the noisy target $x'^j_i$ extracted from the patch center in the second one $x'^j$. It is of course impossible for the network to predict the noisy pixel value $x'^j_i$ from the independently corrupted input $x^j_{\mathrm{RF}(i)}$. However, assuming the noise is zero centered, i.e. $\mathbb{E}[x'^j_i] = s^j_i$, the best achievable prediction is the clean signal $s^j_i$ and the network will learn to denoise the images it is presented with.
Noise2Void Training: In Noise2Void, Krull et al. [4] show that training is still possible when not even noisy training pairs are available. They use single images to extract input and target for their networks. If this were done naively, the network would simply learn the identity transformation, directly outputting the value at the center of each pixel's receptive field. Krull et al. address the issue by effectively removing the central pixel from the network's receptive field. To achieve this, they mask the pixel during training, replacing it with a random value from the vicinity. Thus, a Noise2Void trained network can be seen as a function $\hat{s}_i = \tilde{g}(\tilde{x}_{\mathrm{RF}(i)}; \theta) \approx s_i$, making a prediction for a single pixel based on the modified patch $\tilde{x}_{\mathrm{RF}(i)}$ that excludes the central pixel. Such a network can no longer describe the identity, and can be trained from single noisy images.
However, this ability comes at a price. The accuracy of the predictions is reduced, as the network has to exclude the central pixel of its receptive field, thus having less information available.
To allow efficient training of a CNN with Noise2Void, Krull et al. simultaneously mask multiple pixels in larger training patches and jointly calculate their gradients.
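The masking step described above can be illustrated with a minimal NumPy sketch. It is not the implementation from [4]: the function names `n2v_mask_patch` and `masked_mse`, the neighborhood `radius`, and the uniform choice of replacement neighbor are assumptions made purely for illustration.

```python
import numpy as np

def n2v_mask_patch(patch, num_masked, radius=2, rng=None):
    """Mask `num_masked` pixels of a 2D patch for blind-spot training.

    Each selected pixel is overwritten with the value of a randomly chosen
    neighbor (never the pixel itself) taken from the ORIGINAL patch.
    `radius` and the uniform neighbor choice are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = patch.shape
    assert min(h, w) >= 2, "patch must contain at least one valid neighbor"
    masked = patch.copy()
    # Draw distinct pixel positions to mask.
    flat = rng.choice(h * w, size=num_masked, replace=False)
    rows, cols = np.unravel_index(flat, (h, w))
    for r, c in zip(rows, cols):
        while True:
            dr, dc = rng.integers(-radius, radius + 1, size=2)
            rr, cc = r + dr, c + dc
            # Reject the pixel itself and positions outside the patch.
            if (dr != 0 or dc != 0) and 0 <= rr < h and 0 <= cc < w:
                break
        masked[r, c] = patch[rr, cc]
    return masked, np.stack([rows, cols], axis=1)

def masked_mse(prediction, noisy_patch, coords):
    """Squared-error loss evaluated only at the masked positions."""
    r, c = coords[:, 0], coords[:, 1]
    return float(np.mean((prediction[r, c] - noisy_patch[r, c]) ** 2))
```

In such a scheme the network would receive `masked` as input, while the loss compares its output with the original noisy values at `coords` only, so that all masked pixels of a patch contribute to one joint gradient.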

Method
Maximum Likelihood Training: In PN2V, we build on the idea of masking pixels [4] to obtain a prediction from the modified receptive field $\tilde{x}_{\mathrm{RF}(i)}$. However, instead of directly predicting an estimate for each pixel value, PN2V trains a CNN to describe a probability distribution
$$p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta). \tag{3}$$
We will refer to $p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta)$ as prior, as it describes our knowledge of the pixel's signal considering only its surroundings, but not the observation at the pixel itself $x_i$, since it has been excluded from $\tilde{x}_{\mathrm{RF}(i)}$. We choose a sample based representation for this prior, which will be discussed below.
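Remembering that the observed pixel values are drawn independently (Eq. 1), we can combine Eq. 3 with our noise model, and obtain the joint distribution
$$p(x_i, s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta) = p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta)\, p(x_i \mid s_i). \tag{4}$$
By integrating over all possible clean signals, we can derive the probability of observing the pixel value $x_i$, given we know its surroundings $\tilde{x}_{\mathrm{RF}(i)}$:
$$p(x_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta) = \int p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta)\, p(x_i \mid s_i)\, ds_i. \tag{5}$$
We can now view CNN training as an unsupervised learning task. Following the maximum likelihood approach, we tune $\theta$ to minimize
$$\sum_j \sum_i -\ln p(x^j_i \mid \tilde{x}^j_{\mathrm{RF}(i)}; \theta) = \sum_j \sum_i -\ln \int p(s_i \mid \tilde{x}^j_{\mathrm{RF}(i)}; \theta)\, p(x^j_i \mid s_i)\, ds_i. \tag{6}$$
Note that in order to improve readability, we from here on omit the index $j$, and refrain from explicitly referring to the training image.
Sample Based Prior: To allow an efficient optimization of Eq. 6 we choose a sample based representation of our prior $p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta)$. For every pixel $i$, our network directly predicts $K = 800$ output values $s^k_i$, which we interpret as independent samples, drawn from $p(s_i \mid \tilde{x}_{\mathrm{RF}(i)}; \theta)$. We can now approximate Eq. 6 as
$$\sum_i -\ln\left(\frac{1}{K} \sum_{k=1}^{K} p(x_i \mid s^k_i)\right). \tag{7}$$
During training we use Eq. 7 as loss function. Note that the summation over $k$ can be efficiently performed on the GPU. Since every sample $s^k_i$ is effectively a function of the parameters $\theta$, we can calculate the derivative with respect to any network parameter $\theta_l$ as
$$\frac{\partial}{\partial \theta_l} \sum_i -\ln\left(\frac{1}{K} \sum_{k=1}^{K} p(x_i \mid s^k_i)\right) = -\sum_i \frac{\sum_{k=1}^{K} \frac{\partial p(x_i \mid s^k_i)}{\partial s^k_i} \frac{\partial s^k_i}{\partial \theta_l}}{\sum_{k=1}^{K} p(x_i \mid s^k_i)}. \tag{8}$$
Minimum Mean Squared Error (MMSE) Inference: Assuming our network is sufficiently trained, we are now interested in processing images and finding sensible estimates for every pixel's signal $s_i$. Based on our probabilistic model, we derive the MMSE estimate, which is defined as
$$s^{\mathrm{MMSE}}_i = \arg\min_{\hat{s}_i} \int p(s_i \mid x_{\mathrm{RF}(i)})\, (\hat{s}_i - s_i)^2\, ds_i = \mathbb{E}_{p(s_i \mid x_{\mathrm{RF}(i)})}[s_i], \tag{9}$$
where $p(s_i \mid x_{\mathrm{RF}(i)})$ is the posterior distribution of the signal given the complete surrounding patch. The posterior is proportional to the joint distribution given in Eq. 4. We can thus approximate $s^{\mathrm{MMSE}}_i$ by weighting our predicted samples with the corresponding observation likelihood and calculating their average
$$s^{\mathrm{MMSE}}_i \approx \frac{\sum_{k=1}^{K} p(x_i \mid s^k_i)\, s^k_i}{\sum_{k=1}^{K} p(x_i \mid s^k_i)}. \tag{10}$$
Figure 1 illustrates the process and shows the involved distributions for a concrete pixel in a real example.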
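To make the sample based loss and the MMSE estimate concrete, the following NumPy sketch evaluates both for a handful of pixels. It is a hedged illustration rather than the authors' code: the function names are hypothetical, the small constant `eps` is added only for numerical safety, and the Gaussian `gaussian_likelihood` is merely a stand-in for the observation likelihood $p(x_i|s_i)$, which in PN2V can be an arbitrary noise model.

```python
import numpy as np

def gaussian_likelihood(x, s, sigma=10.0):
    """Stand-in for p(x_i | s_i); a Gaussian is assumed here for illustration only."""
    return np.exp(-0.5 * ((x - s) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def pn2v_loss(x, samples, likelihood, eps=1e-12):
    """Sample based negative log-likelihood: sum_i -ln( mean_k p(x_i | s_i^k) ).

    x:       observed noisy values, shape (n,)
    samples: predicted prior samples s_i^k, shape (K, n)
    """
    lik = likelihood(x[None, :], samples)            # shape (K, n)
    return float(-np.log(lik.mean(axis=0) + eps).sum())

def mmse_estimate(x, samples, likelihood, eps=1e-12):
    """Likelihood-weighted average of the prior samples for every pixel."""
    lik = likelihood(x[None, :], samples)            # shape (K, n)
    return (lik * samples).sum(axis=0) / (lik.sum(axis=0) + eps)

# Usage: two pixels, K = 2 prior samples each.
x = np.array([100.0, 50.0])
samples = np.array([[90.0, 40.0],
                    [110.0, 60.0]])
loss = pn2v_loss(x, samples, gaussian_likelihood)
estimate = mmse_estimate(x, samples, gaussian_likelihood)
```

Samples that are unlikely to have produced the observation $x_i$ receive a small weight, so the estimate is pulled towards prior samples that agree with the observed value. In actual training, the loss would be written in an automatic differentiation framework so that the gradient with respect to the network parameters is obtained automatically.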

Experiments
The results of our experiments can be found in Table 1. In Figure 2 we provide qualitative results on realistic test images.
Datasets: We evaluate PN2V on datasets provided by Zhang et al. in [13]. Since PN2V is not yet implemented for multi-channel images, we use all available single-channel datasets. These datasets are recorded with different samples and under different imaging conditions. Each of them consists of a total of 20 fields of view (FOVs). One FOV is reserved for testing. The other 19 are used for training and validation.
For each FOV, the data is composed of 50 raw microscopy images, each containing different noise realizations of the same static sample. For every FOV, Zhang et al. additionally simulate four reduced noise regimes (NRs) by averaging different subsets of 2, 4, 8, and 16 raw images [13]. We will refer to the raw images as NR1 and to the regimes created through averaging 2, 4, 8, and 16 images as NR2, NR3, NR4, and NR5, respectively.
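The construction of the reduced noise regimes can be sketched as follows. This is a simplified illustration and not the procedure of [13]: the function name `simulate_noise_regimes` is hypothetical, and only a single subset (the first $m$ frames) is averaged per regime, whereas the dataset averages different subsets to obtain multiple images per regime.

```python
import numpy as np

def simulate_noise_regimes(raw_stack, sizes=(2, 4, 8, 16)):
    """Build one example image per noise regime from a stack of raw frames.

    raw_stack: array of shape (num_frames, H, W) holding noisy acquisitions
               of the same static field of view.
    Assumption: the first m frames are averaged; the subset choice is illustrative.
    """
    regimes = {"NR1": raw_stack[0]}                  # raw, unaveraged frame
    for index, m in enumerate(sizes, start=2):
        regimes[f"NR{index}"] = raw_stack[:m].mean(axis=0)
    return regimes

# Usage with synthetic data: constant signal plus independent unit-variance noise.
rng = np.random.default_rng(0)
stack = rng.normal(loc=0.0, scale=1.0, size=(50, 64, 64))
regimes = simulate_noise_regimes(stack)
```

For independent, zero-mean noise, averaging $m$ frames reduces the noise standard deviation by a factor of $\sqrt{m}$, so the regime built from 16 frames carries roughly a quarter of the raw noise level in this synthetic example.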
We find that in one of the datasets (Two-Photon Mice) the average pixel intensity fluctuates heavily over the course of the 50 images, even though it should be approximately constant for each FOV. Considering that a single ground truth image (the average) is used for the evaluation on all 50 images, this leads to fluctuations and distortions in the calculated PSNR values, which are also reflected in the comparatively high standard errors of the mean (SEMs) for all methods, see Table 1.
To account for this inconsistency in the data, we additionally use a variant of the PSNR calculation that is invariant to arbitrary shifts and linear transformations in the ground truth signal. These values are marked by an asterisk (*). Details can also be found in the supplementary material.
Acquiring Noise Models: In our experiments, we use a histogram based method to measure and describe the noise distribution $p(x_i|s_i)$. We start with corresponding pairs of clean $s^j$ and noisy $x^j$ images. Here, we use the available training data from [13] for this purpose. However, in general these images could show an arbitrary test pattern that covers the desired range of values and do not have to resemble the sample we are interested in. We construct a 2D histogram (256×256 bins), with the y- and x-axis corresponding to the clean $s^j_i$ and noisy pixel values $x^j_i$, respectively. By normalizing every row, we obtain a probability distribution for every signal. Considering Eq. 7, we require our model to be differentiable with respect to the $s_i$. To ensure this differentiability, we linearly interpolate along the y-axis of the normalized histogram, obtaining a model for $p(x_i|s_i)$ that is continuous in $s_i$.
Evaluated Denoising Models/Methods: To put the denoising results of PN2V into perspective, we compare to various state-of-the-art baselines, including the strongest published numbers on the datasets.
U-Net (PN2V): We use a standard U-Net [9]. Our network has a U-Net depth of 3, 1 input channel, and K = 800 output channels, which are interpreted as samples. We use an initial feature channel number of 64 in the first U-Net layer. We train our network separately for each NR in each dataset. We use the same masking technique as [4]. Further details and training parameters can be found in the supplementary material.
U-Net (N2V): We use the same network architecture as for U-Net (PN2V) but modify the output layer to produce only a single prediction instead of K = 800. The network is trained using the Noise2Void scheme as described in [4]. All training parameters are identical to U-Net (PN2V).
U-Net (trad.): We use the exact same architecture as U-Net (N2V), but train the network using the available ground-truth data and the standard MSE loss (see Eq. 2). All training parameters are identical to U-Net (PN2V) and U-Net (N2V).
VST+BM3D: Numbers are taken from [13]. The authors fit a Poisson-Gaussian noise model to the data and then apply a combination of variance-stabilizing transformation (VST) [8] and BM3D filtering [3].
DnCNN: Numbers are taken from [13]. DnCNN [12] is an established CNN based denoising architecture that is trained in a supervised fashion.
N2N: Numbers are taken from [13]. The authors train a network according to the N2N scheme, using an architecture similar to the one presented in [6].
Table 1. Results of PN2V and baseline methods on three datasets from [13]. Comparisons are performed on five noise regimes (NR1-NR5). Numbers report PSNR (dB) ± 2 SEM, averaged over all 50 images in each NR. We group all supervised/non-supervised methods and mark the highest values in bold. Rows marked by an asterisk (*) use a scale- and shift-invariant PSNR calculation to address inconsistent acquisitions in the Two-Photon Mice dataset (see main text).
Comp. times: All CNN based methods required below 1s per image (NVIDIA TITAN Xp); VST+BM3D required on avg. 6.22s.
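The histogram based noise model described under "Acquiring Noise Models" can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the intensity range `value_range`, and the nearest-bin lookup along the noisy axis are choices made here for the example.

```python
import numpy as np

def build_noise_histogram(clean, noisy, bins=256, value_range=(0.0, 256.0)):
    """2D histogram with rows indexed by clean signal and columns by noisy value.

    Every populated row is normalized to sum to one, giving an empirical
    distribution over noisy observations for that signal bin.
    """
    hist, _, _ = np.histogram2d(clean.ravel(), noisy.ravel(), bins=bins,
                                range=[value_range, value_range])
    row_sums = hist.sum(axis=1, keepdims=True)
    # Rows without any observations are left as zeros.
    return np.divide(hist, row_sums, out=np.zeros_like(hist), where=row_sums > 0)

def noise_likelihood(hist, x, s, value_range=(0.0, 256.0)):
    """Evaluate p(x | s), linearly interpolated between neighboring signal rows."""
    x, s = np.asarray(x, dtype=float), np.asarray(s, dtype=float)
    bins = hist.shape[0]
    lo, hi = value_range
    width = (hi - lo) / bins
    # Column of the noisy observation (nearest bin, assumption of this sketch).
    col = np.clip(((x - lo) / width).astype(int), 0, bins - 1)
    # Continuous row position measured between bin centers along the signal axis.
    pos = np.clip((s - lo) / width - 0.5, 0.0, bins - 1.0)
    row0 = np.floor(pos).astype(int)
    row1 = np.minimum(row0 + 1, bins - 1)
    frac = pos - row0
    return (1.0 - frac) * hist[row0, col] + frac * hist[row1, col]

# Usage with synthetic pairs: integer-valued signal plus Gaussian noise (illustrative).
rng = np.random.default_rng(0)
clean = rng.integers(50, 200, size=200_000).astype(float)
noisy = clean + rng.normal(0.0, 5.0, size=clean.shape)
hist = build_noise_histogram(clean, noisy)
```

The interpolated model is continuous and piecewise linear in the signal, which is what allows gradients to flow from the loss back to the predicted samples.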

Discussion
We have introduced PN2V, a fully probabilistic approach extending self-supervised CARE training. PN2V makes use of an arbitrary noise model which can be determined by analyzing any set of available images that are subject to the same type of noise. This is a decisive advantage compared to state-of-the-art supervised methods and allows PN2V to be used for many practical applications. The much improved performance of PN2V consistently exceeds that of other self-supervised training methods and can often compete with state-of-the-art supervised methods. We see a plethora of unique applications for PN2V, for example in challenging low-light conditions, where noise typically is the limiting factor for downstream analysis.