A lifted Bregman formulation for the inversion of deep neural networks

We propose a novel framework for the regularized inversion of deep neural networks. The framework is based on the authors' recent work on training feed-forward neural networks without the differentiation of activation functions. The framework lifts the parameter space into a higher dimensional space by introducing auxiliary variables, and penalizes these variables with tailored Bregman distances. We propose a family of variational regularizations based on these Bregman distances, present theoretical results and support their practical application with numerical examples. In particular, we present the first convergence result (to the best of our knowledge) for the regularized inversion of a single-layer perceptron that only assumes that the solution of the inverse problem is in the range of the regularization operator, and that shows that the regularized inverse provably converges to the true inverse if measurement errors converge to zero.


INTRODUCTION
Neural networks are computing systems that have revolutionised a wide range of research domains over the past decade and outperformed many traditional machine learning approaches (cf.(LeCun et al., 2015;Goodfellow et al., 2016)).This performance often comes at the cost of interpretability (or rather a lack thereof) of the outputs that a neural network produces for given inputs.As a consequence, a lot of research has focused on understanding representations of neural networks and on developing strategies to interpret these representations, predominantly with saliency maps (Simonyan et al., 2013;Fong and Vedaldi, 2017;Chang et al., 2018;Fong et al., 2019).An alternative approach focuses on understanding deep image representations by inverting them (Mahendran and Vedaldi, 2015).The authors propose a total-variationbased variational optimisation method that aims to infer the network input from the network output with regularised inversion.
While several approaches for the inversion of neural networks have been proposed especially in the context of generative modelling (see for example (Behrmann et al., 2019(Behrmann et al., , 2021) ) in the context of normalising flows, (Xia et al., 2022) in the context of generative adversarial networks and (Gal et al., 2022) in the context of probabilistic diffusion models), an important aspect, which is often overlooked, is that invertible operations alone are not automatically stable with respect to small variations in the data.For example, computing the solution of the heat equation after a fixed termination time is stable with respect to variations in the initial condition, but estimating the initial condition from the terminal condition of the heat equation is not stable with respect to perturbations in the terminal condition.This issue cannot be resolved without approximation of the inverse with a family of continuous operators, also known as regularisation.The research field of Inverse and Ill-posed Problems and its branch Regularisation Theory focus strongly on the stable approximation of ill-posed and ill-conditioned inverses via regularisations (Engl et al., 1996) and so-called variational regularisations (Scherzer et al., 2009;Benning and Burger, 2018) that are a special class of (nonlinear) regularisations.The optimisation model proposed in (Mahendran and Vedaldi, 2015) can be considered as a variational regularisation method with total variation regularisation; however, the work in (Mahendran and Vedaldi, 2015) is purely empirical, and to the best of our knowledge no works exist that rigorously prove that the proposed approach is a variational regularisation.
In this work, we propose a novel regularisation framework based on lifting with tailored Bregman distances and prove that the proposed framework is a convergent variational regularisation for the inverse problem of estimating the inputs from single-layer perceptrons or the inverse problem of estimating hidden variables in a multi-layer perceptron sequentially.While there has been substantial work in previous years that focuses on utilising neural networks as nonlinear operators in variational regularisation methods (Lunz et al., 2018;Arridge et al., 2019;Schwab et al., 2019;Li et al., 2020;Mukherjee et al., 2021), this is the first work that provides theoretical guarantees for the stable, model-based inversion of neural networks to the best of our knowledge.
Our contributions are three-fold.1) We propose a novel framework for the regularised inversion of multi-layer perceptrons, respectively feed-forward neural networks, that is based on the lifted Bregman framework recently proposed by the authors in Wang and Benning (2022).2) We show that for the single-layer perceptron case, the proposed variational regularisation approach is a provably convergent regularisation under very mild assumptions.To our knowledge, this is the first time that an inversion method has been proposed that does not just allow to perform inversion empirically, but for which we can prove that the proposed method is a convergent regularisation method without overly restrictive assumptions such as differentiability of the activation function and the presence of a tangential cone condition.3) We propose a proximal first-order optimisation strategy to solve the proposed variational regularisation method and present several numerical examples that support the effectiveness of the proposed model-based regularisation approach.
The paper is structured as follows.In Section 2 we introduce the lifted Bregman formulation for the model-based inversion of feed-forward neural networks.In Section 3 we prove that for the single-layer perceptron case the proposed model is a convergent variational regularisation method and provide general error estimates as well as error estimates for a concrete example of a perceptron with ReLU activation function.In Section 4 we discuss how to implement the proposed variational regularisation computationally for both the single-layer and multi-layer perceptron setting with a generalisation of the primal-dual hybrid gradient method and coordinate descent.Subsequently, we present numerical results that demonstrate empirically that the proposed approach is a model-based regularisation in Section 5, before we conclude this work with a brief section on conclusions and outlook in Section 6.

MODEL-BASED INVERSION OF FEED-FORWARD NETWORKS
Suppose we are given an L-layer feed-forward neural network N : R n × P → R m of the form for input data x ∈ R n and pre-trained parameters Θ ∈ P. Here, {σ l } L l=1 denotes the collection of nonlinear activation functions and f denotes a generic function parametrised by parameters {Θ l } L l=1 .For ease of notation, we use Θ to refer to all parameters {Θ l } L l=1 .For a given network output y ∈ R m , our goal is to solve the inverse problem N (x, Θ) = y for the unknown input x ∈ R n .We propose to approximate the inverse of this nonlinear, potentially ill-posed inverse problem via the minimisation of a lifted Bregman formulation of the form where we assume x 0 = x and x L = y δ for simplicity of notation.The data y δ is a perturbed version of y, for which we assume ).The functions B Ψ l for l = 1, . . ., L are defined as for a proper, convex and lower semi-continuous function Ψ l : R n l → R ∪ {∞}.The notation Last but not least, the function R : R n → R ∪ {∞} is a proper, convex, and lower semi-continuous function that enables us to incorporate a-priori information into the inversion process.The impact of this is controlled by the parameter α > 0.
Please note that the functions B Ψ l have some useful properties and are directly connected to the chosen activation functions {σ l } L l=1 .Following (Wang and Benning, 2022), we observe where σ l : R n l → R n l is the proximal map with respect to Ψ l , i.e.
σ l (z) = arg min for all l ∈ {1, . . ., L}.This means that we will solely focus on feed-forward neural networks with nonlinear activation functions that are proximal maps.
Another useful property is that the functions B Ψ l are continuously differentiable with respect to their second argument.If we define F l x (z) := B Ψ l (x, z), we observe Please note that the family of objective functions B Ψ l satisfies several other interesting properties; we refer the interested reader to (Wang and Benning, 2022, Theorem 10).
For the remainder of this work, we assume that the parametrised functions f are affine-linear in the first argument, with parameters Θ l .A concrete example is the affine-linear transformation f (x, Θ l ) = W l x + b l , for a (weight) matrix W l ∈ R n l ×n l−1 , a (bias) vector b l ∈ R n l and the collection of parameters Θ l = (W l , b l ).
In the next section we show that (2) is a variational regularisation method for L = 1 and prove a convergence rate with which the solution of (2) converges towards the true input of a perceptron when δ converges to zero.

CONVERGENCE ANALYSIS AND ERROR ESTIMATES
In this section we show that the proposed model ( 2) is a convergent variational regularisation for the specific choice L = 1 and the assumption f (x, Θ) = W x + b for Θ = (W, b), which reduces (2) to a variational regularisation model for the perceptron case studied in Wang and Benning (2020).In contrast to Wang and Benning (2020) we are not interested in estimating the perceptron parameters W and b but assume that these are fixed, and that we study the regularisation operator where dom(Ψ) is defined as dom(Ψ) := {y ∈ R m | Ψ(y) < ∞}.We first want to establish under which assumptions (5) is well-defined for all y δ .

Well-definedness
For simplicity, we focus on the finite-dimensional setting with network inputs in R n and outputs in dom(Ψ).However, the following analysis also extends to more general Banach space settings with additional assumptions on the operator W , see for instance (Benning and Burger, 2018, Section 5.1).Following (Benning and Burger, 2018), we assume that R is non-negative and the polar of a proper function, i.e.R = H * for a proper function H : R n → R ∪ {∞}.Note that this automatically implies convexity of R.Moreover, we assume that Ψ is a proper, non-negative and convex function that is continuous on dom(Ψ), which implies that B Ψ is proper, non-negative, convex in its second argument and continuous in its first argument for every y δ ∈ dom(Ψ).Then, for every g ∈ dom(Ψ) there exists x with Last but not least, we assume that R and Ψ are chosen such that for each g ∈ dom(Ψ) and α > 0 we have for constants a, d and a constant c that depends monotonically non-decreasing on all arguments.With these assumptions, we can then verify the following lemma.
THEOREM 1.Let the assumptions outlined in the previous paragraph be satisfied.
is well-defined.
2. The regularisation operator R α as defined in (2) is well-defined in the sense that for every y ∈ dom(Ψ) there exists 3. For every sequence y n → y ∈ dom(Ψ) there exists a subsequence PROOF.The results follow directly from (Benning and Burger, 2018), Lemma 5.5, Theorem 5.6 and Theorem 5.7.The latter statement originally only implies convergence in the weak-star topology; however, since we are in a finite-dimensional Hilbert space, this automatically implies strong convergence here.

Error estimates
Having established that ( 2) is a regularisation operator, we now want to prove that it is also a convergent regularisation operator in the sense of the estimate such that Here, the term D R denotes the (generalised) Bregman distance (or divergence) (cf.(Bregman, 1967;Kiwiel, 1997)) with respect to R, i.e.
The vector x α is a solution of (2) with data y δ for which we assume B Ψ (y δ , y) ≤ δ 2 , and C ≥ 0 is a constant.The vector x † is an element of the selection operator as specified in Lemma 1.1, i.e.
x † ∈ S(y) for y ∈ dom(Ψ).Note that x † ∈ S(y) is equivalent to x † being a R-minimising vector amongst all vectors that satisfy 0 = W * σ(W x † + b) − y , where σ denotes the proximal map with respect to Ψ.This is due to the fact that In order to be able to derive error estimates of the form (6), we restrict ourselves to solutions x † that are in the range of R α .This means that there exists y † such that x † ∈ R α (y † ).Considering the optimality condition of (2) for y † , this implies In the following, we verify that the symmetric Bregman distance with respect to R between a solution of the regularisation operator and the solution of the inverse problem is converging to zero if the error in the data is converging to zero.The symmetric Bregman distance or Jeffreys distance between two vectors x and x simply is the sum of two Bregman distances with interchanged arguments, i.e.
for q ∈ ∂R(x) and q ∈ ∂R(x); hence, an error estimate in the symmetric Bregman distance also implies an error estimate in the classical Bregman distance.
Before we begin our analysis, we recall the concept of the Jensen-Shannon divergence (Lin, 1991), which for general proper, convex and lower semi-continuous functions F : R n → R ∪ {∞} generalises to so-called Burbea-Rao divergences (Burbea and Rao, 1982a,b;Nielsen and Boltz, 2011) and are defined as follows.
DEFINITION 1 (Burbea-Rao divergence).Suppose F : R n → R ∪ {∞} is a proper, convex and lower semi-continuous function.The corresponding Burbea-Rao divergence is defined as for all x, x ∈ dom(F ).
Another important concept that we need in order to establish error estimates is that of Fenchel conjugates (cf.Beck (2017)).
DEFINITION 2 (Fenchel conjugate).The Fenchel (or convex) conjugate The Fenchel conjugate that is of particular interest to us is the conjugate of the function B Ψ (y, z) with respect to the second argument, which we characterise with the following lemma.
LEMMA 1.The Fenchel conjugate of F y (z) := B Ψ (y, z) with respect to the second argument z reads PROOF.From the definition of the Fenchel conjugate we observe which concludes the proof.
Having defined the Burbea-Rao divergence and having established the Fenchel conjugate of B Ψ (y, z) with respect to the second argument z for fixed y, we can now present and verify our main result that is motivated by (Benning and Burger, 2011).THEOREM 2. Suppose R and Ψ satisfy the assumptions outlined in Section 3.1.Then, for data y δ and x † that satisfy B Ψ (y δ , W x † + b) ≤ δ 2 with δ ≥ 0, a solution x α ∈ R α (y δ ) of the variational regularisation problem (2), and a solution x † of the perceptron problem y = σ(W x † + b) that satisfies x † ∈ S(y) and (SC), we observe the error estimate for a constant c ∈ (0, 1]. PROOF.Every solution x α that satisfies x α ∈ R α (y δ ) can equivalently be characterised by the optimality condition for any subgradient p α ∈ ∂R(x α ).Subtracting p † ∈ ∂R(x † ) from both sides of the equation and taking a dual product with x α − x † then yields We easily verify We know 0 ≤ D B Ψ (y δ ,W •+b) (x † , x α ) due to the convexity of B Ψ (y δ , W • +b), and we also know that (SC) enables us to choose p † = W * v † .Hence, we can estimate Next, we introduce the constant c ∈ (0, 1] to split the loss functions B Ψ (y δ , W x α +b) and respectively.This means we estimate Next, we make use of Lemma 1 to estimate and Adding both estimates together yields which together with the error bound B Ψ (y δ , W x † + b) ≤ δ 2 concludes the proof.
REMARK 1.We want to emphasise that for continuous Ψ and c > 0 we automatically observe in which case the important question from an error estimate point-of-view is if the term converges quicker to zero than α, as we would need to guarantee lim α→0 J Ψ y δ + α c v † , y δ − α c v † /α = 0 in order to guarantee that the symmetric Bregman distance in (8) converges to zero for α → 0.
EXAMPLE 1 (ReLU perceptron).Let us consider a concrete example to demonstrate that (5) is a convergent regularisation with respect to the symmetric Bregman distance of R. We know that for σ(z) = prox Ψ (z) = max(0, z) to hold true we have to choose ∞ else .This means that for B Ψ (y δ , z) to be well-defined for any z we require y δ i ≥ 0 for all i ∈ {1, . . ., m}.In order for the Burbea-Rao divergence to be well-defined, we further require Hence, we can simplify the estimate (8) to where we have also divided by α on both sides of the inequality.If we choose α(δ) = c(1 + c)δ/ v † , we obtain the estimate as long as we can ensure Together with we have established an estimate of the form (6), with constant C = 2 1+c c v † .Hence, we have verified that the variational regularisation method (2) is not only a regularisation method but even a convergent regularisation method in this specific example.
We want to briefly comment on the extension of the convergence analysis to the general case L > 1 with the following remark.
REMARK 2. The presented convergence analysis easily extends to a sequential, layer-wise inversion approach.Suppose we have L layers and begin with the final layer, then we can formulate the variational problem which is also of the form of (5), but where R has been replaced with Ψ L−1 .Alternatively, one can also replace Ψ L−1 with another function R L−1 if good prior knowledge for the auxiliary variable x L−1 exists.
Once we have estimated x α L−1 , we can recursively estimate for l = L − 1, . . ., 2 and subsequently compute x α as a solution of (2) but with data x α 1 instead of y δ .The advantage of such a sequential approach is that every individual regularisation problem is convex and the previously presented theorems and guarantees still apply.The disadvantage is that for this approach to work in theory, we require bounds for every auxiliary variable of the form B Ψ l (x α l , W l x α l−1 + b l ) ≤ δ 2 l , which is a rather unrealistic assumption.Moreover, it is also not realistic to assume that good prior knowledge for the variables exist.
Please note that showing that the simultaneous approach (2) is a (convergent) variational regularisation is beyond the scope of this work as it is harder and potentially requires additional assumptions for the following reason.The overall objective function in (2) is no longer guaranteed to be convex with respect to all variables simultaneously, which means that we cannot simply carry over the analysis of the single-layer to the multi-layer perceptron case.
This concludes the theoretical analysis of the perceptron inversion model.In the following section we focus on how to implement (5) and its more general counterpart (2).

IMPLEMENTATION
In this section, we describe how to computationally implement the proposed variational regularisation for both the single-layer and the multi-layer perceptron setting.More specifically, we show that the proposed variational regularisation can be efficiently solved via a generalised primal-dual hybrid gradient method and a coordinate descent approach.

Inverting perceptrons
To begin with, we first consider the example of inverting a (single-layer) perceptron.For L = 1, Problem (2) reduces to (5), which for a composite regularisation function R • K reads Here K is a matrix and αR(Kx) denotes the regularisation function acting on the argument x.The above Problem (10) can be reformulated to the saddle-point problem where R * denotes the convex conjugate of R. Computationally, we can then solve the saddle-point problem with a generalised primal-dual hybrid gradient (PDHG) method (Zhu and Chan, 2008;Pock et al., 2009;Esser et al., 2010;Chambolle andPock, 2011, 2016;Benning and Riis, 2021): where we alternate between a descent step in the x variable and an ascent step in the dual variable z.Since ( 10) is a convex minimisation problem, ( 12) is guaranteed to converge globally for arbitrary starting point, given that τ x and τ z are chosen such that τ x τ z < 1/ K .
In this work, we will focus on the discrete total variation ∇x p,1 , (Rudin et al., 1992;Chambolle and Lions, 1997), as our regularisation function R(Kx), but other choices are certainly possible.If we consider a two-dimensional scalar-valued image x ∈ R H×W , we can define a finite forward difference discretisation of the gradient operator ∇ : R H×W → R H×W ×2 as The discrete total variation is defined as the 1 norm of the p-norm of the pixel-wise image gradients, i.e.
For our numerical results we consider the isotropic total variation and consequently choose p = 2. Hence for a perceptron with affine-linear transformation f (x, Θ) = W x + b, and with σ = prox Ψ denoting the activation function, the PDHG approach (12) of solving the perceptron inversion problem ( 5) can be summarised as Please note that we define the discrete approximation of the divergence div such that it satisfies div = −∇ in order to be the negative transpose of the discretised finite difference approximation of the gradient in analogy to the continuous case, which is why the sign in (13a) is flipped in comparison to (12a).The proximal map with regards to the convex conjugate of • * 2,1 is simply the argument itself if the maximum of the Euclidean vector-norm per pixel is bounded by one or the projection onto this unit ball.

Inverting multi-layer perceptrons
We now discuss the implementation of the inversion of multi-layer perceptrons with L layers as described in (2).Note that in this case in order to minimise for x, we also need to optimise with respect to the auxiliary variables x 1 , . . ., x L−1 .
For the minimisation of (2) we consider an alternating minimisation approach, also known as coordinate descent (Beck and Tetruashvili, 2013;Wright, 2015;Wright and Recht, 2022).In this approach we minimise the objective with respect to one variable at a time.In particular, we focus on a semi-explicit coordinate descent algorithm, where we linearise with respect to the smooth functions of the overall objective function.This breaks down the overall minimisation problem into L sub-problems, where for x 0 and each x l variable for l ∈ {1, . . ., L − 1}, we have individual minimisation problems of the following form: x k+1 l = arg min Note that one advantage for adopting this approach is that we exploit that the overall objective function is convex in each individual variable when all other variables are kept fixed.In the following, we will discuss different strategies to computationally solve each sub-problem.
When optimising with respect to the input variable x 0 , the structure of sub-problem (14a) is identical to the perceptron inversion problem that we have discussed in Section 4.1.Hence, we can approximate x k+1 0 with ( 11), but now with respect to x k 1 instead of y δ , which yields the iteration For each auxiliary variable x l with l ∈ {1, . . ., L − 1}, the sub-problem associated with (14b) amounts to solving a proximal gradient step with suitable step-size τ x l , which we can rewrite to This concludes the discussion on the implementation of the regularised single-layer and multi-layer perceptron inversion.In the next section, we present some numerical results to demonstrate the effectiveness of the proposed approaches empirically.

NUMERICAL RESULTS
In this section, we present numerical results for the perceptron inversion problem implemented with the PDHG algorithm as outlined in (13), and for the multi-layer perceptron inversion problem implemented with the coordinate descent approach as described in ( 15) and ( 16).All results have been computed using PyTorch 3.7 on an Intel Xeon CPU E5-2630 v4.

The Perceptron
We present results for two experiments: the first one is the perceptron inversion of the image of a circle from the noisy output of the perceptron, where we compare the Landweber regularisation and the total-variation-based variational regularisation (5).For the second experiment, we perform perceptron inversion for samples from the MNIST dataset and compare them with the performance of linear and nonlinear decoders.Circle We begin with the toy example of recovering the image of a circle from noisy measurements of a ReLU perceptron.To prepare the experiment, we generate a circle image x † ∈ R 64×64 , as shown in Figure 1.We construct a perceptron with ReLU activation function using random weights and biases where W ∈ R 512×4096 , b ∈ R 512×1 .The weights operates on the column-vector representation of x, where x ∈ R 4096×1 .The noise-free data is generated via the forward operation of the model, i.e. y = σ(W x † + b).We generate noisy data y δ by adding Gaussian noise with mean 0 and standard deviation 0.005.Note that we clip all the negative values of y δ to ensure y δ ∈ dom(Ψ).
A first attempt to solve this ill-posed perceptron inversion problem is via Landweber regularisation (Landweber, 1951).In Figure 2 we see the reconstructed image obtained with Landweber regularisation in combination with early stopping following Morozov's discrepancy principle Morozov (2012);Engl et al. (1996).Even though the Landweber regularised reconstruction matches the data up to the discrepancy value σ(W x K + b) − y δ , the recovered image does not resemble the image x † .We will discuss shortly the reason for this visually poor inversion.In comparison, we see a regularised inversion via the total variation regularisation approach following (13) in Figure 3.The regularisation parameter for this reconstruction is chosen as α = 1.5 × 10 −2 .Both x 0 and z are initialised with zero vectors.The stepsize-parameters are chosen as τ x = 1.99/W 2 2 and τ z = 1/(8α), see (Chambolle, 2004).We stop the iterations when changes in x 0 and z in norm are less than a threshold of 10 −5 or when we reach the maximum number of iterations, which we set to 10000.As shown in Figure 3, the TV-regularisation approach is capable of finding a (visually) more meaningful solution.
To explain why the Landweber iteration performs worse compared to the total variation regularisation for this specific example, we compare the 2 norms of each two solutions and the groundtruth image x † .The 2 norm of the Landweber solution in Figure 2 measures 6.58 while the TV-regularised solution as in Figure 3 and the groundtruth image x † measure 25.69 and 28.07 respectively.This is not surprising, as the Landweber iteration is known to converge to a minimal Euclidean norm solution if the noise level converges to zero.On the other hand, when we compare the TV semi-norm of each solution, the groundtruth image in measures 128.0, while the Landweber solution in Figure 2 and TV-regularised solution in Figure 3 measure 707.02 and 114.93 respectively, suggesting that the TV-semi-norm is a more suitable regularisation function for the inversion of cartoon-like images such as x † .MNIST In this second example, we perform perceptron inversion on the MNIST dataset (LeCun et al., 1998).In particular, we consider the following experimental setup.We first train an autoencoder A(x) = D(E(x, Θ E ), Θ D ), where D(•, Θ D ) and E(•, Θ E ) denotes the decoder and the encoder, parametrised by To be more precise, we first train a two-layer fully connected autoencoder y = W 2 (σ(W 1 x + b 1 )) + b 2 using the vanilla stochastic gradient method (SGM) by minimising the mean squared error (MSE) on the MNIST training dataset.We set the code dimension to 100 and use ReLU as the activation function.Hence All MNIST images are centred as a means of pre-processing.Algorithmically, we follow (13) to computationally solve (10).The stepsize-parameters are chosen at τ x = 1.99/W 1 2 2 and τ z = 1/(8α).We choose the regularisation parameter α in the range [10 −4 , 10 −2 ] and set to 5 × 10 −3 for all sample images from the training set, and set to α = 5 × 10 −2 for all sample images from the validation set.These choices work well with regards to the visual quality of the inverted images.
In Figure 4 and Figure 5, we show visualisations of five sample images from the training set, and from the validation set respectively.In comparison, we have also visualised the decoder output.As can be seen, using the code that contains the same compressed information, the inverted images show more clearly  defined edges and better visual quality than the decoded outputs.This is to be expected as we compare a nonlinear regularised inversion method with a linear decoder.

Multi-layer perceptrons
In this section, we present numerical results for inverting multi-layer perceptrons.In particular, we consider feedforward neural networks with convolutional layers (CNN), where in the network architecture two-dimensional convolution operations are used to represent the linear operations in the affine-linear functions f (x, Θ).Similar to the experimental design described in Section 5.1, we consider a multi-layer neural network inversion problem where we infer input image x from a noise perturbed code y δ .
More specifically, we first train a six-layer convolutional autoencoder on the MNIST training dataset via stochastic gradient method to minimise the MSE.The encoder E(x, Θ E ) consists of two convolutional layers, both with 4 × 4 convolutions with stride 2, each followed by the application of a ReLU activation function.As image spatial dimension reduce by half, we double the number of feature channels from 8 to 16.We use a fully-connected layer with weights W 3 ∈ R 300×784 and bias b 3 ∈ R 300×1 to generate the code.The decoder network first expands the code with an affine-linear transformation with weights W 4 ∈ R 784×300 and bias b 4 ∈ R 784×1 .This is followed by two layers of transpose convolutionals with kernel size 4 × 4, where each is followed by a ReLU activation function.The number of feature channels halves each time as we double the spatial dimension.Following the implementation details outlined in Section 4.2, we iteratively compute the update steps ( 15) and ( 16).For the PDHG method, we choose the stepsize-parameters as τ x = 1.99/W 1 2 2 and τ z = 1/(8α).The initial values x 0 and z are both zero.The update steps stop either after reaching the maximum iterations of 1500 or when the improvements on x 0 and z are less than 10 −5 in norm.For the coordinate descent algorithm, the stepsize-parameters are set to τ x l = 1.99/W l+1 2 2 for each layer.In Figure 6 and Figure 7, we visualise the inverted images, the decoder output images, along with the groundtruth images, from the training dataset and validation dataset respectively.For each image, α is chosen in the range [10 −4 , 10 −2 ] and set at 9 × 10 −3 for both training sample images and validation sample images for best visual inversion quality.
In Figure 8 we further compare how total variation regularisation and decoder respond to different levels of data noise.The noisy data is produced by adding Gaussian noise to perturb the code of each image.We start with zero mean Gaussian noise with standard deviation 0.33 and gradually reduce the noise level, this translates to decreasing δ 2 from 6.80 down to 0.00.
Please note that for each noise level the regularisation factor α is manually selected in the range [10 −4 , 10 −2 ] for the best PSNR value.As we can see, for the noise level with standard deviation 0.33 where δ 2 is at 6.80, the decoder is only capable of producing a blurry distorted output, while the inverted image shows the structure of the digit more clearly.When we decrease the noise level down to 0.00, the inverted image becomes more clean-cut while the decoded image is still less sharply defined.
Figure 9 plots the PSNR value of the decoded and inverted image against decreasing noise level.We want to emphasise that it would be more rigorous to compute and compare D symm R (x α(δ) , x † ) as suggested in the error estimate bound in (8), but empirically the PSNR value does also support the notion of a convergent regularisation.the multi-layer perceptron case, which can be carried out sequentially, albeit under unrealistic assumptions.We have discussed implementation strategies to solve the proposed scheme computationally, and presented numerical results for the regularised inversion of the image of a circle and piecewise constant images of hand-written digits from single-and multi-layer perceptron outputs with total variation regularisation.Despite all the positive achievements presented in this work, the proposed framework also has some limitations.The framework is currently restricted to feed-forward architectures with affine-linear transformations and proximal activation functions.While it is straight-forward to extend the framework to other architectures such as ResNets (He et al., 2016) or U-Nets (Ronneberger et al., 2015), it is not straight-forward to include nonlinear operations that cannot be expressed as proximal maps of convex functions, such as max-pooling.However, for many examples there exist remedies, such as using average pooling instead of max-pooling in the previous example.
An open question is how a convergence theory without restrictive, unrealistic assumptions can be established for the multi-layer case.One issue is the non-convexity of the proposed formulation.A remedy could be the use of different architectures that lead to lifted Bregman formulations that are jointly convex in all auxiliary variables.And last but not least, one would also like to consider other forms of regularisation, such as iterative regularisation, data-driven regularisations (Kabri et al., 2022), or even combinations of both (Aspri et al., 2020).However, a convergence analysis for such approaches is currently an open problem.Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017).Unpaired image-to-image translation using cycleconsistent adversarial networks.In Proceedings of the IEEE international conference on computer vision.2223-2232 Zhu, M. and Chan, T. ( 2008).An efficient primal-dual hybrid gradient algorithm for total variation image restoration.Ucla Cam Report 34, 8-34

Figure 4 .
Figure 4. Groundtruth input images from the MNIST training dataset, together with the corresponding autoencoder output images and inverted input images of the perceptron Figure 5. Groundtruth input images from the MNIST validation dataset, together with the corresponding autoencoder output images and inverted input images of the perceptron

Figure 6 .
Figure 6.Inverted input images of the CNN from the MNIST training dataset, together with well-trained autoencoder output images and groudtruth input images

Figure 7 .
Figure 7. Inverted input images of the CNN from the MNIST validation dataset, together with well-trained autoencoder output images and groudtruth input images

Figure 8 .
Figure 8. Visualisation of the comparison between inverted image and decoded image against various levels of noise.Top: Decoded output image from the trained convolutional autoencoder.Bottom: Inverted input image from the CNN with total variation regularisation.

Figure 9 .
Figure 9.Comparison of PSNR values of total variation-based reconstruction and decoder output per noise level.Each curve reports the change of PSNR value over gradually decreasing levels of Gaussian noise, with δ 2 ranging from 0.00 to 6.80.
Benning and Burger (2018))e of a source condition element v † that satisfies the source condition (cf.Engl et al. (1996);Benning and Burger (2018))