Algorithm Unfolding for Block-sparse and MMV Problems with Reduced Training Overhead

In this paper we consider algorithm unfolding for the Multiple Measurement Vector (MMV) problem in the case where only few training samples are available. Algorithm unfolding has been shown to empirically speed up, in a data-driven way, the convergence of various classical iterative algorithms, but for supervised learning it is important to achieve this with minimal training data. To this end we consider the learned block iterative shrinkage thresholding algorithm (LBISTA) under different training strategies. To approach almost data-free optimization at minimal training overhead, the number of trainable parameters for algorithm unfolding has to be substantially reduced. We therefore explicitly propose a reduced-size network architecture based on the Kronecker structure imposed by the MMV observation model and present the corresponding theory in this context. To ensure proper generalization, we then extend the analytic weight approach of Liu et al. to LBISTA and the MMV setting. Rigorous theoretical guarantees and convergence results are stated for this case. We show that the network weights can be computed by solving an explicit equation at the reduced MMV dimensions which also admits a closed-form solution. Towards more practical problems, we then consider convolutional observation models and show that the proposed architecture and the analytical weight computation can be further simplified, opening new directions for convolutional neural networks. Finally, we evaluate the unfolded algorithms in numerical experiments and discuss connections to other sparse recovery algorithms.


Introduction
This paper connects the multiple measurement vector (MMV) problem, block- or joint-sparsity, and recent results on deep unfolding of the iterative shrinkage thresholding algorithm (ISTA) to reconstruct unknown joint-sparse vectors from given linear observations. Such vectors could be, for example, signals received at the different antennas in a wireless communication problem or, in a computational imaging setup, discrete images observed at different detectors or aggregation stages.
Compressed sensing is a way to reconstruct signals from underdetermined systems of compressive measurements; first theoretical breakthroughs were achieved by Candès, Romberg, Tao and Donoho [7,16], leading to an approach where fewer samples can be used than required by the Nyquist-Shannon sampling theorem [37]. They were able to show that unknown vectors can be reconstructed using convex optimization if the linear mapping fulfils certain assumptions [7,8]. These ideas rely on minimizing the ℓ1-norm to promote sparsity and on the approach of basis pursuit [35]. The resulting convex optimization problems can be solved with iterative algorithms: in [19] gradient projection approaches are presented, and in [20] the idea of thresholding algorithms, which will also be discussed in this work. Although this is already a well-researched field, in practice the many iterations and large underlying systems lead to high computational effort, which makes these methods unsuitable for real-world applications. Thus Karol Gregor and Yann LeCun proposed to use the iterative structure of these algorithms as a neural network and to train each iteration step [24], which is also referred to as deep unfolding and will be discussed in Section 2. Convergence of deep unfolding of the iterative thresholding algorithm has been studied by Chen et al. [11]. Liu et al. proposed in [27] to exclude the weight matrix from the data-driven optimization approach and to pre-compute it by data-free optimization, presenting Analytical LISTA (ALISTA). In recent results Chen et al. reduced the training procedure even further by showing that tuning only three hyperparameters is sufficient, proposing HyperLISTA [12]. Creating large sets of training data is often difficult in practice, so it is important to reduce the number of trainable parameters. Therefore, we extend the stated concepts to the block-sparse setting and especially to the multiple measurement vector problem, developing suitable learned algorithms with only a few trainable parameters and similar theoretical guarantees as ALISTA.
In the following we derive the connection to block-sparsity.
In an MMV problem we assume that we obtained d ∈ N measurements y_l ∈ R^m, l = 1, . . ., d from d sparse signal vectors x_l ∈ R^n sharing the same support supp(x_l) = {i : |x^l_i| ≠ 0}, which is referred to as joint-sparsity [39,9]. The MMV problem can then be stated as solving the d equations
y_l = K x_l, l = 1, . . ., d,  (1)
with K ∈ R^{m×n}. This can be rewritten in matrix form as
Y = K X,  (2)
where X = (x_1, . . ., x_d) ∈ R^{n×d} and Y = (y_1, . . ., y_d) ∈ R^{m×d}. With the vectorizing operator vec(·), stacking the columns of a matrix on top of each other, we can cast (2) into a block-sparse problem. We have
vec(Y^T) = (K ⊗ I_d) vec(X^T),
where we used the well-known vectorization property of matrix equations, see for example [36].
Here ⊗ is the Kronecker product.
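To make the vectorization step concrete, the following is a small NumPy check of the identity vec(Y^T) = (K ⊗ I_d) vec(X^T); the dimensions and the random data are purely illustrative.

```python
import numpy as np

# Sanity check of the vectorization identity used above:
# Y = K X  <=>  vec(Y^T) = (K ⊗ I_d) vec(X^T),
# so the MMV problem becomes a block-sparse problem with matrix D = K ⊗ I_d.
rng = np.random.default_rng(0)
m, n, d = 4, 6, 3                      # illustration sizes only

K = rng.standard_normal((m, n))        # shared measurement matrix
X = rng.standard_normal((n, d))        # columns x_l (joint support not enforced here)
Y = K @ X

D = np.kron(K, np.eye(d))              # block-sparse observation matrix, shape (m*d, n*d)
x = X.T.reshape(-1, order="F")         # vec(X^T): n blocks of length d
y = Y.T.reshape(-1, order="F")         # vec(Y^T)

assert np.allclose(D @ x, y)           # the two formulations agree
```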

Block Sparsity
In the more general setting we want to reconstruct an unknown vector x ∈ R^{n_x} from a given matrix D ∈ R^{n_y×n_x} and given y ∈ R^{n_y} with n_x = nd, n_y = md for some n, m, d ∈ N. We assume that noise ε ∈ R^{n_y} is added to Dx, i.e.,
y = Dx + ε.  (3)
In applications this problem is often ill-posed, i.e. n_y < n_x or D is not invertible. We assume that x is the concatenation of n "smaller" vectors of length d, called blocks, i.e.
x = (x[1]^T, . . ., x[n]^T)^T with x[i] ∈ R^d.
Following the notation in [38] we define the indicator I(‖x[i]‖_2) = 1 if ‖x[i]‖_2 > 0 and equal to zero otherwise, and the ℓ_{2,0}-norm is defined as ‖x‖_{2,0} = Σ_{i=1}^{n} I(‖x[i]‖_2); a vector x is called s-block-sparse if ‖x‖_{2,0} ≤ s. Without loss of generality we can assume that these blocks are orthonormal, see [38] and Definition 1. The block-coherence of D is
µ_b(D) = max_{i≠j} (1/d) ‖D[i]^T D[j]‖_2.
Here we use ‖A‖_2 = sqrt(λ_max(A^T A)), where λ_max denotes the largest eigenvalue of the matrix A^T A. Note that the block-coherence can also be introduced with a normalization factor 1/(‖D[i]‖_2 ‖D[j]‖_2), but since we assume orthonormal blocks we can neglect this. For d = 1 this reduces to the already known coherence
µ(D) = max_{i≠j} |D_{:,i}^T D_{:,j}|,
where D_{:,i} denotes the ith column of D, see [17]. In [38] it is shown that 0 ≤ µ_b(D) ≤ µ(D) ≤ 1, and it is possible to derive recovery statements for small µ_b similar to those for small µ; for details see [38]. We can also consider the cross block-coherence, which compares two matrices and will be important in the following.
Similar to ordinary basis pursuit and LASSO [10,31,21,35] we use the following ℓ_{2,1}-LASSO to solve (3):
min_x (1/2)‖Dx − y‖_2^2 + α‖x‖_{2,1},  (7)
where ‖x‖_{2,1} = Σ_{i=1}^{n} ‖x[i]‖_2 is the ℓ_{2,1}-norm of x, which promotes block-sparsity of the solution of (7). This convex program can be solved by the fixed point iteration known as the block iterative shrinkage thresholding algorithm (Block-ISTA/BISTA)
x^{(k+1)} = η_{αγ}(x^{(k)} + γ D^T (y − D x^{(k)})),  (8)
where η_α is the block-soft-thresholding operator, given block-wise as
η_α(x)[i] = max(0, 1 − α/‖x[i]‖_2) x[i].
BISTA is an already well-studied proximal gradient algorithm based on the functional in (7). It is known that the fixed point iteration in (8) converges for γ ∈ (0, 1/L), where L = ‖D‖_2^2 is the Lipschitz constant of the least squares term in (7) w.r.t. x, to a solution of (7), if at least one solution exists, see for example [6,2,3]. On the other hand, the choice of the regularization parameter α has to be made empirically and is crucial for a "good" recovery. If α is set too large, this can lead to too much damping, possibly setting blocks to 0 that actually have a non-zero norm; if α is too small we get the reverse effect. In practice this leads to problems, since computing the iterations requires high computational effort. Deep unfolding is a way to tackle these problems, i.e. to reduce the number of iterations by learning optimal regularization parameters and step-sizes. There are classical approaches to increasing the convergence speed by an additional step in the update of the current iterate x^{(k)} that uses the previous iterate x^{(k−1)}, resulting in Block Fast ISTA [3,15]. However, the choice of optimal parameters is still made empirically.
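As an illustration, the following NumPy sketch implements the block-soft-thresholding operator and the plain BISTA iteration (8); the function names and the boundary choice γ = 1/‖D‖_2^2 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def block_soft_threshold(x, alpha, d):
    """Apply eta_alpha block-wise: shrink each length-d block of x towards 0."""
    blocks = x.reshape(-1, d)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - alpha / np.maximum(norms, 1e-12))
    return (scale * blocks).reshape(-1)

def bista(D, y, d, alpha, n_iter=100):
    """Plain BISTA iteration (8) with step-size gamma = 1 / ||D||_2^2."""
    gamma = 1.0 / np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = block_soft_threshold(x + gamma * D.T @ (y - D @ x), alpha * gamma, d)
    return x
```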

Deep Unfolding and Learned BISTA
Recently, the idea of deep unfolding has been developed, where the goal is to optimize the parameters of such an iterative algorithm [24,28,27]. This in turn gives us an iterative algorithm with optimally chosen step-size γ and regularization parameters, but we will see that we do not have to restrict ourselves to only those parameters. Recently Fu et al. proposed Ada-BlockLISTA by applying deep unfolding to block-sparse recovery [22]. They show an increase in convergence speed in numerical examples, but do not provide a theoretical study.
In the following we present the idea of deep unfolding and then derive Learned BISTA (LBISTA).

Deep Unfolding
We will now formalize the concept of deep unfolding for an arbitrary operator that depends on a certain set of parameters, before we apply this to the previously presented fixed point iteration. To this end we define an operator T(·; θ, y) which depends on a set of parameters θ ∈ Θ, like the step-size of a gradient descent operator, and on an input y. For example, for BISTA this would be θ = (α, γ) with
T(x; θ, y) = η_{αγ}(x + γ D^T (y − Dx)).
We assume that Fix(T(·; θ, y)) ≠ ∅ and that the fixed point iteration
x^{(k+1)} = T(x^{(k)}; θ, y)  (11)
converges for an arbitrary x^{(0)} ∈ X. Deep unfolding now interprets each iteration step as a layer of a neural network and uses the parameters θ ∈ Θ as trainable variables. In more detail this means we look at the Kth iterate of (11), i.e. the composition
T_Θ(x^{(0)}; y) = T(·; θ^{(K)}, y) ∘ · · · ∘ T(·; θ^{(1)}, y)(x^{(0)}),
where we get the full parameter space Θ^K = Θ × · · · × Θ with trainable variables θ = (θ^{(1)}, . . ., θ^{(K)}). T_Θ will then be the neural network which is trained with respect to θ. It thus seems that deep unfolding can be applied to any iterative algorithm and help us to estimate the best choice of parameters, but here we only present deep unfolding for BISTA and consider deep unfolding for arbitrary operators in future work.

Learning
In this section we give an overview of the training procedure used in this work. The general idea of supervised learning is to choose model parameters such that the predictions are close, in some sense, to the unknown target, i.e., in our case the unknown vector x in (3) generating the measurement y. Hence, we aim to minimize an expected loss over an unknown distribution D:
min_θ E_{(x*, y)∼D} [ℓ(x̂ − x*)],
where ℓ(·) is a given loss function, x̂ = T_Θ(x^{(0)}; y) is the output of the model and x* is the ground-truth.
Here, we will use the squared ℓ_2-loss ℓ(x) = (1/2)‖x‖_2^2. The objective functional R(θ) = E_{x*∼D}[ℓ(x̂ − x*)] is also called the risk of the model. Since the underlying distribution D is unknown, we take a batch of independently drawn samples of input and output data (x*_j, y_j) for j = 1, . . ., n_train according to (3) and instead minimize, in a data-driven way, the empirical risk
R̂(θ) = (1/n_train) Σ_{j=1}^{n_train} ℓ(x̂_j − x*_j).  (15)
Proceeding in this way for all layers at once is sometimes referred to as end-to-end learning. Because of the special structure of our deep unfolding models, and inspired by [33], we instead train the network layer-wise, by optimizing only θ^{(k−1)} for layer k, yielding the following training procedure:
min_{θ^{(k−1)}} (1/n_train) Σ_{j=1}^{n_train} ℓ(x̂^{(k)}_j − x*_j),
where x̂^{(k)} is the output of the kth layer. We realize this training as follows: we generate a validation set (x*_{i,validation}, y_{i,validation}), used to evaluate the model while training, and a training set (x*_{i,train}, y_{i,train}), i = 1, . . ., n_train, used to calculate (15). This objective is locally minimized by gradient descent methods. As a stopping criterion we evaluate the normalized mean square error,
NMSE = ‖x̂ − x*‖_2^2 / ‖x*‖_2^2,
on the validation set, and stop if the maximum of all evaluated NMSE values stays the same for a given number of iterations. See Algorithm 1, where Adam is the ADAM optimizer [30], depending on a learning rate t_r and the functional to be minimized with respect to the given variables, here the loss function ℓ(·) with respect to θ^{(k−1)}.
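The following Python sketch illustrates the layer-wise training loop described above for LBISTA-CP (17); for a dependency-light illustration it replaces the Adam optimizer of Algorithm 1 with SciPy's Nelder-Mead routine, and the function names (forward, nmse, train_layerwise) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def block_soft_threshold(x, alpha, d):
    blocks = x.reshape(-1, d)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    return (np.maximum(0.0, 1.0 - alpha / np.maximum(norms, 1e-12)) * blocks).reshape(-1)

def forward(D, B, Y, d, thetas):
    """Run LBISTA-CP (17) for len(thetas) layers on a batch Y (columns are samples)."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    for alpha, gamma in thetas:
        Z = X + gamma * (B.T @ (Y - D @ X))
        X = np.apply_along_axis(block_soft_threshold, 0, Z, alpha, d)
    return X

def nmse(X_hat, X_true):
    return np.sum((X_hat - X_true) ** 2) / np.sum(X_true ** 2)

def train_layerwise(D, B, Y_train, X_train, d, n_layers):
    """Layer-wise training: optimize only (alpha, gamma) of the newest layer."""
    thetas = []
    gamma0 = 1.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_layers):
        def loss(theta):                      # empirical risk (15) at the current depth
            X_hat = forward(D, B, Y_train, d, thetas + [tuple(theta)])
            return 0.5 * np.mean(np.sum((X_hat - X_train) ** 2, axis=0))
        res = minimize(loss, x0=np.array([0.1, gamma0]), method="Nelder-Mead")
        thetas.append(tuple(res.x))
    return thetas
```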

Learned BISTA
In the following we present four different unfolding techniques for BISTA. We distinguish a tied case (weights are shared between different layers) and an untied case (individual weights per layer), which refer to different training approaches for the matrices involved.

Tied LBISTA: The idea of LBISTA is to fix the matrices S and B for all layers, but also to include them in the set of trainable variables:
x^{(k+1)} = η_{α^{(k)}}(S x^{(k)} + B y).  (16)
For LBISTA (16) we get the trainable variables θ = (S, B, {α^{(k)}}_k). Algorithm (16) is also referred to as vanilla LISTA in the sparse case. Inspired by the LISTA-CP model, i.e. LISTA with coupled parameters, proposed in [27], we will also consider LBISTA-CP
x^{(k+1)} = η_{α^{(k)}}(x^{(k)} + γ^{(k)} B^T (y − D x^{(k)})).  (17)
For LBISTA-CP (17) we get θ = (B, {α^{(k)}, γ^{(k)}}_k).

Untied LBISTA: The idea of untied LBISTA is to train different matrices S^{(k)} and B^{(k)} in each layer, i.e.,
x^{(k+1)} = η_{α^{(k)}}(S^{(k)} x^{(k)} + B^{(k)} y).  (18)
For LBISTA (untied) (18) we get the trainable variables θ = ({S^{(k)}, B^{(k)}, α^{(k)}}_k).
Hence, compared to O(n_y n_x + K) parameters in the tied case, more training data and longer training time are required to train the O(K n_y n_x + K) parameters of the untied case.
We initialize the trainable variables with values from the original BISTA iteration. In [11] it was shown that convergence of LISTA-CP (untied) can be guaranteed if the matrices B^{(k)} belong to a certain set, and this proof can be extended to block-sparsity.
The steps are very similar to the convergence proof for Learned Block Analytical ISTA given in the next section.
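A minimal sketch of the BISTA-based initialization mentioned above for the tied variant (16); the default value of α is an illustrative assumption.

```python
import numpy as np

def bista_init(D, alpha0=0.1):
    """BISTA-consistent initialization of the trainable variables of tied LBISTA (16)."""
    gamma0 = 1.0 / np.linalg.norm(D, 2) ** 2    # step-size in (0, 1/L]
    S0 = np.eye(D.shape[1]) - gamma0 * D.T @ D  # S so that S x + B y reproduces BISTA
    B0 = gamma0 * D.T                           # B acting on the measurements y
    return S0, B0, alpha0, gamma0
```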

Analytical LBISTA
In the previous section we presented several approaches for learned BISTA where the weight matrices are optimized in a data-driven fashion. In [27] Liu et al. instead proposed to analytically pre-compute the weights and only train step-size and threshold parameters.
It turns out that this Analytic LISTA (ALISTA) with the so-called analytical weight matrix performs as well as the learned weights. In the following we extend and improve the theoretical statements for ALISTA to the block-sparse case and propose Analytical LBISTA. In contrast to [27], we provide a direct solution and also show different ways to calculate the analytical weight matrix in different settings.

Upper and Lower Bound
This part of the paper focuses on combining and extending several theoretical statements from [11,27] and applying them to the block-sparse case. With the following two theorems we then present Analytical LBISTA, by showing that it performs as well as LBISTA-CP (untied) with a pre-computed B.

Upper Bound
In this section we start with an upper bound on the error between the approximation generated by (19), i.e. LBISTA-CP (untied),
x^{(k+1)} = η_{α^{(k)}}(x^{(k)} + γ^{(k)} (B^{(k)})^T (y − D x^{(k)})),  (19)
and the exact solution x* for given parameters. For this we modify Assumption 1 from [27] to be consistent with the block-sparse setting.
Assumption 1. We assume (x, ε) ∈ X_b(M, s, σ) with
X_b(M, s, σ) = {(x, ε) : ‖x[i]‖_2 ≤ M for all i, ‖x‖_{2,0} ≤ s, ‖ε‖_1 ≤ σ}.
As already mentioned, a matrix with small block-coherence has good recovery conditions. In [27] Liu et al. propose Analytical LISTA, where the pre-computed matrix B minimizes the mutual cross-coherence. This motivates the following definition.
We define, analogously to [27], the generalized mutual block-coherence
µ̃_b(D) = inf_{B ∈ R^{n_y×n_x}, B[i]^T D[i] = I_d ∀i} max_{i≠j} (1/d) ‖B[i]^T D[j]‖_2,  (21)
and denote by W_b(D) the set of all B ∈ R^{n_y×n_x} which attain the infimum in (21). Note that the set W_b(D) is non-empty, because the set of feasible matrices {B ∈ R^{n_y×n_x} : B[i]^T D[i] = I_d for all i} is non-empty and (21) is a feasible and bounded program, see the supplementary material to [11]. We will call matrices from W_b(D) analytical weight matrices.

Definition 4. The block-support of a block-sparse vector x ∈ X_B(b, s) is defined as
supp_b(x) = {i : ‖x[i]‖_2 ≠ 0}.

We will now derive an upper bound for the ℓ_2-error and thus show convergence of LBISTA-CP for a special matrix B and given parameters α^{(k)} and γ^{(k)}. In [11] linear convergence of unfolded ISTA with additional noise, more precisely of LISTA-CP (untied), was shown if the matrices B^{(k)} belong to a certain set. In [27] it was shown that one can pre-compute such a matrix B by a data-free optimization problem, choose B^{(k)} = B, and still obtain the same performance. For this newly proposed unfolded algorithm linear convergence was also shown, but only in the noiseless case, whereas the results in [11] allow bounded noise. Thus we combine these two proofs and extend them to block-sparsity. For given (x, ε), y = Dx + ε and parameters {θ^{(k)}}_{k=1}^{K}, we abbreviate by {x^{(k)}(x, ε)}_{k=1}^{K} the sequence generated by (19) with x^{(0)} = 0. Further, we define the constants used in the following theorem.

Theorem 1. For any B ∈ W_b(D), any sequence of step-sizes γ^{(k)} and thresholds α^{(k)} chosen accordingly for some κ ≥ 1, with µ = d µ̃_b(D), M > 0 and s < (µ^{−1} + 1)/2, we have (25), where

Note that with κ = 1 we recover the results in [27], but this condition cannot always be met. On the other hand, we need at least α^{(k)} ≥ d γ^{(k)} µ̃_b(D) C_k(X) + Cσ to have (25). One can always find such a κ from the trained parameters and thus apply the theorem afterwards. Obviously a worse α affects the upper bound on the ℓ_2-error and thus appears in ã(τ). Summarizing, the above theorem shows convergence on the training set, even if κ > 1.

Lower Bound
This section states the lower bound for the ℓ_{2,1}-error, showing that for convergence in the ℓ_{2,1}-norm the parameters defined in Theorem 1 are chosen optimally. We now modify Assumption 2 from [27] to be consistent with the block-sparse setting.
Assumption 2. x* is sampled from P_X, and P_X satisfies: the block-support S = supp_b(x*) has cardinality 2 ≤ |S| ≤ s and is uniformly distributed over the whole index set. The non-zero blocks of x* follow a uniform distribution with ‖x*[i]‖_2 ≤ M for all i ∈ S. Moreover, we assume ε = 0.
The latter theorem states that the analytical weight matrix should minimize the generalized mutual block-coherence. Therefore, for a lower bound, we will only consider matrices that are bounded away from the identity. The parameters are chosen from the following set: for {θ^{(k)}}_{k=0}^{K} and x^{(0)} = 0, we define the set of all parameters guaranteeing no false positive blocks in x^{(k)}. This set is non-empty, since (25) holds true if the α^{(k)} are chosen large enough. Following mainly the proof in [27] and extending the setting from sparsity to block-sparsity, the lower bound for the ℓ_{2,1}-norm can be stated as follows.

Analytical LBISTA
Analogously to [27] and following the previous two theorems, we decompose LBISTA-CP (untied), (19), into two steps. In the first step, B is pre-computed such that B ∈ W_b(D), i.e. B minimizes the generalized mutual block-coherence (21). In the second step the parameters θ = {α^{(k)}, γ^{(k)}}_k are trained layer-wise, as discussed in the previous section. This results in a comparable method with only O(K) trainable parameters, instead of O(n_y n_x + K) for LBISTA-CP (17) or even O(K n_y n_x + K) for LBISTA (untied) (18).

Computing the Analytical Weight Matrix
ALBISTA relies on the analytical weight matrix; deriving this matrix can be challenging in practice, so this section focuses on its computation.
We follow the procedure in [27] and approximate (21) by an upper bound. In addition to [27], however, we state a closed form for the solution of this upper bound problem, which in [27] is obtained by a projected gradient descent approach.

Solving An Upper Bound
Since the objective in (21) is not differentiable, one solves the following upper bound problem instead:
min_{B ∈ R^{n_y×n_x}} ‖D^T B‖_F^2  s.t. D[i]^T B[i] = I_d, i = 1, . . ., n.  (31)
This is motivated by the inequality
max_{i≠j} (1/d) ‖B[i]^T D[j]‖_2 ≤ (1/d) ‖D^T B‖_F,
so minimizing the Frobenius norm also controls the generalized mutual block-coherence. In [27] this is solved by a projected gradient method, but the following theorem states a closed form for the solution of (31).
Theorem 3. The minimizer B ∈ R^{n_y×n_x} of (31) is given as the concatenation B = (B[1], . . ., B[n]), where the n blocks B[i] are the minimum-norm solutions of the linear systems (43) stated in Appendix C.

The proof can be found in Appendix C. Let d = 1 and the singular value decomposition of D be given as D = V Σ U^T. Then the solution of (31) takes an even simpler form, since B = (D^+)^T diag(d̃)^{−1}, where d̃ = diag(D^+ D) and + denotes the Moore-Penrose inverse, also yields a solution of (31). Here diag follows the matlab/python notation, where diag of a matrix gives the vector of its main diagonal and diag of a vector gives a diagonal matrix with that vector on its main diagonal. For d ≥ 2 the orthonormal block constraints would not be met, and thus this does not yield a feasible solution.
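A small NumPy sketch that solves (31) block-by-block through the KKT system stated in Appendix C, using the Moore-Penrose pseudo-inverse for the minimum-norm solution; it assumes the KKT system is consistent (e.g. D has full row rank), and the function name is hypothetical.

```python
import numpy as np

def analytical_weight_matrix(D, d):
    """Solve (31) block-by-block via the KKT system (43) from Appendix C."""
    n_y, n_x = D.shape
    n = n_x // d
    B = np.zeros((n_y, n_x))
    G = 2.0 * D @ D.T                                # gradient part of the KKT system
    for i in range(n):
        Di = D[:, i * d:(i + 1) * d]                 # i-th block D[i]
        M = np.block([[G, Di], [Di.T, np.zeros((d, d))]])
        rhs = np.vstack([np.zeros((n_y, d)), np.eye(d)])
        sol = np.linalg.pinv(M) @ rhs                # minimum-norm solution of (43)
        B[:, i * d:(i + 1) * d] = sol[:n_y]
    return B
```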

Computing the Analytical Weight Matrix in a MMV Problem
In practice, large problem sizes often arise, e.g. from a large number of measurements or from measurements y_l, x_l representing images.
For instance, 1000 pixels and 200 measurements lead to a matrix D with on the order of (1000 · 200)^2 elements. Applying the theory for Analytical LBISTA directly could thus be difficult in practice: although D is sparse, computing the analytical weight matrix can take a long time. The following theorem states the connection between the MMV setting and the block-sparse setting for ALBISTA, showing that it is sufficient to minimize the generalized mutual coherence (32), see [27], for K instead of minimizing the generalized mutual block-coherence (21) for D.

Theorem 4. Let B̃ attain the infimum
inf_{B̃ ∈ R^{m×n}, B̃_{:,i}^T K_{:,i} = 1 ∀i} max_{i≠j} |B̃_{:,i}^T K_{:,j}|,  (32)
where B̃_{:,i} refers to the ith column of B̃. Then the minimum of (21) for D = K ⊗ I_d coincides with the minimum of (32) up to the normalization factor 1/d, and it is attained by B = B̃ ⊗ I_d.

The proof can be found in Appendix D. Moreover, the relation µ̃_b(K ⊗ I_d) = µ̃(K)/d holds, so the block-coherence of D can be improved by increasing the number of measurements d. It is also feasible to solve (31) by the pseudo-inverse in the MMV setting, since it is solved for K, i.e. for d = 1.
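A sketch of the Kronecker reduction described above: the weight matrix is computed for K at d = 1 via the pseudo-inverse route mentioned in the text and then lifted to D = K ⊗ I_d; it assumes the normalizing inner products B̃_{:,i}^T K_{:,i} are non-zero, and the function name is hypothetical.

```python
import numpy as np

def analytical_weight_mmv(K, d):
    """Exploit the Kronecker structure: compute the weight matrix for K (d = 1) and lift it."""
    B_tilde = np.linalg.pinv(K).T               # pseudo-inverse route for d = 1
    diag = np.einsum("ij,ij->j", B_tilde, K)    # B~_{:,i}^T K_{:,i} for every column i
    B_tilde = B_tilde / diag                    # enforce the constraint B~_{:,i}^T K_{:,i} = 1
    return np.kron(B_tilde, np.eye(d))          # analytical weight matrix for D = K ⊗ I_d
```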

Circular Matrix Case
Consider now the following setting, where the measurements y_l ∈ R^n are obtained by a circular convolution of x_l with a vector k, i.e.
y_l = k ⊛ x_l = K x_l,
where K is a circular matrix generated by the vector k ∈ R^n, i.e. K = circ(k) ∈ R^{n×n}. Applying Theorem 3 to the circular case yields the following lemma: the minimizer of (31) is itself a circular matrix, B = circ(b) for some b ∈ R^n.
The latter statement implies a simple way to compute b ∈ R^n by using the decomposition B = U diag(σ(B)) U^*, where σ(B) ∈ R^n is the vector of singular values of B, respectively K, filled with zeros, and U is a unitary matrix. With U = (1/√n) F, where F is the Fast Fourier Transform (FFT) matrix, this leads to the conclusion σ(B) = F b = b̂, and thus
b̂ = 1/k̂,  i.e.  b = F^{−1}(1/k̂),
where the expression 1/k̂ is to be interpreted point-wise and set to zero whenever k̂_i = 0. This also shows that the computation of B tends to be difficult in practice if K does not have full rank, since this means that k̂ has at least one zero entry. In this case b̂ has to be rescaled accordingly.
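A minimal sketch of the point-wise spectral inversion described above, assuming a real spectrum k̂ (e.g. a symmetric k as used later in the experiments); the tolerance for detecting zero entries is an illustrative choice, and the rescaling needed for rank-deficient K is not included.

```python
import numpy as np

def analytical_weight_circular(k, tol=1e-12):
    """Circular case: b is obtained by inverting the spectrum of k point-wise."""
    k_hat = np.fft.fft(k)
    b_hat = np.zeros_like(k_hat)
    nz = np.abs(k_hat) > tol                    # zero entries of k_hat are left at zero
    b_hat[nz] = 1.0 / k_hat[nz]
    b = np.real(np.fft.ifft(b_hat))             # k real => b real (up to numerical error)
    return b                                    # the weight matrix is B = circ(b)
```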

Toeplitz Matrix Case
Consider now the more general convolutional setting, where k ∈ R^m and x ∈ R^n with m < n, which results in a Toeplitz matrix K ∈ R^{m×n}. The reasoning of the previous section cannot be applied directly to show that the solution of (31) must be a Toeplitz matrix. However, the following can be observed. Let b̄ be constructed as discussed above for K̄ = circ(k̄), where k̄ is the concatenation of k and a zero vector of suitable dimension, i.e. the first column of the extended circular matrix.
The analytical weight matrix B w.r.t. K can then be constructed from B̄ = circ(b̄) ∈ R^{n×n}, whose columns are cyclic shifts T^j b̄ of b̄ with T the cyclic shift matrix; only an m×n submatrix of B̄ is used.
The columns of K can also be expressed through cyclic shifts of k̄. Hence B is a feasible solution of (31) for K. On the other hand, the cross-coherence max_{i≠j, i,j≤n} |B_{:,i}^T K_{:,j}| remains bounded. Note that to obtain this upper bound K̄ needs to have full rank, which is the case if the time-continuous Fourier transform of k has no zeros; otherwise one has to adjust the discrete grid. Thus constructing B by extending K to a circular matrix is a feasible approach.

Connection to CNNs
It is known that using the FFT in CNNs can reduce the computation time if the convolutional filter is large [34,14]. Using (33) or (36) as in [34], the gradient step can be viewed as follows:
x^{(k+1)} = η_{α^{(k)}}(x^{(k)} + γ^{(k)} b ⊛ (y − k ⊛ x^{(k)})) = η_{α^{(k)}}((e − γ^{(k)} b ⊛ k) ⊛ x^{(k)} + γ^{(k)} b ⊛ y).
This can be interpreted as a convolutional layer with kernel f(γ) = e − γ b ⊛ k and bias b̄ = γ b ⊛ y, where e = [1, 0, . . ., 0]. This means that ALISTA with a Toeplitz matrix can be interpreted as a CNN with only two trainable parameters per layer, the step-size γ^{(k)} and the threshold α^{(k)}. In the setting of an FFT-CNN the update rule can be formulated as
x^{(k+1)} = η_{α^{(k)}}(F^{−1}((1 − γ^{(k)} b̂ ⊙ k̂) ⊙ F x^{(k)} + γ^{(k)} b̂ ⊙ F y)).
On the other hand, using FFT-CNNs only shows a speed-up if we deal with large data sets, e.g. when evaluating high-resolution images, or with large filters, i.e. if m is greater than log(n).
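The following sketch implements one such layer for the circular, sparse (d = 1) case entirely in the Fourier domain; k_hat and b_hat are the precomputed spectra of the filter and of the analytical weight from the previous section, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def alista_conv_layer(x, y, k_hat, b_hat, gamma, alpha):
    """One ALISTA layer for circular convolution, evaluated in the Fourier domain."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    residual_hat = y_hat - k_hat * x_hat            # y - k * x  (circular convolution)
    update = np.real(np.fft.ifft(x_hat + gamma * b_hat * residual_hat))
    return soft_threshold(update, alpha)            # only gamma and alpha are trainable
```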

Numerical Examples
In the following we present numerical results achieved by the presented algorithms¹. We investigate two MMV scenarios: first, the measurements y_l are obtained with a random Gaussian matrix, and second with a rank-deficient random circular convolution. In each scenario we enforce the Kronecker structure and thus reduce the training cost by training only low-dimensional m × n matrices. Furthermore, in the convolutional setting the circular structure is also enforced, reducing the training costs even more. To have a fair comparison, all algorithms are initialized with the analytical weight matrix.

MMV Setting
Table 1.: Properties of the matrix K in both scenarios (case, dimensions, rank(K), µ(K)).
The training data is sampled from an unknown distribution X and generated as follows. The signals x are generated for a given number of blocks n, a given block length d and a probability pnz (probability of non-zeros) that a block x[i] is active, i.e. ‖x[i]‖_2 ≠ 0. If a block is active, its elements are drawn from a Gaussian distribution with variance σ^2 = 1. The measurements y are obtained by (3), where the elements of ε are drawn from a Gaussian distribution with variance σ^2 = pnz · n_x/n_y · 10^{−SNR_dB/10}; SNR_dB is the signal-to-noise ratio in decibel. We consider the following cases. In each case we generate x with d = 15, n = 128, m = 32 and pnz = 10%.
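A sketch of this data generation, assuming the block ordering x = vec(X^T) used throughout; the function name and the random-number handling are illustrative.

```python
import numpy as np

def generate_data(K, d, pnz, snr_db, n_samples, rng=None):
    """Generate block-sparse training pairs (x, y) as described above."""
    rng = np.random.default_rng(0) if rng is None else rng
    m, n = K.shape
    D = np.kron(K, np.eye(d))
    X_true = np.zeros((n * d, n_samples))
    Y = np.zeros((m * d, n_samples))
    sigma2_noise = pnz * (n * d) / (m * d) * 10.0 ** (-snr_db / 10.0)
    for j in range(n_samples):
        active = rng.random(n) < pnz                          # which blocks are non-zero
        x_blocks = np.zeros((n, d))
        x_blocks[active] = rng.standard_normal((active.sum(), d))   # unit-variance entries
        x = x_blocks.reshape(-1)                              # n blocks of length d
        eps = np.sqrt(sigma2_noise) * rng.standard_normal(m * d)
        X_true[:, j] = x
        Y[:, j] = D @ x + eps
    return X_true, Y
```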

Gaussian Measurement Matrix:
In the Gaussian setting we sample an m × n matrix K i.i.d. from a Gaussian distribution with variance σ^2 = 1. We normalize the columns such that D has orthonormal blocks, as assumed in the beginning.

Circular Convolution Matrix: We construct the circular matrix as follows. First we generate a random i.i.d. sampled vector ã but set a certain number of its elements to zero. We define k = F^{−1} ã and can thus generate a rank-deficient matrix D = K ⊗ I_d, where K = circ(k). Thus y_l is obtained through a circular convolution with k, whose spectrum k̂ = F k contains zeros. It is important for this section to generate a rank-deficient matrix in order to have compressive observations; otherwise we would get a trivial problem if K, respectively D, had full rank. Computing D = K ⊗ I_d yields the desired matrix. The properties of these two matrices can be found in Table 1.
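A sketch of one way to obtain such a rank-deficient circular matrix; it differs slightly from the construction above (it zeroes conjugate-symmetric pairs of the spectrum of a real k instead of drawing ã directly) so that k stays real, which is an illustrative assumption.

```python
import numpy as np

def rank_deficient_circulant(n, n_zero_pairs, rng=None):
    """Real k whose circular matrix circ(k) is rank deficient (zeros in the spectrum)."""
    rng = np.random.default_rng(0) if rng is None else rng
    k_hat = np.fft.fft(rng.standard_normal(n))
    idx = rng.choice(np.arange(1, n // 2), size=n_zero_pairs, replace=False)
    k_hat[idx] = 0.0
    k_hat[n - idx] = 0.0                       # keep the spectrum conjugate-symmetric
    k = np.real(np.fft.ifft(k_hat))
    K = np.column_stack([np.roll(k, j) for j in range(n)])   # K = circ(k)
    return K, k
```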

Discussion
The results of the proposed methods can be found in Figure 1a for the Gaussian case and in Figure 1b for the circular case. We also consider the performance of a version of AMP [18]. AMP can be viewed as a Bayesian extension of ISTA with an additional Onsager correction term added to the residual before applying the thresholding operator, i.e.
x^{(k+1)} = η_{α^{(k)}}(x^{(k)} + γ^{(k)} D^T v^{(k)}),  v^{(k)} = y − D x^{(k)} + b^{(k)} v^{(k−1)},
where b^{(k)} is the Onsager correction term given by the (empirical) mean derivative of η evaluated at x^{(k)}. Here we train only γ and α, with the same procedure as already discussed, and replace D^T by B^T. By using B^T instead of D^T we resemble Orthogonal AMP (OAMP) [32], which follows a similar idea as ALISTA. In [32] B is chosen to be de-correlated with respect to K, i.e. tr(I_{n_x} − B^T D) = 0.
The analytical weight matrix B also satisfies this condition. Moreover, in [32] the choice of different matrices is also discussed. Thus, untrained ALISTA with correction term can be viewed as a special case of OAMP. Note also that the structure of Trainable ISTA, proposed in [26], is based on OAMP as well, so there are interrelations between AMP, unfolded ISTA and Analytical ISTA. We refer to learned AMP with the analytical matrix as ALAMP. Differently from [32] we use the ℓ_{2,1}-regularizer instead of plain ℓ_1-regularization, since we consider the MMV setting; this is also discussed in [29,13].

Every proposed algorithm performs better, in terms of NMSE, than its untrained original. Interestingly, learned AMP with analytical weight matrix B performs almost as well as LBISTA (untied) with only a fraction of the trainable parameters. This may come from the fact that block-soft thresholding is not the correct MMSE estimator for the generated signals x, so the correction term b^{(k)} yields a better estimate of x. As expected, ALBISTA and LBISTA-CP (untied) perform almost identically.

In Figure 2a we show plots justifying Theorem 1, similar to those in [27]. In particular, Figure 2a shows that α^{(k)}/γ^{(k)} is proportional to the maximal ℓ_{2,1}-error over all training signals. An interesting behaviour, which carries over from the sparse case, is that the learned ℓ_{2,1}-regularization parameters approach zero as k increases. If α is close to zero, we approach a least-squares problem. This means that after LBISTA has found the support of the unknown signal x*, the algorithm reduces to least-squares fitting. Figure 2b shows that the trained γ^{(k)} are bounded within an interval. Note that, in contrast to [27], Theorem 1 is based on a more general assumption on the thresholding parameters: one can take a suitable κ, obtain the upper bound for the ℓ_2-error and thus have convergence on the training set, if the sparsity assumptions are met.

Figure 3 shows the training loss over the training iterations for the results presented in Figure 1. One can see that ALBISTA needs fewer training iterations than LBISTA-CP (untied) or LBISTA (untied) and thus less training data. The observed jumps occur when moving from one layer to the next, due to the layer-wise training (Algorithm 1). A layer is considered optimized once the NMSE has converged to within 1e−5.

Conclusion
We proposed ALISTA for the block-sparse and MMV case, which is important for many real-world applications, and derived corresponding theoretical convergence and recovery results. We relaxed the conditions on the regularization parameter and thus obtained a more precise upper bound after the network is trained. Nevertheless, this still depends on a sharp sparsity assumption on the unknown signals.
We investigated and derived a direct solution for the analytical weight matrix in the general block-sparse setting as well as for convolutional scenarios. The last section provided numerical results and discussed interrelations with AMP.

Appendices
Appendix A Proof of Theorem 1

The following proof combines the proofs from [27] and [11] for the block-sparse setting with additional noise.
Proof. We prove (25) by induction. For k = 0 the statement is satisfied, because x^{(0)} = 0. We now fix k and assume (25) to be true.
From the induction hypothesis we obtain an inequality for the next iterate, meaning that for all i ∈ S we can control the corresponding block. By (25) we can bound ‖x‖_{2,1}, and from this we obtain a bound on the error of x^{(k+1)}. Taking the supremum over (x*, ε) ∈ X_b(M, s, σ) in the previous inequality and setting ã(k) = − log(a(k)), the inequality can be reformulated accordingly. With ‖·‖_2 ≤ ‖·‖_{2,1} we obtain the final inequality. From s < (d + 1/µ̃_b)/(2d) we get 2dsµ̃_b − dµ̃_b < 1, and with this we get for 0 < γ^{(τ)} ≤ 1 that ã(τ) > 0. In the other case, γ^{(τ)} > 1, we also have ã(τ) > 0.
Step 2: We now estimate the probabilities of the sets in (40) and show (39) by utilizing the uniform distribution of the non-zero blocks on the sphere S_d ⊂ R^d of radius 1. We then obtain the desired probability bound, and the final result follows from (38).
This can be solved with the method of Lagrange multipliers, see for example [5]. The Lagrangian function is defined as L(x, λ) = f(x) + ⟨λ, h(x)⟩, where f is the objective function and h collects the equality constraints, i.e. we minimize f(x) such that h(x) = 0; ⟨·, ·⟩ is an appropriate inner product and λ is called the Lagrange multiplier. This yields the Lagrangian function stated in Appendix C, and the resulting equation system (43) has a minimum-norm solution expressed via the Moore-Penrose inverse, denoted by +, of a matrix [4]. In [25] the Moore-Penrose inverse of a 2 × 2 block partitioned matrix was derived; applying the first theorem of [25] to (43) yields the statement.

Appendix D Proof of Theorem 4
Proof. The proof is straightforward. Let B attain the infimum in (21). To finish the proof we have to solve the corresponding constrained problem for every block i; since the system is solved for every i, the Lagrange multiplier may change with i. On the other hand we observe the following. The vector x = vec(X^T) is block-sparse with n blocks of length d if the signals x_l are jointly sparse. With D = (K ⊗ I_d) ∈ R^{n_y×n_x}, where n_y = m·d and n_x = n·d, we obtain the block-sparse setting considered in this work.

Definition 1. We assume that the blocks of D are orthonormal, i.e. D[i]^T D[i] = I_d, where I_d is the d × d identity matrix. This assumption simplifies the presentation of several statements below. The block-coherence of a matrix D ∈ R^{n_y×n_x} is defined as
µ_b(D) = max_{i≠j} (1/d) ‖D[i]^T D[j]‖_2.
Figure 2.: Plots justifying the results in Theorem 1 (parameters and maximal ℓ_{2,1}-error in the noiseless case) for the problem with circular convolution matrix in Figure 2a and with Gaussian measurement matrix in Figure 2b.

Figure 3.: Training history for the results shown in Figure 1a and Figure 1b, respectively. One can see that ALBISTA needs fewer training iterations than LBISTA-CP (untied) or LBISTA (untied).

Appendix C Proof of Theorem 3

Proof. First, we notice that ‖D^T B‖_F^2 = Σ_{i=1}^{n} ‖D^T B[i]‖_F^2, therefore (31) decouples into n independent problems:
∀i: min_{B[i]} ‖D^T B[i]‖_F^2  (41)  s.t. D[i]^T B[i] = I_d.
The Lagrangian of (41) is
L_B(B[i], Λ) = ‖D^T B[i]‖_F^2 + Λ : (D[i]^T B[i] − I_d)
with Lagrange multiplier Λ ∈ R^{d×d}. Here : denotes the Frobenius inner product, defined as A : B = Σ_{i=1}^{d} Σ_{j=1}^{d} a_{i,j} b_{i,j} for A, B ∈ R^{d×d}. Thus we get the necessary conditions
∇_{B[i]} L(B[i], Λ) = 2 D D^T B[i] + D[i] Λ = 0,
∇_Λ L(B[i], Λ) = D[i]^T B[i] − I_d = 0,
which lead to the equation system
( 2DD^T  D[i] ; D[i]^T  0 ) ( B[i] ; Λ ) = ( 0 ; I_d ).  (43)

For the Kronecker-structured case of Theorem 4, let B = B̃ ⊗ I_d. We have
(B̃ ⊗ I_d)[i]^T (K ⊗ I_d)[i] = (B̃_{:,i}^T K_{:,i}) I_d = I_d,
since B̃_{:,i}^T K_{:,i} = 1, thus B is a feasible solution. With the same argument,
max_{i≠j} ‖B[i]^T D[j]‖_2 = max_{i≠j} |B̃_{:,i}^T K_{:,j}|;
taking the infimum on both sides yields the result, up to the normalization factor 1/d.