Convergence of a Relaxed Variable Splitting Coarse Gradient Descent Method for Learning Sparse Weight Binarized Activation Neural Networks

Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. Binarized activation offers an additional computational saving for inference. Due to the vanishing gradient issue in training networks with binarized activation, the coarse gradient (a.k.a. straight-through estimator) is adopted in practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of a one-hidden-layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data is Gaussian distributed, a no-overlap one-hidden-layer CNN with ReLU activation and general weights can be learned by GD in polynomial time with high probability in regression problems with ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. The sparsity in the network weights is realized through thresholding during the CGD training process. We prove that under thresholding of $\ell_1$, $\ell_0$, and transformed-$\ell_1$ penalties, the no-overlap binary activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel sparsifying operation. We also derive explicit error estimates of the sparse weights from the true weights.


Introduction
Deep neural networks (DNNs) have achieved state-of-the-art performance on many machine learning tasks such as speech recognition [19], computer vision [22], and natural language processing [11]. Training such networks is a problem of minimizing a high-dimensional, non-convex and non-smooth objective function, and is often solved by first-order methods such as stochastic gradient descent (SGD). Nevertheless, the success of neural network training remains to be understood from a theoretical perspective. Progress has been made in simplified model problems. [2] showed that even training a 3-node neural network is NP-hard, and [34] showed that learning a simple one-layer fully connected neural network is hard for some specific input distributions. Recently, several works ([36]; [5]) focused on the geometric properties of loss functions, which is made possible by assuming that the input data distribution is Gaussian. They showed that SGD with random or zero initialization is able to train a no-overlap neural network in polynomial time.
Another prominent issue is that DNNs contain millions of parameters and many redundancies, potentially causing over-fitting and poor generalization [44], besides consuming unnecessary computational resources. One way to reduce complexity is to sparsify the network weights using an empirical technique called pruning [23], so that the non-essential weights are zeroed out with minimal loss of performance [17,26,38]. Recently, a surrogate $\ell_0$ regularization approach based on a continuous relaxation of Bernoulli random variables in the distribution sense was introduced with encouraging results on small image data sets [25]. This motivated our work here to study deterministic regularization of $\ell_0$ via its Moreau envelope and related $\ell_1$ penalties in a one-hidden-layer convolutional neural network model [5]. Moreover, we consider binarized activation, which further reduces computational costs [42].
The architecture of the network is illustrated in Figure 1, similar to [5]. We consider the convolutional setting in which a sparse filter w ∈ R^{k×d}... more precisely, a sparse filter w ∈ R^d is shared among different hidden nodes, and the input sample is Z ∈ R^{k×d}. Note that this is identical to the one-layer non-overlapping case where the input is x ∈ R^{kd} with k non-overlapping patches, each of size d. We also assume that the rows of Z are i.i.d. Gaussian random vectors with zero mean and unit variance; let G denote this distribution. Finally, let σ denote the binarized ReLU activation function, σ(z) := χ_{z>0}, which equals 1 if z > 0, and 0 otherwise. The output of the network in Figure 1 is given by equation (1). We address the realizable case, where the response training data is mapped from the input training data Z by equation (1) with a ground truth unit weight vector w*. The input training data is generated by sampling m training points Z_1, ..., Z_m from the Gaussian distribution G. The learning problem seeks w to minimize the empirical risk function l(w). Due to the binarized activation, the gradient of l in w is almost everywhere zero, hence ineffective for descent. Instead, an approximate gradient on the coarse scale, the so-called coarse gradient (denoted $\widetilde{\nabla}_w l$), is adopted as a proxy and is proved to drive the iterations to a global minimum [42].
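To make the setting concrete, here is a minimal sketch of the forward map and the empirical risk under the assumptions above. Since the display of equation (1) is not reproduced here, the aggregation over patches (a plain sum of binarized activations), the 1/2 factor, and the averaging over samples are assumptions of this sketch, not necessarily the paper's exact formulas.

```python
import numpy as np

def binarized_relu(z):
    """sigma(z) = 1 if z > 0, else 0 (the binarized ReLU activation)."""
    return (z > 0).astype(float)

def network_output(w, Z):
    """Assumed form of the network output (1): sum of binarized activations
    of the k non-overlapping patches (rows of Z) against the shared filter w."""
    return binarized_relu(Z @ w).sum()

def empirical_risk(w, w_star, samples):
    """Squared-loss empirical risk in the realizable setting: labels are
    generated by the same network with the ground-truth filter w_star."""
    return 0.5 * np.mean([(network_output(w, Z) - network_output(w_star, Z)) ** 2
                          for Z in samples])

# Toy usage: k patches of dimension d, m Gaussian samples.
k, d, m = 5, 20, 100
rng = np.random.default_rng(0)
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
samples = [rng.normal(size=(k, d)) for _ in range(m)]
print(empirical_risk(w_star, w_star, samples))  # 0.0 at the true weight
```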
In the limit m ↑ ∞, the empirical risk l converges to the population risk f(w) = E_Z[l(w, Z)], which is more regular in w than l. However, the "true gradient" ∇_w f is inaccessible in practice. On the other hand, the coarse gradient $\widetilde{\nabla}_w l$ in the limit m ↑ ∞ forms an acute angle with the true gradient [42]. Hence the expected coarse gradient descent (CGD) essentially minimizes the population risk f, as desired.
Our task is to sparsify w in CGD. We note that iterative thresholding (IT) algorithms are commonly used for retrieving sparse signals ([3,4,7,10,45] and references therein). In the high-dimensional setting, IT algorithms provide simplicity and low computational cost, while also promoting sparsity of the target vector. We shall investigate the convergence of CGD with simultaneous thresholding for the objective function (4), where f(w) is the population loss function of the network, and P is the $\ell_0$, $\ell_1$, or transformed-$\ell_1$ (T$\ell_1$) penalty, the latter being a one-parameter family of bilinear transformations composed with the absolute value function [28,46]. When acting on vectors, the T$\ell_1$ penalty interpolates $\ell_0$ and $\ell_1$, with thresholding in closed analytical form for any parameter value [45]. The $\ell_1$ thresholding function is known as soft-thresholding [10,13], and that of $\ell_0$ as hard-thresholding [3,4]. The thresholding part should be properly integrated with CGD to be applicable for learning CNNs.
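For concreteness, here is a minimal sketch of the soft- and hard-thresholding maps associated with the $\ell_1$ and $\ell_0$ penalties (the T$\ell_1$ proximal map also has a closed form [45] but is omitted here); the function names and the threshold parameter t are ours.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal map of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_threshold(x, t):
    """Proximal map of t*||.||_0: zero out entries with |x_i| <= sqrt(2*t)."""
    return np.where(np.abs(x) > np.sqrt(2.0 * t), x, 0.0)

x = np.array([0.05, -0.3, 1.2, -0.01])
print(soft_threshold(x, 0.1))  # approx [ 0.  -0.2  1.1  0. ]
print(hard_threshold(x, 0.1))  # threshold sqrt(0.2) ~ 0.447 -> [0. 0. 1.2 0.]
```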
As pointed out in [25], it is beneficial to attain sparsity during the optimization (training) process.
Contribution. We propose a Relaxed Variable Splitting (RVS) approach combining thresholding and CGD for minimizing the augmented objective function (5), for a positive parameter β. We note in passing that minimizing L_β in u recovers the original objective (4) with the penalty P replaced by its Moreau envelope [27]. We shall prove that our algorithm (RVSCGD), alternately minimizing in u and w, converges for the $\ell_0$, $\ell_1$, and T$\ell_1$ penalties to a global limit (w̄, ū) with high probability. A key estimate is the Lipschitz inequality of the expected coarse gradient (Lemma 5.3). The descent of the Lagrangian function (5) and the control of the angle between the iterated w and w* then follow. The limit w̄ is a novel thresholded version of the true weight w*, modulo normalization, and ū is a sparse approximation of w*. To the best of our knowledge, this result is the first to establish the convergence of CGD for sparse-weight, binarized-activation networks. In numerical experiments, we observed that the ū limit of RVSCGD with the $\ell_0$ penalty recovers a sparse w* accurately.
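For reference, a standard relaxed-variable-splitting form of the augmented objective is

L_β(u, w) = f(w) + λ P(u) + (β/2) ‖w − u‖²,

recorded here as a reconstruction, since display (5) is not reproduced above; the quadratic coupling term is our assumption, consistent with the Moreau-envelope remark. Minimizing over u with w fixed yields the thresholding step, and the remaining objective in w is then (4) with λP replaced by its Moreau envelope.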
Outline. In Section 2, we briefly review related mathematical results on neural networks and complexity reduction. Preliminaries are in Section 3. In Section 4, we state and discuss the main results. The proofs of the main results are in Section 5, and numerical results are in Section 6. The conclusion is in Section 7, and the acknowledgement in Section 8.

Related Work
In recent years, significant progress has been made in the study of convergence in neural network training. From a theoretical point of view, optimizing (training) a neural network is a non-convex, non-smooth optimization problem. [2,24,33] showed that training a neural network is hard in the worst case. [34] showed that if either the target function or the input distribution is "nice", optimization algorithms used in practice can succeed. Optimization methods for deep neural networks are often categorized into (stochastic) gradient descent methods and others.
Stochastic gradient descent methods were first proposed by [31]. The popular back-propagation algorithm was introduced in [32]. Since then, many well-known SGD methods with adaptive learning rates have been proposed and applied in practice, such as Polyak momentum [29], AdaGrad [16], RMSProp [37], Adam [21], and AMSGrad [30]. The behavior of gradient descent methods in neural networks is better understood when the input has a Gaussian distribution. [36] showed that population gradient descent can recover the true weight vector with random initialization for a one-layer, one-neuron model. [5] proved that a convolution filter with non-overlapping input can be learned in polynomial time. [15] showed that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time, with a convergence rate depending on the smoothness of the input distribution and the closeness of patches. [14] analyzed the polynomial convergence guarantee of a randomly initialized gradient descent algorithm for learning a one-hidden-layer convolutional neural network. A hybrid projected SGD (the so-called BinaryConnect) is widely used for training various weight-quantized DNNs [9,43]. Recently, a Moreau-envelope-based relaxation method (BinaryRelax) was proposed and analyzed to advance weight quantization in DNN training [41]. A blended coarse gradient descent method [42] was also introduced to train DNNs with fully quantized weights and activation functions and to overcome vanishing gradients. For earlier work on the coarse gradient (a.k.a. straight-through estimator), see [6,18,20] among others.
Non-SGD methods for deep learning include the Alternating Direction Method of Multipliers (ADMM), which transforms a fully-connected neural network into an equality-constrained problem [35], and the method of auxiliary coordinates (MAC), which replaces a nested neural network with a constrained problem without nesting [8]. [47] handled the deep supervised hashing problem by an ADMM algorithm to overcome vanishing gradients. For a model similar to (5) and its treatment in a general context, see [1]; in image processing, see [40]. For analysis and computation on minimization of (5) to learn a neural network with sparse weights and the regular ReLU function, see [12].

Preliminaries
Consider a non-overlap one-layer convolutional network, where the rows of the input feature Z ∈ R^{k×d} are i.i.d. Gaussian random vectors with zero mean and unit variance. Let G denote this distribution. Let σ be the binarized ReLU function σ(x) := χ_{x>0}. We define the training sample loss l(w, Z) as in (6), where w* ∈ R^d is the underlying (non-zero) teaching parameter. Note that (6) is invariant under the scaling w → w/c, w* → w*/c for any scalar c > 0. Without loss of generality, we assume ‖w*‖ = 1. Given independent training samples {Z_1, ..., Z_N}, the associated empirical risk minimization reads as in (7). The empirical risk function in (7) is piecewise constant and has an almost everywhere zero partial gradient in w. If σ were differentiable, then back-propagation would rely on the chain-rule expression (8). However, σ has zero derivative a.e., rendering (8) inapplicable. We study coarse gradient descent with σ' in (8) replaced by the (sub)derivative μ' of the regular ReLU function μ(x) := max(x, 0). More precisely, we use a surrogate g(w, Z) of ∂l/∂w(w, Z), with μ'(x) = σ(x). The constant 2π in the definition of g will be necessary to give a stronger result for our main findings. To simplify the analysis, we let N ↑ ∞ in (7), so that the coarse gradient approaches E_Z[g(w, Z)].
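The following sketch illustrates the straight-through idea in this setting: the chain rule for the per-sample gradient is kept, but the a.e. zero derivative σ' is replaced by μ'(x) = σ(x). The sum-over-patches output and the omission of the constant prefactor from the definition of g are simplifying assumptions of this sketch.

```python
import numpy as np

def binarized_relu(z):
    return (z > 0).astype(float)

def coarse_gradient(w, w_star, Z):
    """Surrogate of dl/dw for one sample Z: the chain rule is kept, but the
    a.e. zero derivative sigma' is replaced by mu'(x) = sigma(x) (the
    straight-through estimator). The sum-over-patches output and the omitted
    constant prefactor are assumptions of this sketch."""
    residual = binarized_relu(Z @ w).sum() - binarized_relu(Z @ w_star).sum()
    return residual * (Z.T @ binarized_relu(Z @ w))
```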
The following lemma asserts that E_Z[g(w, Z)] has a positive correlation with the true gradient ∇f(w); consequently, −E_Z[g(w, Z)] gives a reasonable descent direction.

Lemma 3.1. If θ(w, w*) ∈ (0, π) and w ≠ 0, then the inner product between the expected coarse gradient and the true gradient w.r.t. w is positive.

Suppose we want to train the network so that w^t converges to a limit w̄ in some neighborhood of w*, and we also want to promote sparsity in the limit w̄. To this end, a natural function to minimize is (for a parameter λ > 0) φ(w) = f(w) + λ‖w‖_1, where other choices of the sparse penalty P include $\ell_0$ and T$\ell_1$. Our proposed relaxed variable splitting (RVS) proceeds by first extending φ into a function of two variables, f(w) + λ‖u‖_1, then minimizing the Lagrangian function in (5) alternately in u and w. Minimization in u is the thresholding operation from the penalty P, and is in closed form for $\ell_1$, $\ell_0$, and T$\ell_1$. Minimization in w is through CGD. The splitting realizes sparsity more effectively than placing P directly under CGD in the case of $\ell_1$ (T$\ell_1$), and bypasses the non-existence of a gradient in the case of $\ell_0$. The resulting RVSCGD algorithm is summarized as follows: the update of w^t has the form w^{t+1} = C_t (w^t − η E_Z[g(w^t, Z)] − ηβ(w^t − u^{t+1})), where C_t is a normalization constant. This normalization step is unique to our proposed algorithm and distinguishes it from other common descent algorithms, for example ADMM, where the update of w has the form w^{t+1} ← arg min_w L_β(u^{t+1}, w, z^t) with z^t the Lagrange multiplier. Since f is non-convex and only Lipschitz differentiable away from zero, convergence analysis of ADMM in this case is beyond the current theory [39]. Here we circumvent the problem by updating w via CGD and then normalizing.
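A minimal sketch of one RVSCGD iteration with the $\ell_1$ penalty follows the update order described above: a u-step by soft-thresholding of w with threshold λ/β, then a w-step by a coarse gradient step plus the splitting term, followed by normalization. It reuses the coarse_gradient sketch above; the Monte Carlo estimate of E_Z[g(w, Z)] and all parameter names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rvscgd_step(w, u, batch, w_star, lam, beta, eta):
    """One RVSCGD iteration with the l1 penalty (sketch).
    u-step: argmin_u lam*||u||_1 + (beta/2)*||w - u||^2 = soft-thresholding of w.
    w-step: coarse gradient step plus splitting term, then renormalize."""
    u_next = soft_threshold(w, lam / beta)
    # Monte Carlo estimate of E_Z[g(w, Z)] via the coarse_gradient sketch above.
    g = np.mean([coarse_gradient(w, w_star, Z) for Z in batch], axis=0)
    w_raw = w - eta * g - eta * beta * (w - u_next)
    w_next = w_raw / np.linalg.norm(w_raw)  # normalization constant C_t
    return u_next, w_next
```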

Main Results
Theorem 1. Suppose that the initialization and penalty parameters of the RVSCGD algorithm satisfy the required bounds (in particular, λ < k/(2π√d)), and that the learning rate η is small enough that η ≤ min{1/(β+L), 2π/k}, where L is the Lipschitz constant in Lemma 5.3. Then the Lagrangian L_β(u^t, w^t) with the $\ell_1$ penalty is monotonically decreasing, and (u^t, w^t) converges to a limit point (ū, w̄). Let θ̄ := θ(w̄, w*) and γ̄ := θ(ū, w̄); then θ̄ < δ, and the critical point (ū, w̄) satisfies equation (10), where S_{λ/β} is the soft-thresholding operator of $\ell_1$ and C is a positive constant with C ≤ k/(k − 2πλ√d); moreover, the error bounds (11)-(12) hold.

Remark. The sign of (w̄ − S_{λ/β}(w̄)) agrees with that of w̄; thus w̄ is a soft-thresholded version of w*, after some normalization. The assumption on η is reasonable, as will be shown below: ‖E_Z[g(w^t, Z)]‖ is bounded away from zero, and thus E_Z[g(w^t, Z)] + β(w^t − u^{t+1}) is also bounded.
Corollary 1.1. Suppose that the initialization of the RVSCGD algorithm satisfies the conditions in Theorem 1, and that the $\ell_1$ penalty is replaced by $\ell_0$ or T$\ell_1$. Then the RVSCGD iterations converge to a limit point (ū, w̄) satisfying equation (10), with the hard-thresholding operator of $\ell_0$ [3] or the T$\ell_1$ thresholding operator [45] replacing S_{λ/β}, and similar bounds (11)-(12) hold.

Proof of Theorem 1
The following lemmas give the properties of the coarse gradient, as well as an outline of the proof of Theorem 1. The details of each key step can be found in the proof of the corresponding lemma.
Lemma 5.1. If every entry of Z is i.i.d. sampled from N(0, 1), ‖w*‖ = 1, and w ≠ 0, then, for θ(w, w*) ∈ (0, π), the true gradient ∇f(w) of the population loss and the expected coarse gradient w.r.t. w admit explicit closed-form expressions; see [42].

Lemma 5.2 (Properties of the true gradient). Given w_1, w_2 with min{‖w_1‖, ‖w_2‖} = c > 0 and max{‖w_1‖, ‖w_2‖} = C, there exists a constant L_f > 0 depending on c and C such that ‖∇f(w_1) − ∇f(w_2)‖ ≤ L_f ‖w_1 − w_2‖. Moreover, ‖∇f(w)‖ admits an explicit upper bound.

Lemmas 5.1 and 5.2 follow directly from [42].

Lemma 5.3 (Properties of the expected coarse gradient). If w_1, w_2 satisfy ‖w_1‖ = ‖w_2‖ = 1 and θ(w_1, w*), θ(w_2, w*) ∈ (0, π) with θ(w_2, w*) ≤ θ(w_1, w*), then there exists a constant K = k/(2π) such that the expected coarse gradient satisfies the descent condition with constant K. Moreover, there exists a constant L = k/(4π) such that the Lipschitz inequality ‖E_Z[g(w_1, Z)] − E_Z[g(w_2, Z)]‖ ≤ L ‖w_1 − w_2‖ holds.

Remark. Here we show that the coarse gradient satisfies the descent condition under rather specific conditions. As shown in Lemma 5.4, we have θ_{t+1} ≤ θ_t; and the normalization in our RVSCGD algorithm guarantees ‖w^{t+1}‖ = ‖w^t‖ = 1. Thus the result of Lemma 5.3 suffices for our proof.

Proof. First suppose ‖w_1‖ = ‖w_2‖ = 1. By Lemma 5.3 of [42], we have a closed-form expression for E_Z[g(w_j, Z)], j = 1, 2. Consider the plane spanned by w_j and w*; since ‖w*‖ = 1, the vectors w_j and w* form an isosceles triangle (see Fig. 2). Simple geometry then yields the relation that simplifies the expected coarse gradient, which in turn implies the descent condition with K = k/(2π). The first claim is proved.
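As a sanity check on this geometry, one elementary identity for the unit-vector configuration is ‖w_j − w*‖ = 2 sin(θ_j/2) with θ_j = θ(w_j, w*); whether this is the exact relation used in the omitted display cannot be confirmed from the text above, but it can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = rng.normal(size=5); w_star /= np.linalg.norm(w_star)
w = rng.normal(size=5); w /= np.linalg.norm(w)
theta = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
# Two unit vectors and the origin form an isosceles triangle with apex angle theta,
# whose base has length ||w - w_star|| = 2*sin(theta/2).
assert np.isclose(np.linalg.norm(w - w_star), 2.0 * np.sin(theta / 2.0))
print(theta, np.linalg.norm(w - w_star))
```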

Proof of Corollary 1.1
Lemma 5.7 ([3]). Let f_{λ,x}(y) = (1/2)(y − x)² + λ‖y‖_0. Then y*_λ(x) = arg min_y f_{λ,x}(y) is given by the $\ell_0$ hard-thresholding map: y*_λ(x) = x if |x| > √(2λ), y*_λ(x) = 0 if |x| < √(2λ), and y*_λ(x) ∈ {0, x} if |x| = √(2λ) (a numeric check of this closed form is given after the proof outline below).

We proceed by an outline similar to the proof of Theorem 1.

Step 1. First we show that L_{β,Tℓ1}(u^t, w^t) and L_{β,ℓ0}(u^t, w^t), i.e., the Lagrangian L_β with the T$\ell_1$ and $\ell_0$ penalties, both decrease under the updates of u^t and w^t. To see this, notice that the update of u^t decreases L_{β,Tℓ1}(u^t, w^t) and L_{β,ℓ0}(u^t, w^t) by definition. Then, for fixed u = u^{t+1}, the update of w^t decreases L_{β,Tℓ1}(u^t, w^t) and L_{β,ℓ0}(u^t, w^t) by an argument similar to that in the proof of Theorem 1.
Step 3. Finally, the equilibrium condition from equation (22) still holds at the critical point, and a similar argument shows that θ(w̄, w*) < δ.
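Returning to Lemma 5.7, a quick numeric sanity check (with illustrative values) compares the closed form against direct evaluation of f_{λ,x} at the two candidate minimizers 0 and x.

```python
import numpy as np

def f_obj(y, x, lam):
    """f_{lam,x}(y) = 0.5*(y - x)^2 + lam*||y||_0 for scalar y."""
    return 0.5 * (y - x) ** 2 + lam * float(y != 0)

def hard_threshold_scalar(x, lam):
    """Closed form from Lemma 5.7: keep x if |x| > sqrt(2*lam), else 0."""
    return x if abs(x) > np.sqrt(2.0 * lam) else 0.0

for x, lam in [(0.3, 0.1), (0.5, 0.1), (-1.0, 0.4)]:
    y_star = hard_threshold_scalar(x, lam)
    # Any minimizer is either 0 or x; check the closed form attains the smaller value.
    assert f_obj(y_star, x, lam) <= min(f_obj(0.0, x, lam), f_obj(x, x, lam)) + 1e-12
    print(x, lam, y_star)
```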

Conclusion
We introduced a relaxed variable splitting coarse gradient descent method to learn a one-hidden-layer neural network with sparse weights and binarized activation in a regression setting. The proof is based on the descent of a Lagrangian function and on controlling the angle between the sparse and true weights, and applies to the $\ell_1$, $\ell_0$, and T$\ell_1$ sparse penalties. We plan to extend our work to the classification setting in the future.

Acknowledgement
The work was partially supported by NSF grant IIS-1632935. The authors thank Dr. Penghang Yin for helpful comments.