
ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 24 February 2021
Sec. Mathematics of Computation and Data Science
Volume 6 - 2020 | https://doi.org/10.3389/fams.2020.529564

Structured Sparsity of Convolutional Neural Networks via Nonconvex Sparse Group Regularization

Kevin Bui1, Fredrick Park2, Shuai Zhang1, Yingyong Qi1 and Jack Xin1*
  • 1Department of Mathematics, University of California, Irvine, Irvine, CA, United States
  • 2Department of Mathematics and Computer Science, Whittier College, Whittier, CA, United States

Convolutional neural networks (CNNs) have recently achieved superior accuracy and performance in various imaging applications, such as classification, object detection, and segmentation. However, a highly accurate CNN model requires millions of parameters to be trained and utilized, and even a slight increase in performance typically requires significantly more parameters through additional layers and/or more filters per layer. Many of these weight parameters turn out to be redundant and extraneous, so the original, dense model can be replaced by a compressed version attained by imposing inter- and intra-group sparsity onto the layer weights during training. In this paper, we propose a nonconvex family of sparse group lasso that blends nonconvex regularization (e.g., transformed ℓ1, ℓ1 − ℓ2, and ℓ0), which induces sparsity on the individual weights, with ℓ2,1 regularization on the output channels of a layer. We apply variable splitting to the proposed regularization to develop an algorithm that consists of two steps per iteration: gradient descent and thresholding. Numerical experiments on various CNN architectures showcase the effectiveness of the nonconvex family of sparse group lasso in network sparsification and test accuracy on par with the current state of the art.

1 Introduction

Deep neural networks (DNNs) have proven to be advantageous for numerous modern computer vision tasks involving image or video data. In particular, convolutional neural networks (CNNs) yield highly accurate models with applications in image classification [28, 39, 77, 95], semantic segmentation [13, 49], and object detection [30, 72, 73]. These large models often contain millions of weight parameters, frequently exceeding the number of training data. This is a double-edged sword: on one hand, large models allow for high accuracy, while on the other, they contain many redundant parameters that lead to overparametrization. Overparametrization is a well-known phenomenon in DNN models [6, 17] that results in overfitting, learning useless random patterns in data [96], and inferior generalization. Additionally, these models possess exorbitant computational and memory demands during both training and inference. Consequently, they may not be applicable for devices with low computational power and memory.

Resolving these problems requires compressing the networks through sparsification and pruning. Although removing weights might affect the accuracy and generalization of the models, previous works [25, 54, 66, 81] demonstrated that many networks can be substantially pruned with negligible effect on accuracy. There are many systematic approaches to achieving sparsity in DNNs, as discussed extensively in Refs. 14 and 15.

Han et al. [26] proposed to first train a dense network, prune it afterward by setting the weights to zeroes if below a fixed threshold, and retrain the network with the remaining weights. Jin et al. [32] extended this method by restoring the pruned weights, training the network again, and repeating the process. Rather than pruning by thresholding, Aghasi et al. [1, 2] proposed Net-Trim, which prunes an already trained network layer by layer using convex optimization in order to ensure that the layer inputs and outputs remain consistent with the original network. For CNNs in particular, filter or channel pruning is preferred because it significantly reduces the amount of weight parameters required compared to individual weight pruning. Li et al. [43] calculated the sums of absolute weights of the filters of each layer and pruned the ones with the smallest sums. Hu et al. [29] proposed a metric called average percentage of zeroes for channels to measure their redundancies and pruned those with highest values for each layer. Zhuang et al. [105] developed discrimination-aware channel pruning that selects channels that contribute to the network’s discriminative power.

An alternative approach to pruning a dense network is learning a compressed structure from scratch. A conventional approach is to optimize the loss function equipped with either ℓ1 or ℓ2 regularization, which drives the weights to zeroes or to very small values during training. To learn which groups of weights (e.g., neurons, filters, channels) are necessary, group regularization, such as group lasso [93] and sparse group lasso [76], is added to the loss function. Alvarez and Salzmann [4] and Scardapane et al. [75] applied group lasso and sparse group lasso to various architectures and obtained compressed networks with comparable or even better accuracy. Instead of sharing features among the weights as suggested by group sparsity, exclusive sparsity [104] promotes competition for features between different weights. This method was investigated by Yoon and Hwang [92]. In addition, they combined it with group sparsity and demonstrated that this combination resulted in compressed networks with better performance than their original counterparts. Nonconvex regularization has also been examined. Louizos et al. [54] proposed a practical algorithm using probabilistic methods to perform ℓ0 regularization on CNNs. Ma et al. [61] proposed integrated transformed ℓ1, a convex combination of transformed ℓ1 and group lasso, and compared its performance against the aforementioned group regularization methods.

In this paper, we propose a family of group regularization methods that balances group lasso for group-wise sparsity with nonconvex regularization for element-wise sparsity. The family extends sparse group lasso by replacing its ℓ1 penalty term with a nonconvex penalty term. The nonconvex penalty terms considered are ℓ0, ℓ1 − αℓ2, transformed ℓ1, and SCAD. The proposed family is expected to yield a more accurate and/or more compressed network than sparse group lasso, since ℓ1, being a convex relaxation of ℓ0, suffers from various weaknesses such as estimation bias. We develop an algorithm to optimize loss functions equipped with the proposed nonconvex group regularization terms for DNNs.

2 Model and Algorithm

2.1 Preliminaries

Given a training dataset consisting of N input-output pairs $\{(x_i, y_i)\}_{i=1}^N$, the weight parameters of a DNN are learned by optimizing the following objective function:

$$\min_{W}\; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(h(x_i, W), y_i) + \lambda \mathcal{R}(W), \tag{1}$$

where

  • $W$ is the set of weight parameters of the DNN.

  • $h(\cdot, \cdot)$ is the output of the DNN used for prediction.

  • $\mathcal{L}(\cdot, \cdot) \geq 0$ is the loss function that compares the prediction $h(x_i, W)$ with the ground-truth output $y_i$. Examples include the cross-entropy loss for classification and the mean-squared error for regression.

  • $\mathcal{R}(\cdot)$ is the regularizer on the set of weight parameters $W$.

  • $\lambda > 0$ is a regularization parameter for $\mathcal{R}(\cdot)$.

The most common regularizer used for DNNs is ℓ2 regularization $\|\cdot\|_2^2$, also known as weight decay. It prevents overfitting and improves generalization because it enforces the weights to decrease proportionally to their magnitudes [40]. Sparsity can be imposed by pruning weights whose magnitudes are below a certain threshold at each iteration during training. However, an alternative regularizer is the ℓ1 norm $\|\cdot\|_1$, also known as the lasso penalty [78]. The ℓ1 norm is the tightest convex relaxation of the ℓ0 penalty [20, 23, 82], and it yields a sparse solution that is found on the corners of the ℓ1-norm ball [27, 52]. Theoretical results justify the ℓ1 norm's ability to reconstruct sparse solutions in compressed sensing. When a sensing matrix satisfies the restricted isometry property, ℓ1 minimization recovers the sparse solution exactly with high probability [11, 23, 82]. On the other hand, the null space property is a necessary and sufficient condition for ℓ1 minimization to guarantee exact recovery of sparse solutions [16, 23]. Being able to yield sparse solutions, the ℓ1 norm has gained popularity in other types of inverse problems such as compressed imaging [33, 57] and image segmentation [34, 35, 42] and in various fields of application such as geoscience [74], medical imaging [33, 57], machine learning [10, 36, 67, 78, 89], and traffic flow networks [91]. Unfortunately, element-wise sparsity by ℓ1 or ℓ2 regularization in CNNs may not yield meaningful speedup since the number of filters and channels required for computation and inference may remain the same [86].

To determine which filters or channels are relevant in each layer, group sparsity using the group lasso penalty [93] is considered. The group lasso penalty has been utilized in various applications, such as microarray data analysis [62], machine learning [7, 65], and EEG data [46]. Suppose a DNN has L layers, so the set of weight parameters $W$ is divided into L sets of weights: $W = \{W_l\}_{l=1}^L$. The weight set of each layer, $W_l$, is divided into $N_l$ groups (e.g., channels or filters): $W_l = \{w_{l,g}\}_{g=1}^{N_l}$. The group lasso penalty applied to $W_l$ is formulated as

$$\mathcal{R}_{GL}(W_l) = \sum_{g=1}^{N_l} \sqrt{\#w_{l,g}}\,\|w_{l,g}\|_2 = \sum_{g=1}^{N_l} \sqrt{\#w_{l,g}}\sqrt{\sum_{i=1}^{\#w_{l,g}} w_{l,g,i}^2}, \tag{2}$$

where $w_{l,g,i}$ denotes the weight parameter with index $i$ in group $g$ of layer $l$, and $\#w_{l,g}$ denotes the number of weight parameters in group $g$ of layer $l$. Because group sizes vary, the constant $\sqrt{\#w_{l,g}}$ is multiplied in order to rescale the ℓ2 norm of each group with respect to the group size, ensuring that each group is weighted uniformly [65, 76, 93]. The group lasso regularizer imposes the ℓ2 norm on each group, forcing the weights of the same group to decrease together at every iteration during training. As a result, groups of weights are pruned when their ℓ2 norms are negligible, resulting in a highly compact network compared to element-sparse networks.
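As a concrete illustration, the following is a minimal PyTorch-style sketch (not the authors' implementation) of the group lasso penalty in Eq. 2 for a single layer, assuming the output channels (filters) form the groups:

```python
import torch

def group_lasso_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Group lasso penalty of Eq. 2 for a weight tensor of shape
    (out_channels, in_channels, kH, kW), with one group per output channel."""
    groups = weight.reshape(weight.shape[0], -1)   # flatten each filter into one group
    group_size = groups.shape[1]                   # #w_{l,g}, identical for all groups here
    # sqrt(group size) * ||w_{l,g}||_2, summed over the groups of the layer
    return (group_size ** 0.5) * groups.norm(p=2, dim=1).sum()

# Illustrative usage: add the penalty of one convolutional layer to the data loss.
# loss = data_loss + lam * group_lasso_penalty(conv.weight)
```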

As an alternative to group lasso that encourages feature sharing, exclusive sparsity [104] enforces the model weight parameters to compete for features, making the features discriminative for each class in the context of classification. The regularization for exclusive sparsity is

$$\frac{1}{2}\sum_{g=1}^{N_l} \|w_{l,g}\|_1^2 = \frac{1}{2}\sum_{g=1}^{N_l} \left(\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|\right)^2. \tag{3}$$

Now, within each group, sparsity is enforced. Because exclusivity cannot guarantee the optimal features, since some features do need to be shared, exclusive sparsity can be combined with group sparsity to form combined group and exclusive sparsity (CGES) [92]. CGES is formulated as

$$\mathcal{R}_{CGES}(W_l) = \sum_{g=1}^{N_l} \left[(1-\mu_l)\sqrt{\sum_{i=1}^{\#w_{l,g}} w_{l,g,i}^2} + \frac{\mu_l}{2}\left(\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|\right)^2\right], \tag{4}$$

where $\mu_l \in (0,1)$ is a parameter for balancing exclusivity and sharing among features.

To obtain an even sparser network, element-wise sparsity and group sparsity can be combined and applied together to the training of DNNs. One regularizer that combines these two types of sparsity is the sparse group lasso penalty [76], which is formulated as

$$\mathcal{R}_{SGL_1}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_1, \tag{5}$$

where

$$\|W_l\|_1 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|.$$

Sparse group lasso simultaneously enforces group sparsity through the regularizer $\mathcal{R}_{GL}(\cdot)$ and element-wise sparsity through the ℓ1 norm. This regularizer has been used in machine learning [83], bioinformatics [48, 103], and medical imaging [47].

Figure 1 demonstrates the differences between lasso, group lasso, and sparse group lasso applied to a weight matrix connecting a 5-dimensional input layer to a 10-dimensional output layer. In white, the entries are zeroed out; in gray, they are not. Unlike lasso, group lasso results in a more structured method of pruning since three of the five neurons can be zeroed out. Combined with ℓ1 regularization on the individual weights, sparse group lasso allows more weights in the remaining two neurons to be pruned.


FIGURE 1. Comparison between lasso, group lasso, and sparse group lasso applied to a weight matrix. Entries in white are zeroed out or removed; entries in gray remain.

2.2 Nonconvex Sparse Group Lasso

We recall that the ℓ1 norm is the tightest convex relaxation of the ℓ0 penalty, given by

$$\|W_l\|_0 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|_0, \tag{6}$$

where

$$|w|_0 = \begin{cases} 1 & \text{if } w \neq 0,\\ 0 & \text{if } w = 0,\end{cases}$$

when applied to the weight set $W_l$ of layer $l$. The ℓ0 penalty is nonconvex and discontinuous, and any ℓ0-regularized problem is NP-hard [23]. These properties make developing convergent and tractable algorithms for ℓ0-regularized problems difficult, thereby making ℓ1-regularized problems better alternatives to solve. However, ℓ0-regularized problems have been shown to recover better solutions in terms of sparsity and/or accuracy than ℓ1-regularized problems in various applications, such as compressed sensing [56], image restoration [8, 12, 19, 55, 102], MRI reconstruction [80], and machine learning [56, 94]. In particular, ℓ0-regularized inverse problems were demonstrated to be more robust against Poisson noise than ℓ1-regularized inverse problems [100].

A continuous alternative to the ℓ0 penalty is the SCAD penalty term [22, 58], given by

$$\lambda\|W_l\|_{SCAD(a)} = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} \lambda|w_{l,g,i}|_{SCAD(a)}, \tag{7}$$

where

$$\lambda|w|_{SCAD(a)} := \begin{cases} \lambda|w| & \text{if } |w| < \lambda,\\[4pt] \dfrac{2a\lambda|w| - w^2 - \lambda^2}{2(a-1)} & \text{if } \lambda \leq |w| < a\lambda,\\[4pt] (a+1)\lambda^2/2 & \text{if } |w| \geq a\lambda,\end{cases}$$

for λ > 0 and a > 2. This penalty term enjoys three properties – unbiasedness, sparsity, and continuity – while the ℓ1 norm, on the other hand, has only sparsity and continuity [22]. In linear and logistic regression, SCAD was shown to outperform ℓ1 in variable selection [22]. SCAD has been applied to wavelet approximation [5], bioinformatics [9, 84], and compressed sensing [64].

The transformed ℓ1 penalty term [68] also enjoys the properties of unbiasedness, sparsity, and continuity [58]. In fact, the regularizer is not just continuous but Lipschitz continuous [98]. The term is given by

$$\|W_l\|_{TL_1(a)} = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|_{TL_1(a)}, \tag{8}$$

where

$$|w|_{TL_1(a)} = \frac{(a+1)|w|}{a+|w|}.$$

In addition, it interpolates the ℓ0 and ℓ1 penalties through the parameter a [98] because

$$\lim_{a \to 0^+} |w|_{TL_1(a)} = |w|_0 \quad \text{and} \quad \lim_{a \to \infty} |w|_{TL_1(a)} = |w|.$$

The transformed ℓ1 penalty term has been investigated and shown to outperform ℓ1 in compressed sensing [79, 97, 98], deep learning [45, 61, 87], matrix completion [99], and epidemic forecasting [45].

Another Lipschitz continuous, nonconvex regularizer is the ℓ1 − αℓ2 penalty given by

$$\|W_l\|_{1-\alpha 2} = \|W_l\|_1 - \alpha\|W_l\|_2 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}| - \alpha\sqrt{\sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} w_{l,g,i}^2}, \tag{9}$$

where $\alpha \in (0,1]$. In a series of works [50–52, 90], the ℓ1 − ℓ2 penalty with α = 1 was shown to yield better solutions than ℓ1 in various compressed sensing applications, especially when the sensing matrix is highly coherent or violates the restricted isometry property. To guarantee exact recovery of sparse solutions, ℓ1 − ℓ2 only requires a relaxed variant of the null space property [79]. Furthermore, ℓ1 − αℓ2 is more robust against impulsive noise in yielding sparse, accurate solutions for inverse problems than ℓ1 [44]. Besides compressed sensing, it has been utilized in image denoising and deblurring [53], image segmentation [71], image inpainting [63], and hyperspectral demixing [21]. In deep learning applications, ℓ1 − ℓ2 regularization was used to learn permutation matrices [59] for ShuffleNet [60, 101].

Due to the advantages and recent successes of the aforementioned nonconvex regularizers, we propose to replace the ℓ1 norm in Eq. 5 with nonconvex penalty terms. Hence, we propose a family of group regularizers called nonconvex sparse group lasso. The family includes the following:

$$\mathcal{R}_{SGL_0}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_0 \tag{10}$$
$$\mathcal{R}_{SGSCAD(a)}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_{SCAD(a)} \tag{11}$$
$$\mathcal{R}_{SGTL_1(a)}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_{TL_1(a)} \tag{12}$$
$$\mathcal{R}_{SGL_1-\alpha L_2}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_{1-\alpha 2}. \tag{13}$$

Using these regularizers, we expect to obtain a sparser and/or more accurate network than with the original sparse group lasso. The ℓ1 norm can also be replaced with other nonconvex penalties not mentioned in this paper; see Refs. 3 and 85 for examples. However, we focus on the aforementioned nonconvex regularizers because they have closed-form proximal operators, which are required by the proposed algorithm described in the next section.
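To make the element-wise terms concrete, the following Python sketch (our illustration, not the authors' code) evaluates the ℓ0, transformed ℓ1, and ℓ1 − αℓ2 penalties of Eqs. 6, 8, and 9 on a weight tensor; the defaults a = 1.0 and α = 1.0 match the values used in the experiments of Section 3.

```python
import torch

def l0_penalty(w: torch.Tensor) -> torch.Tensor:
    # ||W||_0: number of nonzero entries (Eq. 6).
    return (w != 0).float().sum()

def transformed_l1_penalty(w: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    # ||W||_{TL1(a)}: sum of (a + 1)|w| / (a + |w|) over all entries (Eq. 8).
    absw = w.abs()
    return ((a + 1.0) * absw / (a + absw)).sum()

def l1_minus_alpha_l2_penalty(w: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # ||W||_1 - alpha * ||W||_2 (Eq. 9).
    return w.abs().sum() - alpha * w.norm(p=2)
```

Each of these element-wise terms would be added to the group lasso penalty of Eq. 2 to form the corresponding regularizer in Eqs. 10–13.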

2.3 Notations and Definitions

Before discussing the algorithm, we summarize notations that we will use to save space. They are the following:

  • If $V = \{V_l\}_{l=1}^L$ and $W = \{W_l\}_{l=1}^L$, then $(V, W) := (\{V_l\}_{l=1}^L, \{W_l\}_{l=1}^L) = (V_1, \ldots, V_L, W_1, \ldots, W_L)$.

  • $V^+ := V^{k+1}$ and, analogously, $W^+ := W^{k+1}$.

  • $\tilde{\mathcal{L}}(W) := \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(h(x_i, W), y_i)$.

In addition, we define the proximal operator for a regularization function $r(\cdot)$ as

$$\operatorname{prox}_{\lambda r}(y) = \operatorname*{arg\,min}_{x}\; \lambda r(x) + \frac{1}{2}\|x - y\|_2^2$$

for λ > 0.

2.4 Numerical Optimization

We develop a general algorithm framework to solve

$$\min_{W}\; \tilde{\mathcal{L}}(W) + \lambda\sum_{l=1}^{L} \mathcal{R}(W_l) = \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \lambda r(W_l)\right], \tag{14}$$

where $W = \{W_l\}_{l=1}^L$, $\mathcal{R}$ is either $\mathcal{R}_{SGL_1}$ in Eq. 5 or one of the nonconvex regularizers in Eqs. 10–13, and $r(\cdot)$ is the corresponding element-wise sparsity-inducing regularizer. Throughout the paper, our assumption on Eq. 14 is the following:

Assumption 1. The function $\tilde{\mathcal{L}}$ is continuously differentiable with respect to $W_l$ for each $l = 1, \ldots, L$.

By introducing an auxiliary variable $V = \{V_l\}_{l=1}^L$ for Eq. 14, we obtain a constrained optimization problem:

$$\min_{V,W}\; \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left(\lambda\mathcal{R}_{GL}(W_l) + \lambda r(V_l)\right) \quad \text{s.t.}\quad V_l = W_l, \quad l = 1, \ldots, L. \tag{15}$$

The constraints can be relaxed by adding quadratic penalty terms with β > 0, so that we have

$$\min_{V,W} F_\beta(V, W) := \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \lambda r(V_l) + \frac{\beta}{2}\|V_l - W_l\|_2^2\right]. \tag{16}$$

With β fixed, Eq. 16 can be solved by alternating minimization:

$$W^{k+1} = \operatorname*{arg\,min}_{W} F_\beta(V^k, W), \tag{17a}$$
$$V^{k+1} = \operatorname*{arg\,min}_{V} F_\beta(V, W^{k+1}). \tag{17b}$$

To solve Eq. 17a, we simultaneously update $W_l$ for $l = 1, \ldots, L$ by gradient descent,

$$W_l^{k+1} = W_l^k - \gamma\left[\nabla_{W_l}\tilde{\mathcal{L}}(W^k) + \lambda\,\partial_{W_l}\mathcal{R}_{GL}(W_l^k) - \beta(V_l^k - W_l^k)\right], \tag{18}$$

where γ > 0 is the learning rate and $\partial_{W_l}\mathcal{R}_{GL}$ is the subdifferential of $\mathcal{R}_{GL}$ with respect to $W_l$. In practice, Eq. 18 is performed using stochastic gradient descent (or one of its variants) with mini-batches because of the large amounts of data and weight parameters involved in a typical DNN.

To update V, we see that Eq. 17b can be rewritten as

$$V^{k+1} = \operatorname*{arg\,min}_{V} \sum_{l=1}^{L}\left(\frac{\lambda}{\beta} r(V_l) + \frac{1}{2}\|V_l - W_l^{k+1}\|_2^2\right) = \left(\operatorname{prox}_{\frac{\lambda}{\beta} r}(W_1^{k+1}), \ldots, \operatorname{prox}_{\frac{\lambda}{\beta} r}(W_L^{k+1})\right). \tag{19}$$

The proximal operators of the considered regularizers have closed-form solutions given by thresholding functions; as a result, the V update simplifies to thresholding W. The regularization functions and their corresponding proximal operators are summarized in Table 1.


TABLE 1. Regularization penalties and their corresponding proximal operators with λ>0.
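As an illustration of two of these thresholding functions, the sketch below implements the proximal operators of the ℓ1 norm (soft thresholding) and the ℓ0 penalty (hard thresholding) applied element-wise to a tensor; this is a minimal sketch under our own naming, and the other penalties in Table 1 admit analogous closed-form operators.

```python
import torch

def prox_l1(w: torch.Tensor, lam: float) -> torch.Tensor:
    # Soft thresholding: proximal operator of lam * ||.||_1.
    return torch.sign(w) * torch.clamp(w.abs() - lam, min=0.0)

def prox_l0(w: torch.Tensor, lam: float) -> torch.Tensor:
    # Hard thresholding: proximal operator of lam * ||.||_0,
    # which keeps an entry only if |w| > sqrt(2 * lam).
    return torch.where(w.abs() > (2.0 * lam) ** 0.5, w, torch.zeros_like(w))
```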

ALGORITHM 1. Algorithm for Nonconvex Sparse Group Lasso Regularization.

Incorporating the algorithm that solves the quadratic penalty problem Eq. 16, we now develop a general algorithm to solve Eq. 14. We solve a sequence of quadratic penalty problems Eq. 16 with $\beta \in \{\beta_j\}_{j=1}^{\infty}$, where $\beta_j \uparrow \infty$. This yields a sequence $\{(V^j, W^j)\}_{j=1}^{\infty}$ such that $W^j \to W^*$, a solution to Eq. 14. This algorithm is based on the quadratic penalty method [69] and the penalty decomposition method [56]. The algorithm is summarized in Algorithm 1.
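To make the overall scheme concrete, the following PyTorch-style sketch outlines Algorithm 1 under stated assumptions: mini-batch updates (here Adam, as in the MNIST experiments) for the W-step of Eq. 18, a generic proximal operator `prox` for the V-step of Eq. 19, and an outer loop that inflates β by a factor σ every few epochs. The helper names and the grouping by output channel are illustrative, not the authors' implementation.

```python
import torch

def train_nonconvex_sgl(model, loader, data_loss, prox, lam,
                        beta0=1.0, sigma=1.25, epochs=200,
                        epochs_per_beta=40, lr=1e-3):
    """Sketch of Algorithm 1: SGD/Adam steps on W (Eq. 18) alternated with
    proximal (thresholding) updates of the auxiliary variables V (Eq. 19),
    while the quadratic-penalty parameter beta is increased geometrically."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # One auxiliary tensor V per weight tensor W, initialized at W.
    V = [p.detach().clone() for p in model.parameters()]
    beta = beta0
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = data_loss(model(x), y)
            for param, v in zip(model.parameters(), V):
                # Smooth part of the regularization: group lasso (one group
                # per leading dimension) plus the quadratic coupling to V.
                groups = param.reshape(param.shape[0], -1)
                loss = loss + lam * (groups.shape[1] ** 0.5) * groups.norm(p=2, dim=1).sum()
                loss = loss + 0.5 * beta * (v - param).pow(2).sum()
            loss.backward()
            opt.step()                            # W-update (Eq. 18)
        with torch.no_grad():                     # V-update (Eq. 19): threshold W
            V = [prox(p, lam / beta) for p in model.parameters()]
        if (epoch + 1) % epochs_per_beta == 0:
            beta *= sigma                         # outer quadratic-penalty loop
    return model
```

Here `prox` would be one of the thresholding operators from Table 1, for example the `prox_l1` or `prox_l0` sketches given earlier.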

An alternative algorithm to solve Eq. 14 is proximal gradient descent [70]. By this method, the update for $W_l$, $l = 1, \ldots, L$, is

$$W_l^{k+1} = \operatorname{prox}_{\gamma\lambda r}\left\{W_l^k - \gamma\left[\nabla_{W_l}\tilde{\mathcal{L}}(W^k) + \lambda\,\partial_{W_l}\mathcal{R}_{GL}(W_l^k)\right]\right\}. \tag{20}$$

Using this algorithm results in weight parameters, some of which are already zeroed out during training.
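For comparison, a single proximal gradient step (Eq. 20) can be sketched as follows, where `grad_smooth` stands for a stochastic gradient of the smooth part (data loss plus group lasso) and `prox` is again a thresholding operator; the names are placeholders for illustration.

```python
import torch

def proximal_gradient_step(w: torch.Tensor, grad_smooth: torch.Tensor,
                           prox, gamma: float, lam: float) -> torch.Tensor:
    # One step of Eq. 20: gradient step on the smooth part, then the prox
    # of the element-wise penalty with parameter gamma * lam.
    return prox(w - gamma * grad_smooth, gamma * lam)
```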

However, the advantage of our proposed algorithm lies in Eq. 17a, written more specifically as

$$W_l^{k+1} = \operatorname*{arg\,min}_{W_l}\; \tilde{\mathcal{L}}(W) + \lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\|V_l^k - W_l\|_2^2 \tag{21}$$
$$= \operatorname*{arg\,min}_{W_l}\; \tilde{\mathcal{L}}(W) + \lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\sum_{i=1}^{\#W_l}\left(v_{l,i} - w_{l,i}\right)^2.$$

We see that this step performs exact weight decay, or ℓ2 regularization, on a weight $w_{l,i}$ whenever $v_{l,i} = 0$. On the other hand, when $v_{l,i} \neq 0$, the effect of ℓ2 regularization on the corresponding weight $w_{l,i}$ is mitigated according to the absolute difference $|v_{l,i} - w_{l,i}|$. Using ℓ2 regularization was shown by Han et al. [26] to give superior pruning results in terms of accuracy. Our proposed algorithm can therefore be perceived as an adaptive ℓ2 regularization method, where Eq. 17b identifies which weights to perform exact ℓ2 regularization on and Eq. 17a updates and regularizes the weights accordingly.

2.5 Convergence Analysis

To establish convergence for the proposed algorithm, the results below state that an accumulation point of the sequence generated by Eqs. 17a and 17b is a block-coordinate minimizer and that an accumulation point generated by Algorithm 1 is a sparse feasible solution to Eq. 15. Proofs are provided in Section 5. Unfortunately, the feasible solution generated may not be a local minimizer of Eq. 15 because the loss function $\mathcal{L}(\cdot,\cdot)$ is nonconvex. However, it was shown in Ref. 18 that a similar algorithm to Algorithm 1, but with β fixed in a bounded interval, generates an approximate global solution with high probability for a one-layer CNN with ReLU activation function.

Theorem 2. Let $\{(V^k, W^k)\}_{k=1}^{\infty}$ be a sequence generated by the alternating minimization algorithm in Eqs. 17a and 17b, where $r(\cdot)$ is ℓ0, ℓ1, transformed ℓ1, ℓ1 − αℓ2, or SCAD. If $(V^*, W^*)$ is an accumulation point of $\{(V^k, W^k)\}_{k=1}^{\infty}$, then $(V^*, W^*)$ is a block-coordinate minimizer of Eq. 16, that is,

$$V^* \in \operatorname*{arg\,min}_{V} F_\beta(V, W^*), \qquad W^* \in \operatorname*{arg\,min}_{W} F_\beta(V^*, W).$$

Theorem 3. Let $\{(V^k, W^k, \beta_k)\}_{k=1}^{\infty}$ be a sequence generated by Algorithm 1. Suppose that $\{F_{\beta_k}(V^k, W^k)\}_{k=1}^{\infty}$ is uniformly bounded. If $(V^*, W^*)$ is an accumulation point of $\{(V^k, W^k)\}_{k=1}^{\infty}$, then $(V^*, W^*)$ is a feasible solution to Eq. 15, that is, $V^* = W^*$.

Remark: To safely ensure that $\{F_{\beta_k}(V^k, W^k)\}_{k=1}^{\infty}$ is uniformly bounded in practice, we can find a feasible solution $(V^{\text{feas}}, W^{\text{feas}})$ to Eq. 15 and impose a bound $M$ such that

$$M \geq \max\left\{\tilde{\mathcal{L}}(W^{\text{feas}}) + \lambda\sum_{l=1}^{L}\mathcal{R}(W_l^{\text{feas}}),\; \min_{W} F_{\beta_0}(V^1, W)\right\}.$$

If $\min_{W} F_{\beta_{k+1}}(V^k, W) > M$, then we set $V^{k+1} = W^{\text{feas}}$. This strategy is based on Ref. 56. However, in our numerical experiments, we have not observed $F_{\beta_k}(V^k, W^k)$ diverging.

3 Numerical Experiments

3.1 Application to Deep Neural Networks

We compare the proposed nonconvex sparse group lasso against four baseline methods: group lasso, sparse group lasso (SGL1), CGES proposed in Ref. 92, and the group variant of ℓ0 regularization (denoted as ℓ0 for simplicity) proposed in Ref. 54. SGL1 is optimized using the same algorithm proposed for nonconvex sparse group lasso. For the group terms, the weights are grouped together based on the filters or output channels, which we will refer to as neurons. We trained various CNN architectures on MNIST [41] and CIFAR 10/100 [38]. The MNIST dataset consists of 60k training images and 10k test images. MNIST is trained on two simple CNN architectures: LeNet-5-Caffe [31, 41] and a 4-layer CNN with two convolutional layers (32 and 64 channels, respectively) and an intermediate layer of 1000 fully connected neurons. CIFAR 10/100 is a dataset with 10/100 classes split into 50k training images and 10k test images. It is trained on Resnets [28] and wide Resnets [95]. Throughout all of our experiments, for SGSCAD(a), we set a = 3.7 as suggested in Ref. 22; for SGTL1(a), we set a = 1.0 as suggested in Ref. 99; and for SGL1L2, we set α = 1.0 as suggested by the literature [50–52, 90]. For CGES, we have $\mu_l = l/L$. Because the optimization algorithms do not drive most, if not all, of the weights and neurons to zeroes, we set them to zeroes when their values are below a certain threshold. In our experiments, if the absolute value of a weight is below $10^{-5}$, we set it to zero. Weight sparsity is then defined as the percentage of zero weights with respect to the total number of weights trained in the network. If the normalized sum of the absolute values of the weights of a neuron is less than $10^{-5}$, then the weights of that neuron are set to zeroes. Neuron sparsity is defined as the percentage of neurons whose weights are all zeroes with respect to the total number of neurons in the network.
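As an example, the two sparsity metrics can be computed with the following hedged sketch, which assumes one group (neuron) per output channel and uses the $10^{-5}$ threshold described above:

```python
import torch

@torch.no_grad()
def sparsity_stats(weights, tol=1e-5):
    """weights: list of convolutional/fully connected weight tensors, one per layer.
    Returns (weight_sparsity, neuron_sparsity) as percentages."""
    zero_w, total_w, zero_n, total_n = 0, 0, 0, 0
    for w in weights:
        w = torch.where(w.abs() < tol, torch.zeros_like(w), w)  # prune small weights
        groups = w.reshape(w.shape[0], -1)                      # one group per output channel
        # A neuron is considered pruned if its normalized absolute weight sum is below tol.
        neuron_score = groups.abs().sum(dim=1) / groups.shape[1]
        zero_w += (w == 0).sum().item()
        total_w += w.numel()
        zero_n += (neuron_score < tol).sum().item()
        total_n += groups.shape[0]
    return 100.0 * zero_w / total_w, 100.0 * zero_n / total_n
```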

3.1.1 MNIST Classification

MNIST is trained on Lenet-5-Caffe, which has four layers with 1,370 total neurons and 431,080 total weight parameters. The same type of regularization is applied to all layers of the network. No other regularization methods (e.g., dropout and batch normalization) are used. The network is optimized using Adam [37] with initial learning rate 0.001. Every 40 epochs, the learning rate decays by a factor of 0.1. We set the regularization parameter to λ = α/60000 for $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. For SGL1 and nonconvex sparse group lasso, we set β = 25α/60000, and every 40 epochs, β increases by a factor of σ = 1.25. The network is trained for 200 epochs across 5 runs.

Table 2 reports the mean results for test error, weight sparsity, and neuron sparsity across five runs of Lenet-5-Caffe trained after 200 epochs. We see that although CGES has the lowest test errors at $\alpha \in \{0.1, 0.3, 0.4\}$ and the largest weight sparsity for all $\alpha \in \{0.1, 0.2, \ldots, 0.5\}$, the test errors and weight sparsity of nonconvex sparse group lasso are comparable. Additionally, the neuron sparsity of nonconvex sparse group lasso is nearly two times larger than the neuron sparsity attained by CGES. Across all parameters and methods, SGL0 with α = 0.5 attains the best average test error of 0.630 with average weight sparsity 95.7% and neuron sparsity 80.7%. Furthermore, its test error is lower than the test errors of the other nonconvex sparse group lasso regularization methods for all α's tested. Generally, SGL1 and nonconvex sparse group lasso outperform the ℓ0 regularization proposed by Louizos et al. [54] and group lasso in average weight and neuron sparsity.


TABLE 2. Average test error, weight sparsity, and neuron sparsity of Lenet-5 models trained on MNIST after 200 epochs across 5 runs. Standard deviations are in parentheses.

Table 3 reports the mean results for test error, weight sparsity, and neuron sparsity of the Lenet-5-Caffe models with the lowest test errors from the five runs. According to the results, the best test errors are attained by SGL0 at α = 0.3, 0.5; by SGL1L2 at α = 0.2; and by CGES at α = 0.1, 0.4. For average weight sparsity, SGL0 attains the largest values at $\alpha \in \{0.2, 0.3, 0.4, 0.5\}$. For average neuron sparsity, the largest values are attained by SGTL1 at α = 0.1, 0.2; by SGL1 at α = 0.3; and by SGL0 at α = 0.4, 0.5. Although SGL0 does not outperform all the other methods across the board, its results are still comparable to the best results. Overall, we see that nonconvex sparse group lasso outperforms ℓ0 in test error, weight sparsity, and neuron sparsity, and outperforms group lasso in weight and neuron sparsity.


TABLE 3. Average test error, weight sparsity, and neuron sparsity of Lenet-5 models trained on MNIST with lowest test errors across 5 runs. Standard deviations are in parentheses.

MNIST is also trained on a 4-layer CNN with two convolutional layers of 32 and 64 channels, respectively, and an intermediate layer with 1000 neurons. Each convolutional layer has 5×5 convolutional filters. The 4-layer CNN has 2,120 total neurons and 1,087,010 total weight parameters. The same type of regularization is applied to all layers of the network. The network is optimized with the same settings as Lenet-5-Caffe. However, the regularization parameter is different: we have λ = α/60000 for $\alpha \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$. For SGL1 and nonconvex sparse group lasso, we set β = 5α/60000, and every 40 epochs, β increases by a factor of σ = 1.25. The network is trained for 200 epochs across 5 runs.

Table 4 reports the mean results for test error, weight sparsity, and neuron sparsity across five runs of the 4-layer CNN models trained after 200 epochs. Although CGES consistently has the highest weight sparsity, it does not yield the most accurate models until α ≥ 0.8. Moreover, its neuron sparsity is smaller than the neuron sparsity of group lasso, SGL1, and nonconvex group lasso when α ≤ 0.6. ℓ0 has the highest neuron sparsity for all α's given, but its test errors are much greater. When α ≤ 0.6, SGSCAD yields the most accurate models at α = 0.2, 0.6, while SGL1 yields the most accurate model at α = 0.4. Overall, we see that nonconvex group lasso has weight sparsity and neuron sparsity comparable to group lasso and SGL1.


TABLE 4. Average test error, weight sparsity, and neuron sparsity of 4-layer CNN models trained on MNIST after 200 epochs across 5 runs. Standard deviations are in parentheses.

Table 5 reports the mean results for test error, weight sparsity, and neuron sparsity of the 4-layer CNN models with the lowest test errors from the five runs. At α = 0.2, SGL1 and SGSCAD have the lowest test errors, but their weight sparsity is exceeded by CGES and their neuron sparsity by ℓ0. At α = 0.4, SGL1L2 has the lowest test error, but its weight sparsity and neuron sparsity are exceeded by CGES and ℓ0, respectively. At α = 0.6, SGL1 has the lowest test error, but SGSCAD has the largest weight sparsity with comparable test error. At α ≥ 0.8, CGES has the lowest test error, but its weight sparsity is exceeded by group lasso, SGL1, and the nonconvex group lasso regularizers, which all have slightly higher test error. At α = 0.8, the neuron sparsity of CGES is comparable to that of group lasso, SGL1, and the nonconvex group lasso regularizers. At α = 1.0, group lasso has the highest neuron sparsity, with nonconvex group lasso only slightly lower. In general, the weight sparsity of nonconvex group lasso is comparable to or larger than that of group lasso and SGL1.


TABLE 5. Average test error, weight sparsity, and neuron sparsity of 4-layer CNN models trained on MNIST with lowest test errors across 5 runs. Standard deviations are in parentheses.

3.1.2 CIFAR Classification

CIFAR 10/100 is trained on Resnet-40 and on a wide Resnet with depth 28 and width 10 (WRN-28-10). Resnet-40 has approximately 570,000 weight parameters and 1,520 neurons, while WRN-28-10 has approximately 36,500,000 weight parameters and 10,736 neurons. The networks are optimized using stochastic gradient descent with initial learning rate 0.1. After every 60 epochs, the learning rate decays by a factor of 0.2. The same type of regularization is applied to the weights of the hidden layer of each residual block, where dropout is utilized. We vary the regularization parameter λ = α/50000. For Resnet-40, we have $\alpha \in \{1.0, 1.5, 2.0, 2.5, 3.0\}$ for CIFAR 10 and $\alpha \in \{2.0, 2.5, 3.0, 3.5, 4.0\}$ for CIFAR 100. For SGL1 and nonconvex sparse group lasso, we set β = 15α/50000 for Resnet-40 and β = 25α/50000 for WRN-28-10. Every 20 epochs, β increases by a factor of σ = 1.25. The networks are trained for 200 epochs across 5 runs. We excluded the ℓ0 regularization of Louizos et al. [54] because it was unstable for the provided α's. Furthermore, we only analyze the models with the lowest test errors, since the test errors did not stabilize by the end of the 200 epochs in our experiments.

Table 6 reports the mean test error, weight sparsity, and neuron sparsity across the Resnet-40 models trained on CIFAR 10 with the lowest test errors from the five runs. Group lasso has the lowest test errors for all α's provided, while CGES, SGL1, and nonconvex sparse group lasso are higher by at most 1.1%. When α ≤ 1.5, CGES has the largest weight sparsity, while SGSCAD, SGTL1, and SGL1L2 have larger weight sparsity than does group lasso. At α = 2.0, 2.5, SGSCAD has the largest weight sparsity. At α = 3.0, SGL1 has the largest weight sparsity with test error comparable to the nonconvex group lasso regularizers. For neuron sparsity, SGL1L2 has the largest values at α = 1.0, while SGSCAD has the largest at α = 1.5, 2.0. However, at α = 2.5, 3.0, group lasso has the largest neuron sparsity. For all α's tested, SGSCAD has higher weight sparsity and neuron sparsity than SGL1 with comparable test error.


TABLE 6. Average test error, weight sparsity, and neuron sparsity of Resnet-40 models trained on CIFAR 10 with lowest test errors across 5 runs. Standard deviations are in parentheses.

Table 7 reports the mean test error, weight sparsity, and neuron sparsity across the Resnet-40 models trained on CIFAR 100 with the lowest test errors from the five runs. Group lasso has the lowest test errors for α ≤ 3.5, while CGES has the lowest test error at α = 4.0. However, the weight sparsity and the neuron sparsity of group lasso are lower than those of SGL1 and some of the nonconvex sparse group lasso regularizers. CGES has the lowest neuron sparsity across all α's. Among the nonconvex group lasso penalties, SGSCAD has the best test errors, which are lower than the test errors of SGL1 for all α's except 2.5.


TABLE 7. Average test error, weight sparsity, and neuron sparsity of Resnet-40 models trained on CIFAR 100 with lowest test errors across 5 runs. Standard deviations are in parentheses.

Table 8 reports the mean test error, weight sparsity, and neuron sparsity across the WRN-28-10 models trained on CIFAR 10 with the lowest test errors from the five runs. The best test errors are attained by SGTL1 at α = 0.05, 0.2, 0.5; by CGES at α = 0.01; and by SGL1 at α = 0.1. The weight sparsity of CGES exceeds the other methods only when α = 0.01, 0.05, 0.1, but it underperforms when α ≥ 0.2. Weight sparsity levels of group lasso and nonconvex group lasso are comparable across all α. For neuron sparsity, SGL1L2 attains the largest values at α = 0.02, 0.1, 0.2. Nevertheless, the other nonconvex sparse group lasso methods have comparable neuron sparsity. Overall, SGL1, SGL0, SGSCAD, and SGTL1 outperform group lasso in test error while having similar or higher weight and neuron sparsity.


TABLE 8. Average test error, weight sparsity, and neuron sparsity of WRN-28-10 models trained on CIFAR 10 with lowest test errors across 5 runs. Standard deviations are in parentheses.

Table 9 reports the mean test error, weight sparsity, and neuron sparsity across the WRN-28-10 models trained on CIFAR 100 with the lowest test errors from the five runs. According to the results, the best test errors are attained by CGES when α = 0.01, 0.05; by SGSCAD when α = 0.1, 0.5; and by SGTL1 when α = 0.2. Although CGES has the largest weight sparsity for α = 0.01, 0.05, 0.1, 0.2, we see that its test error increases as α increases. When α = 0.5, the best weight sparsity is attained by SGSCAD, but the other methods have comparable weight sparsity. The best neuron sparsity is attained by CGES at α = 0.01, 0.02; by SGL1L2 at α = 0.1, 0.2; and by SGSCAD at α = 0.5. The neuron sparsity among the nonconvex sparse group lasso methods is comparable. For α ≥ 0.2, we see that SGL1 and nonconvex sparse group lasso outperform group lasso in test error while having comparable weight and neuron sparsity.


TABLE 9. Average test error, weight sparsity, and neuron sparsity of WRN-28-10 models trained on CIFAR 100 with lowest test errors across 5 runs. Standard deviations are in parentheses.

3.2 Algorithm Comparison

We compare the proposed Algorithm 1 with direct stochastic gradient descent, where the gradient of the regularizer is approximated by backpropagation, and proximal gradient descent, discussed in Section 2.4, by applying them to SGL1 on Lenet-5 trained on MNIST. The parameter setting for this CNN is discussed in Section 3.1.1. Table 10 reports the mean results for test error, weight sparsity, and neuron sparsity across five models trained after 200 epochs, while Figure 2 provides visualizations. Table 11 and Figure 3 record mean statistics for the models with the lowest test errors from the five runs. According to the results, proximal stochastic gradient descent attains the highest levels of weight sparsity and neuron sparsity both for models trained after 200 epochs and for models with the lowest test error. However, its test errors are the highest among the three algorithms. On the other hand, our proposed algorithm attains the lowest test errors. For models trained after 200 epochs, the weight sparsity and neuron sparsity attained by Algorithm 1 are comparable to the sparsity attained by direct stochastic gradient descent. For models with the lowest test errors from their respective runs, the weight sparsity and neuron sparsity of the proposed algorithm are better than the sparsity of direct stochastic gradient descent. Therefore, our proposed algorithm generates the most accurate model with satisfactory sparsity among the three algorithms for sparse regularization.


TABLE 10. Average test error, weight sparsity, and neuron sparsity of SGL1-regularized Lenet-5 models trained on MNIST after 200 epochs across 5 runs.


FIGURE 2. Mean results of the algorithms applied to SGL1 for Lenet-5 models trained on MNIST for 200 epochs across 5 runs, varying the regularization parameter λ = α/60000 with $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. (A) Mean test error. (B) Mean weight sparsity. (C) Mean neuron sparsity.


TABLE 11. Average test error, weight sparsity, and neuron sparsity of SGL1-regularized Lenet-5 models trained on MNIST with lowest test errors across 5 runs.


FIGURE 3. Mean results of the algorithms applied to SGL1 for Lenet-5 models trained on MNIST with lowest test errors across 5 runs, varying the regularization parameter λ = α/60000 with $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. (A) Mean test error. (B) Mean weight sparsity. (C) Mean neuron sparsity.

4 Conclusion and Future Work

In this work, we propose nonconvex sparse group lasso, a nonconvex extension of sparse group lasso. The ℓ1 norm on the weight parameters in sparse group lasso is replaced with a nonconvex regularizer whose proximal operator is a thresholding function. Taking advantage of this property, we develop a new algorithm to optimize loss functions regularized with nonconvex sparse group lasso for CNNs in order to attain a sparse network with competitive accuracy. We compare the proposed family of regularizers with various baseline methods on MNIST and CIFAR 10/100 on different CNNs. The experimental results demonstrate that, in general, nonconvex sparse group lasso generates a more accurate and/or more compressed CNN than does group lasso. In addition, we compare our proposed algorithm to direct stochastic gradient descent and proximal gradient descent on Lenet-5 trained on MNIST. The results show that the proposed algorithm applied to SGL1 yields a satisfactorily sparse network with lower test error than the other two algorithms.

According to the numerical results, there is no single sparse regularizer that outperforms all others on any CNN trained on a given dataset. One regularizer may perform well in one case but worse in another. Due to the myriad of sparse regularizers to select from and the various parameters to tune, especially for one CNN trained on a given dataset, one direction is to develop an automatic machine learning framework that efficiently selects the right regularizer and parameters. In recent works, automatic machine learning has been represented as a matrix completion problem [88] and as a statistical learning problem [24]. These frameworks can be adapted to select the best sparse regularizer, thus saving time for users who are training sparse CNNs.

5 Proofs

We provide proofs for the results discussed in Section 2.5.

5.1 Proof of Theorem 2

By Eqs. 17a and 17b, for each k, we have

$$F_\beta(V^k, W^{k+1}) \leq F_\beta(V^k, W) \tag{22}$$

for all W, and

$$F_\beta(V^{k+1}, W^{k+1}) \leq F_\beta(V, W^{k+1}) \tag{23}$$

for all V. By Eq. 23, we have

$$F_\beta(V^+, W^+) \leq F_\beta(V^k, W^+) \tag{24}$$

for each k. Altogether, we have

$$F_\beta(V^+, W^+) \leq F_\beta(V^k, W^k) \tag{25}$$

for each k, so $\{F_\beta(V^k, W^k)\}_{k=1}^{\infty}$ is nonincreasing. Since $F_\beta(V^k, W^k) \geq 0$ for all k, the limit $\lim_{k \to \infty} F_\beta(V^k, W^k)$ exists. From Eqs. 22–24, we have

$$F_\beta(V^+, W^+) \leq F_\beta(V^k, W^+) \leq F_\beta(V^k, W^k).$$

Taking the limit gives us

$$\lim_{k \to \infty} F_\beta(V^k, W^+) = \lim_{k \to \infty} F_\beta(V^k, W^k). \tag{26}$$

Since $(V^*, W^*)$ is an accumulation point of $\{(V^k, W^k)\}_{k=1}^{\infty}$, there exists a subsequence $K$ such that

$$\lim_{k \in K} (V^k, W^k) = (V^*, W^*). \tag{27}$$

Because $r(\cdot)$ is lower semicontinuous and $\lim_{k \in K} V^k = V^*$, there exists $k' \in K$ such that $k \geq k'$ implies $r(V_l^k) \geq r(V_l^*)$ for each $l = 1, \ldots, L$. Using this result along with Eq. 23, we obtain

$$F_\beta(V, W^k) \geq F_\beta(V^k, W^k) = \tilde{\mathcal{L}}(W^k) + \sum_{l=1}^{L}\left[\lambda\left(\mathcal{R}_{GL}(W_l^k) + r(V_l^k)\right) + \frac{\beta}{2}\|V_l^k - W_l^k\|_2^2\right] \geq \tilde{\mathcal{L}}(W^k) + \sum_{l=1}^{L}\left[\lambda\left(\mathcal{R}_{GL}(W_l^k) + r(V_l^*)\right) + \frac{\beta}{2}\|V_l^k - W_l^k\|_2^2\right]$$

for $k \geq k'$. Letting $k \to \infty$ with $k \in K$, we have

$$F_\beta(V, W^*) \geq \tilde{\mathcal{L}}(W^*) + \sum_{l=1}^{L}\left[\lambda\left(\mathcal{R}_{GL}(W_l^*) + r(V_l^*)\right) + \frac{\beta}{2}\|V_l^* - W_l^*\|_2^2\right] = F_\beta(V^*, W^*) \tag{28}$$

by continuity, so it follows that $V^* \in \operatorname*{arg\,min}_{V} F_\beta(V, W^*)$.

For notational convenience, let

$$\tilde{\mathcal{R}}_{\lambda,\beta}(V, W) := \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\|V_l - W_l\|_2^2\right]. \tag{29}$$

By Eq. 22, we have

$$\tilde{\mathcal{L}}(W) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W) = F_\beta(V^k, W) - \lambda\sum_{l=1}^{L} r(V_l^k) \geq F_\beta(V^k, W^+) - \lambda\sum_{l=1}^{L} r(V_l^k) = \tilde{\mathcal{L}}(W^+) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^+). \tag{30}$$

Because $\lim_{k \in K} V^k$ exists, the sequence $\{V^k\}_{k \in K}$ is bounded. If $r(\cdot)$ is ℓ0, transformed ℓ1, or SCAD, then $\{r(V^k)\}_{k \in K}$ is bounded. If $r(\cdot)$ is ℓ1, then $r(\cdot)$ is coercive and continuous, and thus bounded on the bounded sequence. If $r(\cdot)$ is ℓ1 − αℓ2, then $r(\cdot)$ is bounded above by the ℓ1 norm. Overall, it follows that $\{r(V^k)\}_{k \in K}$ is bounded in every case. Hence, there exists a further subsequence $\bar{K} \subset K$ such that $\lim_{k \in \bar{K}} r(V^k)$ exists. So, we obtain

$$\begin{aligned}
\lim_{k \in \bar{K}} \left[\tilde{\mathcal{L}}(W^+) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^+)\right] &= \lim_{k \in \bar{K}} \left[F_\beta(V^k, W^+) - \lambda\sum_{l=1}^{L} r(V_l^k)\right]\\
&= \lim_{k \in \bar{K}} F_\beta(V^k, W^+) - \lim_{k \in \bar{K}} \lambda\sum_{l=1}^{L} r(V_l^k)\\
&= \lim_{k \in \bar{K}} F_\beta(V^k, W^k) - \lim_{k \in \bar{K}} \lambda\sum_{l=1}^{L} r(V_l^k)\\
&= \lim_{k \in \bar{K}} \left[F_\beta(V^k, W^k) - \lambda\sum_{l=1}^{L} r(V_l^k)\right]\\
&= \lim_{k \in \bar{K}} \left[\tilde{\mathcal{L}}(W^k) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^k)\right]\\
&= \tilde{\mathcal{L}}(W^*) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W^*), \tag{31}
\end{aligned}$$

where Eq. 26 is applied in the third equality and continuity is used in the last equality.

Taking the limit over the subsequence $\bar{K}$ in Eq. 30 and applying Eq. 31, we obtain

$$\tilde{\mathcal{L}}(W) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W) \geq \tilde{\mathcal{L}}(W^*) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W^*) \tag{32}$$

by continuity. Adding $\lambda\sum_{l=1}^{L} r(V_l^*)$ to both sides yields

$$F_\beta(V^*, W) \geq F_\beta(V^*, W^*), \tag{33}$$

from which it follows that $W^* \in \operatorname*{arg\,min}_{W} F_\beta(V^*, W)$. This completes the proof.

5.2 Proof of Theorem 3

Because $(V^*, W^*)$ is an accumulation point, there exists a subsequence $K$ such that $\lim_{k \in K} (V^k, W^k) = (V^*, W^*)$. If $\{F_{\beta_k}(V^k, W^k)\}_{k=1}^{\infty}$ is uniformly bounded, there exists $M$ such that $F_{\beta_k}(V^k, W^k) \leq M$ for all k. Then we have

$$M \geq F_{\beta_k}(V^k, W^k) = \tilde{\mathcal{L}}(W^k) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l^k) + \lambda r(V_l^k) + \frac{\beta_k}{2}\|V_l^k - W_l^k\|_2^2\right] \geq \frac{\beta_k}{2}\sum_{l=1}^{L}\|V_l^k - W_l^k\|_2^2.$$

As a result,

$$\sum_{l=1}^{L}\|V_l^k - W_l^k\|_2^2 \leq \frac{2M}{\beta_k}. \tag{34}$$

Taking the limit over $k \in K$ and using $\beta_k \to \infty$, we have

$$\sum_{l=1}^{L}\|V_l^* - W_l^*\|_2^2 = 0,$$

from which it follows that $V^* = W^*$. As a result, $(V^*, W^*)$ is a feasible solution to Eq. 15.

Data Availability Statement

The datasets MNIST and CIFAR 10/100 for this study are available through the Pytorch package in Python. Codes for the numerical experiments in Section 3 are available at https://github.com/kbui1993/Official_Nonconvex_SGL.

Author Contributions

KB and FP performed the experiments and analysis. All authors contributed to the design, evaluation, discussions and production of the manuscript.

Funding

The work was partially supported by NSF grants IIS-1632935, DMS-1854434, DMS-1924548, DMS-1952644 and the Qualcomm Faculty Award.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors would like to thank Thu Dinh for helpful conversations. They also thank Christos Louizos for answering the questions they had regarding his work in [54]. Lastly, the authors thank AWS Cloud Credits for Research and Google Cloud Platform (GCP) for providing cloud-based computational resources for this work.

References

1. Aghasi, A, Abdi, A, Nguyen, N, and Romberg, J. Net-trim: convex pruning of deep neural networks with performance guarantee. In: Advances in Neural Information Processing Systems; 2017 Nov 23; Long Beach, CA. Pasadena, CA: NeurIPS (2017). p. 3177–86. doi:10.5555/3294996.3295077


2. Aghasi, A, Abdi, A, and Romberg, J. Fast convex pruning of deep neural networks. SIAM J Math Data Sci (2020). 2:158–188. doi:10.1137/19m1246468


3. Ahn, M, Pang, J-S, and Xin, J. Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM J Optim. (2017). 27:1637–1665. doi:10.1137/16m1084754


4. Alvarez, JM, and Salzmann, M. Learning the number of neurons in deep networks In: Advances in Neural Information Processing Systems; 2018 Oct 11; Barcelona, Spain. Pasadena, CA: NeurIPS (2016). p. 2270–8.


5. Antoniadis, A, and Fan, J. Regularization of wavelet approximations. J Am Stat Assoc. (2001). 96:939–67. doi:10.1198/016214501753208942


6. Ba, J, and Caruana, R. Do deep nets really need to be deep? Adv Neural Inf Process Syst. (2014). 2:2654–62. doi:10.5555/2969033.2969123


7. Bach, FR. Consistency of the group lasso and multiple kernel learning. J Mach Learn Res. (2008). 9:1179–225. doi:10.5555/1390681.1390721


8. Bao, C, Dong, B, Hou, L, Shen, Z, Zhang, X, and Zhang, X. Image restoration by minimizing zero norm of wavelet frame coefficients. Inverse Problems. (2016). 32:115004. doi:10.1088/0266-5611/32/11/115004


9. Breheny, P, and Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat. (2011). 5:232. doi:10.1214/10-aoas388


10. Candès, EJ, Li, X, Ma, Y, and Wright, J. Robust principal component analysis? J ACM. (2011). 58:1–37. doi:10.1145/1970392.1970395


11. Candès, EJ, Romberg, JK, and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. (2006). 59:1207–23. doi:10.1002/cpa.20124


12. Chan, RH, Chan, TF, Shen, L, and Shen, Z. Wavelet algorithms for high-resolution image reconstruction. SIAM J Sci Comput. (2003). 24:1408–32. doi:10.1137/s1064827500383123


13. Chen, LC, Papandreou, G, Kokkinos, I, Murphy, K, and Yuille, AL (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell. 40, 834–848doi:10.1109/TPAMI.2017.2699184


14. Cheng, Y, Wang, D, Zhou, P, and Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv [Preprint] (2017). Available from: https://arxiv.org/abs/1710.09282.


15. Cheng, Y, Wang, D, Zhou, P, and Zhang, T. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Process Mag. (2018). 35:126–36. doi:10.1109/msp.2017.2765695


16. Cohen, A, Dahmen, W, and DeVore, R. Compressed sensing and best k-term approximation. J Am Math Soc. (2009). 22:211–31. doi:10.1090/S0894-0347-08-00610-3


17. Denton, EL, Zaremba, W, Bruna, J, LeCun, Y, and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. Adv Neural Inf Process Syst. (2014). 1:1269–77. doi:10.5555/2968826.2968968


18. Dinh, T, and Xin, J. Convergence of a relaxed variable splitting method for learning sparse neural networks via ℓ1,ℓ0, and transformed-ℓ1 penalties. In: Proceedings of SAI Intelligent Systems Conference. Springer International Publishing (2020). p. 360–374.


19. Dong, B, and Zhang, Y. An efficient algorithm for ℓ0 minimization in wavelet frame based image restoration. J Sci Comput. (2013). 54:350–68. doi:10.1007/s10915-012-9597-4


20. Donoho, DL, and Elad, M. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc Natl Acad Sci USA. (2003). 100:2197–202. doi:10.1073/pnas.0437847100


21. Esser, E, Lou, Y, and Xin, J. A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM J Imag Sci (2013). 6:2010–46. doi:10.1137/13090540x


22. Fan, J, and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. (2001). 96:1348–60. doi:10.1198/016214501753382273


23. Foucart, S, and Rauhut, H. An invitation to compressive sensing. A mathematical introduction to compressive sensing. New York, NY: Birkhäuser (2013). p. 1–39.


24. Gupta, R, and Roughgarden, T. A pac approach to application-specific algorithm selection. SIAM J Comput. (2017). 46:992–1017. doi:10.1137/15m1050276


25. Han, S, Mao, H, and Dally, WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding (2015). Available from: https://arxiv.org/abs/1510.00149.


26. Han, S, Pool, J, Tran, J, and Dally, W. Learning both weights and connections for efficient neural network. Adv Neural Inf Process Syst. (2015). 1:1135–43. doi:10.5555/2969239.2969366


27. Hastie, T, Tibshirani, R, and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. New York, NY: Springer Science & Business Media (2009). 745 p.


28. He, K, Zhang, X, Ren, S, and Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016 Jun 27–30; Las Vegas, NV. New York, NY: IEEE (2016). p 770–8.


29. Hu, H, Peng, R, Tai, Y-W, and Tang, C-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures (2016). Available from: https://arxiv.org/abs/1607.03250.


30. Huang, J, Rathod, V, Sun, C, Zhu, M, Korattikara, A, Fathi, A, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017 Jul 21–26; Honolulu, HI. New York, NY: IEEE (2017). p. 7310–1.


31. Jia, Y, Shelhamer, E, Donahue, J, Karayev, S, Long, J, Girshick, R, et al. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia (ACM); 2014 Jul 20; Berkeley, CA. Berkeley, CA: UC Berkeley EECS (2014). p. 675–8.


32. Jin, X, Yuan, X, Feng, J, and Yan, S. Training skinny deep neural networks with iterative hard thresholding methods (2016). Available from: https://arxiv.org/abs/1607.05423.


33. Jung, H, Ye, JC, and Kim, EY. Improvedk-tBLAST and k-t SENSE using FOCUSS. Phys Med Biol. (2007). 52:3201. doi:10.1088/0031-9155/52/11/018


34. Jung, M. Piecewise-Smooth image Segmentation models with L1 data-fidelity Terms. J Sci Comput. (2017). 70:1229–61. doi:10.1007/s10915-016-0280-z


35. Jung, M, Kang, M, and Kang, M. Variational image segmentation models involving non-smooth data-fidelity terms. J Sci Comput. (2014). 59:277–308. doi:10.1007/s10915-013-9766-0


36. Kim, C, and Klabjan, D. A simple and fast algorithm for L1-norm kernel PCA. IEEE Trans Patt Anal Mach Intell. (2019). 42:1842–55. doi:10.1109/TPAMI.2019.2903505


37. Kingma, DP, and Ba, J. Adam: a method for stochastic optimization. (2014). Available from: https://arxiv.org/abs/1412.6980.


38. Krizhevsky, A, and Hinton, G. Learning multiple layers of features from tiny images (2009). p. 60. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.9220&rep=rep1&type=pdf.


39. Krizhevsky, A, Sutskever, I, and Hinton, GE. Imagenet classification with deep convolutional neural networks. Commun ACM. (2012). 60:1097–105. doi:10.1145/3065386


40. Krogh, A, and Hertz, JA. A simple weight decay can improve generalization. Adv Neural Inf Process Syst. (1992). 4:950–957. doi:10.5555/2986916.2987033


41. LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P. Gradient-based learning applied to document recognition. Proc IEEE. (1998). 86:2278–324. doi:10.1109/5.726791


42. Li, F, Osher, S, Qin, J, and Yan, M. A multiphase image segmentation based on fuzzy membership functions and l1-norm fidelity. J Sci Comput. (2016). 69:82–106. doi:10.1007/s10915-016-0183-z


43. Li, H, Kadav, A, Durdanovic, I, Samet, H, and Graf, HP. Pruning filters for efficient convnets (2016). Available from: https://arxiv.org/abs/1608.08710.


44. Li, P, Chen, W, Ge, H, and Ng, MK. ℓ1−αℓ2 minimization methods for signal and image reconstruction with impulsive noise removal. Inv Problems. (2020). 36:055009. doi:10.1088/1361-6420/ab750c


45. Li, Z, Luo, X, Wang, B, Bertozzi, AL, and Xin, J. A study on graph-structured recurrent neural networks and sparsification with application to epidemic forecasting. World congress on global optimization. Cham, Switzerland: Springer (2019). p. 730–9.


46. Lim, M, Ales, JM, Cottereau, BR, Hastie, T, and Norcia, AM. Sparse EEG/MEG source estimation via a group lasso. PloS One. (2017). 12:e0176835. doi:10.1371/journal.pone.0176835


47. Lin, D, Calhoun, VD, and Wang, Y-P. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Med Image Anal. (2014). 18:891–902. doi:10.1016/j.media.2013.10.010


48. Lin, D, Zhang, J, Li, J, Calhoun, VD, Deng, H-W, and Wang, Y-P. Group sparse canonical correlation analysis for genomic data integration. BMC bioinf. (2013). 14:1–16. doi:10.1186/1471-2105-14-245


49. Long, J, Shelhamer, E, and Darrell, T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015 Jun 7–15; Boston, MA. New York, NY: IEEE (2015). p. 3431–40.


50. Lou, Y, Osher, S, and Xin, J. Computational aspects of constrained minimization for compressive sensing. Modelling, computation and optimization in information systems and management sciences. Cham, Switzerland: Springer (2015). p. 169–80.


51. Lou, Y, and Yan, M. Fast L1-L2 minimization via a proximal operator. J Sci Comput. (2018). 74:767–85. doi:10.1007/s10915-017-0463-2


52. Lou, Y, Yin, P, He, Q, and Xin, J. Computing sparse representation in a highly coherent dictionary based on difference of L1 and L2. J Sci Comput. (2015). 64:178–96. doi:10.1007/s10915-014-9930-1


53. Lou, Y, Zeng, T, Osher, S, and Xin, J. A weighted difference of anisotropic and isotropic total variation model for image processing. SIAM J Imag Sci. (2015). 8:1798–823. doi:10.1137/14098435x


54. Louizos, C, Welling, M, and Kingma, DP. Learning sparse neural networks through regularization (2017). Available from: https://arxiv.org/abs/1712.01312.


55. Lu, J, Qiao, K, Li, X, Lu, Z, and Zou, Y. ℓ0-minimization methods for image restoration problems based on wavelet frames. Inverse Probl. (2019). 35:064001. doi:10.1088/1361-6420/ab08de


56. Lu, Z, and Zhang, Y. Sparse approximation via penalty decomposition methods. SIAM J Optim. (2013). 23:2448–78. doi:10.1137/100808071


57. Lustig, M, Donoho, D, and Pauly, JM. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Reson Med. (2007). 58:1182–95. doi:10.1002/mrm.21391


58. Lv, J, and Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Stat. (2009). 37:3498–528. doi:10.1214/09-aos683


59. Lyu, J, Zhang, S, Qi, Y, and Xin, J. Autoshufflenet: learning permutation matrices via an exact Lipschitz continuous penalty in deep convolutional neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. New York, NY: Association for Computing Machinery (2020). p. 608–16.


60. Ma, N, Zhang, X, Zheng, H-T, and Sun, J. “Shufflenet v2: practical guidelines for efficient CNN architecture design”. Computer Vision – ECCV 2018. Cham: Springer International Publishing (2018). p. 122–38.

61. Ma, R, Miao, J, Niu, L, and Zhang, P. Transformed ℓ1 regularization for learning sparse deep neural networks. Neural Netw. (2019). 119:286–98. doi:10.1016/j.neunet.2019.08.01

62. Ma, S, Song, X, and Huang, J. Supervised group lasso with applications to microarray data analysis. BMC Bioinf. (2007). 8:60. doi:10.1186/1471-2105-8-60

63. Ma, T-H, Lou, Y, Huang, T-Z, and Zhao, X-L. Group-based truncated model for image inpainting. In: 2017 IEEE international conference on image processing (ICIP); Beijing, China. New York, NY: IEEE (2017). p. 2079–83.

64. Mehranian, A, Rad, HS, Rahmim, A, Ay, MR, and Zaidi, H. Smoothly clipped absolute deviation (SCAD) regularization for compressed sensing MRI using an augmented Lagrangian scheme. Magn Reson Imag. (2013). 31:1399–411. doi:10.1016/j.mri.2013.05.010

65. Meier, L, Van De Geer, S, and Bühlmann, P. The group lasso for logistic regression. J Roy Stat Soc B. (2008). 70:53–71. doi:10.1111/j.1467-9868.2007.00627.x

66. Molchanov, D, Ashukha, A, and Vetrov, D. Variational dropout sparsifies deep neural networks. In: Proceedings of the 34th international conference on machine learning; Sydney, Australia. JMLR (2017). p. 2498–507.

67. Nie, F, Wang, H, Huang, H, and Ding, C. Unsupervised and semi-supervised learning via ℓ1-norm graph. In: 2011 international conference on computer vision (ICCV). New York, NY: IEEE (2011). p. 2268–73.

68. Nikolova, M. Local strong homogeneity of a regularized estimator. SIAM J Appl Math. (2000). 61:633–58. doi:10.1137/s0036139997327794

69. Nocedal, J, and Wright, S. Numerical optimization. New York, NY: Springer Science & Business Media (2006). 651 p.

70. Parikh, N, and Boyd, S. Proximal algorithms. FNT Optimization. (2014). 1:127–239. doi:10.1561/2400000003

71. Park, F, Lou, Y, and Xin, J. A weighted difference of anisotropic and isotropic total variation for relaxed Mumford-Shah image segmentation. In: 2016 IEEE international conference on image processing (ICIP); 2016 Sep 25–28; Phoenix, AZ. New York, NY: IEEE (2016). p. 4314.

72. Parkhi, OM, Vedaldi, A, and Zisserman, A. Deep face recognition. In: Proceedings of the british machine vision conference. Cambridge, UK: BMVA Press (2015). p. 41.1–41.12.

73. Ren, S, He, K, Girshick, R, and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst (2015). 39:91–99. doi:10.1109/TPAMI.2016.2577031

74. Santosa, F, and Symes, WW. Linear inversion of band-limited reflection seismograms. SIAM J Sci Stat Comput. (1986). 7:1307–30. doi:10.1137/0907087

75. Scardapane, S, Comminiello, D, Hussain, A, and Uncini, A. Group sparse regularization for deep neural networks. Neurocomputing. (2017). 241:81–9. doi:10.1016/j.neucom.2017.02.029

76. Simon, N, Friedman, J, Hastie, T, and Tibshirani, R. A sparse-group lasso. J Comput Graph Stat. (2013). 22:231–45. doi:10.1080/10618600.2012.681250

77. Simonyan, K, and Zisserman, A. Very deep convolutional networks for large-scale image recognition (2015). Available from: https://arxiv.org/abs/1409.1556.

78. Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B. (1996). 58:267–88. doi:10.1111/j.2517-6161.1996.tb02080.x

79. Tran, H, and Webster, C. A class of null space conditions for sparse recovery via nonconvex, non-separable minimizations. Res Appl Math. (2019). 3:100011. doi:10.1016/j.rinam.2019.100011

80. Trzasko, J, Manduca, A, and Borisch, E. Sparse MRI reconstruction via multiscale L0-continuation. In: 2007 IEEE/SP 14th workshop on statistical signal processing. New York, NY: IEEE (2007). p. 176–80.

81. Ullrich, K, Meeds, E, and Welling, M. Soft weight-sharing for neural network compression (2017). Available from: https://arxiv.org/abs/1702.04008.

82. Vershynin, R. High-dimensional probability: An introduction with applications in data science. Cambridge, UK: Cambridge University Press (2018). 296 p.

83. Vincent, M, and Hansen, NR. Sparse group lasso and high dimensional multinomial classification. Comput Stat Data Anal. (2014). 71:771–86. doi:10.1016/j.csda.2013.06.004

84. Wang, L, Chen, G, and Li, H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. (2007). 23:1486–94. doi:10.1093/bioinformatics/btm125

85. Wen, F, Chu, L, Liu, P, and Qiu, RC. A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access. (2018). 6:69883–906. doi:10.1109/access.2018.2880454

86. Wen, W, Wu, C, Wang, Y, Chen, Y, and Li, H. Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems; 2016 Dec 5–10; Barcelona, Spain. Red Hook, NY: Curran Associates Inc. (2016). p. 2074–82.

87. Xue, F, and Xin, J. Learning sparse neural networks via ℓ0 and Tℓ1 by a relaxed variable splitting method with application to multi-scale curve classification. World congress on global optimization. Cham, Switzerland: Springer (2019). p. 800–9.

88. Yang, C, Akimoto, Y, Kim, DW, and Udell, M. Oboe: Collaborative filtering for AutoML model selection. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. New York, NY: ACM (2019). p. 1173–83.

89. Ye, Q, Zhao, H, Li, Z, Yang, X, Gao, S, Yin, T, et al. L1-norm distance minimization-based fast robust twin support vector κ-plane clustering. IEEE Trans Neural Netw Learn Syst. (2018). 29:4494–503. doi:10.1109/TNNLS.2017.2749428

90. Yin, P, Lou, Y, He, Q, and Xin, J. Minimization of ℓ1-2 for Compressed Sensing. SIAM J Sci Comput. (2015). 37:A536–63. doi:10.1137/140952363

91. Yin, P, Sun, Z, Jin, W-L, and Xin, J. ℓ1-minimization method for link flow correction. Transp Res Part B Methodol. (2017). 104:398–408. doi:10.1016/j.trb.2017.08.006

92. Yoon, J, and Hwang, SJ. Combined group and exclusive sparsity for deep neural networks. In: Proceedings of the 34th international conference on machine learning; Sydney, Australia. JMLR (2017). p. 3958–66.

93. Yuan, M, and Lin, Y. Model selection and estimation in regression with grouped variables. J Roy Stat Soc B. (2006). 68:49–67. doi:10.1111/j.1467-9868.2005.00532.x

94. Yuan, X-T, Li, P, and Zhang, T. Gradient hard thresholding pursuit. J Mach Learn Res. (2017). 18:166–1. doi:10.5555/3122009.3242023

95. Zagoruyko, S, and Komodakis, N. Wide residual networks (2016). Available from: https://arxiv.org/abs/1605.07146.

96. Zhang, C, Bengio, S, Hardt, M, Recht, B, and Vinyals, O. Understanding deep learning requires rethinking generalization (2016). Available from: https://arxiv.org/abs/1611.03530.

97. Zhang, S, and Xin, J. Minimization of transformed L1 penalty: closed form representation and iterative thresholding algorithms. Commun Math Sci. (2017). 15:511–37. doi:10.4310/cms.2017.v15.n2.a9

98. Zhang, S, and Xin, J. Minimization of transformed L1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Math Program. (2018). 169:307–36. doi:10.1007/s10107-018-1236-x

99. Zhang, S, Yin, P, and Xin, J. Transformed schatten-1 iterative thresholding algorithms for low rank matrix completion. Commun Math Sci. (2017). 15:839–62. doi:10.4310/cms.2017.v15.n3.a12

100. Zhang, X, Lu, Y, and Chan, T. A novel sparsity reconstruction method from Poisson data for 3d bioluminescence tomography. J Sci Comput. (2012). 50:519–35. doi:10.1007/s10915-011-9533-z

101. Zhang, X, Zhou, X, Lin, M, and Sun, J. Shufflenet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE (2018). p. 6848–56.

102. Zhang, Y, Dong, B, and Lu, Z. ℓ0 minimization for wavelet frame based image restoration. Math Comput. (2013). 82:995–1015. doi:10.1090/S0025-5718-2012-02631-7

103. Zhou, H, Sehl, ME, Sinsheimer, JS, and Lange, K. Association screening of common and rare genetic variants by penalized regression. Bioinformatics. (2010). 26:2375. doi:10.1093/bioinformatics/btq448

104. Zhou, Y, Jin, R, and Hoi, S. Exclusive lasso for multi-task feature selection. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; Sardinia, Italy. JMLR: W&CP (2010). p. 988–95.

105. Zhuang, Z, Tan, M, Zhuang, B, Liu, J, Guo, Y, Wu, Q, et al. Discrimination-aware channel pruning for deep neural networks. In: Advances in Neural Information Processing Systems; 2018 Dec 2–8; Montréal, Canada. San Diego, CA: NeurIPS (2018). p. 875–86.

Keywords: deep learning, sparsity, nonconvex optimization, sparse group lasso, feature selection

Citation: Bui K, Park F, Zhang S, Qi Y and Xin J (2021) Structured Sparsity of Convolutional Neural Networks via Nonconvex Sparse Group Regularization. Front. Appl. Math. Stat. 6:529564. doi: 10.3389/fams.2020.529564

Received: 25 January 2020; Accepted: 16 October 2020;
Published: 24 February 2021.

Edited by:

Lucia Tabacu, Old Dominion University, United States

Reviewed by:

Michael Chen, York University, Canada
Yunlong Feng, University at Albany, United States

Copyright © 2021 Bui, Park, Zhang, Qi and Xin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jack Xin, jack.xin@uci.edu
