Structured Sparsity of Convolutional Neural Networks via Nonconvex Sparse Group Regularization

Convolutional neural networks (CNNs) have been hugely successful in recent years, achieving superior accuracy in imaging applications such as classification, object detection, and segmentation. However, a highly accurate CNN model requires millions of parameters to be trained and utilized. Even a slight increase in performance requires significantly more parameters because of added layers and/or more filters per layer. Many of these weight parameters turn out to be redundant, so the original, dense model can be replaced by a compressed version attained by imposing inter- and intra-group sparsity onto the layer weights during training. In this paper, we propose a nonconvex family of sparse group lasso that blends nonconvex regularization (e.g., transformed $\ell_1$, $\ell_1 - \ell_2$, and $\ell_0$), which induces sparsity on the individual weights, with $\ell_{2,1}$ regularization on the output channels of a layer. We apply variable splitting to the proposed regularization to develop an algorithm that consists of two steps per iteration: gradient descent and thresholding. Numerical experiments on various CNN architectures showcase the effectiveness of the nonconvex family of sparse group lasso in network sparsification, with test accuracy on par with the current state of the art.


INTRODUCTION
Deep neural networks (DNNs) have proven to be advantageous for numerous modern computer vision tasks involving image or video data. In particular, convolutional neural networks (CNNs) yield highly accurate models with applications in image classification [28,39,77,95], semantic segmentation [13,49], and object detection [30,72,73]. These large models contain millions of weight parameters, often more than the number of training samples. This is a double-edged sword: on one hand, large models allow for high accuracy, while on the other, they contain many redundant parameters that lead to overparametrization. Overparametrization is a well-known phenomenon in DNN models [6,17] that results in overfitting, learning useless random patterns in data [96], and inferior generalization. Additionally, these models place exorbitant computational and memory demands on both training and inference. Consequently, they may not be applicable to devices with low computational power and memory.
Resolving these problems requires compressing the networks through sparsification and pruning. Although removing weights might affect the accuracy and generalization of the models, previous works [25,54,66,81] demonstrated that many networks can be substantially pruned with negligible effect on accuracy. There are many systematic approaches to achieving sparsity in DNNs, as discussed extensively in Refs. 14 and 15.
Han et al. [26] proposed to first train a dense network, prune it afterward by setting the weights to zeroes if below a fixed threshold, and retrain the network with the remaining weights. Jin et al. [32] extended this method by restoring the pruned weights, training the network again, and repeating the process. Rather than pruning by thresholding, Aghasi et al. [1,2] proposed Net-Trim, which prunes an already trained network layer by layer using convex optimization in order to ensure that the layer inputs and outputs remain consistent with the original network. For CNNs in particular, filter or channel pruning is preferred because it significantly reduces the amount of weight parameters required compared to individual weight pruning. Li et al. [43] calculated the sums of absolute weights of the filters of each layer and pruned the ones with the smallest sums. Hu et al. [29] proposed a metric called average percentage of zeroes for channels to measure their redundancies and pruned those with highest values for each layer. Zhuang et al. [105] developed discrimination-aware channel pruning that selects channels that contribute to the network's discriminative power.
An alternative approach to pruning a dense network is learning a compressed structure from scratch. A conventional approach is to optimize the loss function equipped with either $\ell_1$ or $\ell_2$ regularization, which drives the weights to zero or to very small values during training. To learn which groups of weights (e.g., neurons, filters, channels) are necessary, group regularization, such as group lasso [93] or sparse group lasso [76], is added to the loss function. Alvarez and Salzmann [4] and Scardapane et al. [75] applied group lasso and sparse group lasso to various architectures and obtained compressed networks with comparable or even better accuracy. Instead of sharing features among the weights as suggested by group sparsity, exclusive sparsity [104] promotes competition for features between different weights. This method was investigated by Yoon and Hwang [92], who also combined it with group sparsity and demonstrated that the combination resulted in compressed networks with better performance than their original counterparts. Nonconvex regularization has also been examined. Louizos et al. [54] proposed a practical algorithm using probabilistic methods to perform $\ell_0$ regularization on CNNs. Ma et al. [61] proposed integrated transformed $\ell_1$, a convex combination of transformed $\ell_1$ and group lasso, and compared its performance against the aforementioned group regularization methods.
In this paper, we propose a family of group regularization methods that balances group lasso for group-wise sparsity with nonconvex regularization for element-wise sparsity. The family extends sparse group lasso by replacing the $\ell_1$ penalty term with a nonconvex penalty term. The nonconvex penalty terms considered are $\ell_0$, $\ell_1 - \alpha\ell_2$, transformed $\ell_1$, and SCAD. The proposed family is expected to yield a more accurate and/or more compressed network than sparse group lasso, since $\ell_1$ suffers from various weaknesses due to being a convex relaxation of $\ell_0$. We develop an algorithm to optimize loss functions equipped with the proposed nonconvex group regularization terms for DNNs.

Preliminaries
Given a training dataset consisting of $N$ input-output pairs $\{(x_i, y_i)\}_{i=1}^{N}$, the weight parameters of a DNN are learned by optimizing the following objective function:

$$\min_{W} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(h(x_i, W), y_i) + \lambda \mathcal{R}(W), \tag{1}$$

where
• $W$ is the set of weight parameters of the DNN.
• $h(\cdot, \cdot)$ is the output of the DNN used for prediction.
• $\mathcal{L}(\cdot, \cdot) \geq 0$ is the loss function that compares the prediction $h(x_i, W)$ with the ground-truth output $y_i$. Examples include the cross-entropy loss function for classification and mean-squared error for regression.
• $\mathcal{R}(\cdot)$ is the regularizer on the set of weight parameters $W$.
• $\lambda > 0$ is a regularization parameter for $\mathcal{R}(\cdot)$.
The most common regularizer used for DNNs is $\ell_2$ regularization $\|\cdot\|_2^2$, also known as weight decay. It prevents overfitting and improves generalization because it forces the weights to decrease proportionally to their magnitudes [40]. Sparsity can be imposed by pruning weights whose magnitudes are below a certain threshold at each iteration during training. An alternative regularizer is the $\ell_1$ norm $\|\cdot\|_1$, also known as the lasso penalty [78]. The $\ell_1$ norm is the tightest convex relaxation of the $\ell_0$ penalty [20,23,82], and it yields a sparse solution that is found on the corners of the 1-norm ball [27,52]. Theoretical results justify the $\ell_1$ norm's ability to reconstruct sparse solutions in compressed sensing. When a sensing matrix satisfies the restricted isometry property, the $\ell_1$ norm recovers the sparse solution exactly with high probability [11,23,82]. On the other hand, the null space property is a necessary and sufficient condition for $\ell_1$ minimization to guarantee exact recovery of sparse solutions [16,23]. Being able to yield sparse solutions, the $\ell_1$ norm has gained popularity in other types of inverse problems such as compressed imaging [33,57] and image segmentation [34,35,42] and in various fields of application such as geoscience [74], medical imaging [33,57], machine learning [10,36,67,78,89], and traffic flow networks [91]. Unfortunately, element-wise sparsity by $\ell_1$ or $\ell_2$ regularization in CNNs may not yield meaningful speedup, as the number of filters and channels required for computation and inference may remain the same [86].
To determine which filters or channels are relevant in each layer, group sparsity using the group lasso penalty [93] is considered. The group lasso penalty has been utilized in various applications, such as microarray data analysis [62], machine learning [7,65], and EEG data [46]. Suppose a DNN has $L$ layers, so the set of weight parameters $W$ is divided into $L$ sets of weights: $W = \{W_l\}_{l=1}^{L}$. The weight set of each layer $W_l$ is divided into $N_l$ groups (e.g., channels or filters): $W_l = \{w_{l,g}\}_{g=1}^{N_l}$. The group lasso penalty applied to $W_l$ is formulated as

$$\mathcal{R}_{GL}(W_l) = \sum_{g=1}^{N_l} \sqrt{\#w_{l,g}}\, \|w_{l,g}\|_2 = \sum_{g=1}^{N_l} \sqrt{\#w_{l,g}} \sqrt{\sum_{i=1}^{\#w_{l,g}} w_{l,g,i}^2}, \tag{2}$$

where $w_{l,g,i}$ corresponds to the weight parameter with index $i$ in group $g$ in layer $l$ and the term $\#w_{l,g}$ denotes the number of weight parameters in group $g$ in layer $l$. Because group sizes vary, the constant $\sqrt{\#w_{l,g}}$ is multiplied in order to rescale the $\ell_2$ norm of each group with respect to the group size, ensuring that each group is weighed uniformly [65,76,93]. The group lasso regularizer imposes the $\ell_2$ norm on each group, forcing weights of the same group to decrease together at every iteration during training. As a result, groups of weights are pruned when their $\ell_2$ norms are negligible, resulting in a highly compact network compared to element-sparse networks.
As an alternative to group lasso, which encourages feature sharing, exclusive sparsity [104] forces the model weight parameters to compete for features, making the features discriminative for each class in the context of classification. The regularization for exclusive sparsity is

$$\mathcal{R}_{ES}(W_l) = \frac{1}{2}\sum_{g=1}^{N_l} \|w_{l,g}\|_1^2 = \frac{1}{2}\sum_{g=1}^{N_l} \left(\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|\right)^2. \tag{3}$$

Now, within each group, sparsity is enforced. Because exclusivity alone cannot guarantee optimal features, as some features do need to be shared, exclusive sparsity can be combined with group sparsity to form combined group and exclusive sparsity (CGES) [92]. CGES is formulated as

$$\mathcal{R}_{CGES}(W_l) = (1 - \mu_l)\,\mathcal{R}_{GL}(W_l) + \mu_l\, \mathcal{R}_{ES}(W_l), \tag{4}$$

where $\mu_l \in (0, 1)$ is a parameter for balancing exclusivity and sharing among features.
To obtain an even sparser network, element-wise sparsity and group sparsity can be combined and applied together to the training of DNNs. One regularizer that combines these two types of sparsity is the sparse group lasso penalty [76], which is formulated as

$$\mathcal{R}_{SGL_1}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_1, \qquad \|W_l\|_1 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|. \tag{5}$$

Sparse group lasso simultaneously enforces group sparsity through the regularizer $\mathcal{R}_{GL}(\cdot)$ and element-wise sparsity through the $\ell_1$ norm. This regularizer has been used in machine learning [83], bioinformatics [48,103], and medical imaging [47]. Figure 1 demonstrates the differences between lasso, group lasso, and sparse group lasso applied to a weight matrix connecting a 5-dimensional input layer to a 10-dimensional output layer. In white, the entries are zeroed out; in gray, the entries are not. Unlike lasso, group lasso results in a more structured method of pruning since three of the five neurons can be zeroed out. Combined with $\ell_1$ regularization on the individual weights, sparse group lasso allows more weights in the remaining two neurons to be pruned.
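To make the three penalties concrete, the following minimal NumPy sketch evaluates lasso, group lasso, and sparse group lasso on a small weight matrix. It assumes that each row of the matrix is one group (one output neuron), that all groups have the same size, and that the group term is scaled by the square root of the group size; the function names are illustrative and are not taken from the authors' code.

```python
import numpy as np

def lasso(W):
    """Element-wise l1 penalty: sum of absolute values of all weights."""
    return float(np.sum(np.abs(W)))

def group_lasso(W):
    """Group lasso penalty with rows as groups (assumed layout).
    Each group's l2 norm is scaled by sqrt(group size)."""
    group_size = W.shape[1]
    return float(np.sum(np.sqrt(group_size) * np.linalg.norm(W, axis=1)))

def sparse_group_lasso(W):
    """Sparse group lasso: group lasso plus the element-wise l1 penalty."""
    return group_lasso(W) + lasso(W)

# Three neurons (rows) with two incoming weights each; the second is pruned.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 1.0]])

print("lasso:", lasso(W))
print("group lasso:", group_lasso(W))
print("sparse group lasso:", sparse_group_lasso(W))
```

A row that is entirely zero contributes nothing to either term, which is why the group term rewards removing whole neurons while the $\ell_1$ term rewards zeroing individual entries in the rows that remain.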

Nonconvex Sparse Group Lasso
We recall that the $\ell_1$ norm is the tightest convex relaxation of the $\ell_0$ penalty, given by

$$\|W_l\|_0 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|_0, \qquad |w|_0 = \begin{cases} 1 & \text{if } w \neq 0, \\ 0 & \text{if } w = 0, \end{cases} \tag{6}$$

when applied to the weight set $W_l$ of layer $l$. The $\ell_0$ penalty is nonconvex and discontinuous. In addition, any $\ell_0$-regularized problem is NP-hard [23]. These properties make developing convergent and tractable algorithms for $\ell_0$-regularized problems difficult, thereby making $\ell_1$-regularized problems better alternatives to solve. However, $\ell_0$-regularized problems have been shown to recover better solutions in terms of sparsity and/or accuracy than do $\ell_1$-regularized problems in various applications, such as compressed sensing [56], image restoration [8,12,19,55,102], MRI reconstruction [80], and machine learning [56,94]. In particular, $\ell_0$-regularized inverse problems were demonstrated to be more robust against Poisson noise than are $\ell_1$-regularized inverse problems [100].

A continuous alternative to the $\ell_0$ penalty is the SCAD penalty term [22,58], given by

$$\lambda\|W_l\|_{SCAD(a)} = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} \lambda|w_{l,g,i}|_{SCAD(a)}, \tag{7}$$

where

$$\lambda|w|_{SCAD(a)} := \begin{cases} \lambda|w| & \text{if } |w| < \lambda, \\ \dfrac{2a\lambda|w| - w^2 - \lambda^2}{2(a-1)} & \text{if } \lambda \leq |w| < a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |w| \geq a\lambda, \end{cases}$$

for $\lambda > 0$ and $a > 2$. This penalty term enjoys three properties (unbiasedness, sparsity, and continuity), while the $\ell_1$ norm has only sparsity and continuity [22]. In linear and logistic regression, SCAD was shown to outperform $\ell_1$ in variable selection [22]. SCAD has been applied to wavelet approximation [5], bioinformatics [9,84], and compressed sensing [64]. The transformed $\ell_1$ penalty term [68] also enjoys the properties of unbiasedness, sparsity, and continuity [58]. In fact, the regularizer is not just continuous but Lipschitz continuous [98]. The term is given by

$$\|W_l\|_{TL1(a)} = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}|_{TL1(a)}, \qquad |w|_{TL1(a)} = \frac{(a+1)|w|}{a + |w|}. \tag{8}$$
The transformed $\ell_1$ penalty term was investigated and shown to outperform $\ell_1$ in compressed sensing [79,97,98], deep learning [45,61,87], matrix completion [99], and epidemic forecasting [45]. Another Lipschitz continuous, nonconvex regularizer is the $\ell_1 - \alpha\ell_2$ penalty given by

$$\|W_l\|_1 - \alpha\|W_l\|_2 = \sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} |w_{l,g,i}| - \alpha\sqrt{\sum_{g=1}^{N_l}\sum_{i=1}^{\#w_{l,g}} w_{l,g,i}^2}, \tag{9}$$

where $\alpha \in (0, 1]$. In a series of works [50-52,90], the penalty term $\ell_1 - \ell_2$ with $\alpha = 1$ yields better solutions than does $\ell_1$ in various compressed sensing applications, especially when the sensing matrix is highly coherent or violates the restricted isometry property condition. To guarantee exact recovery of sparse solutions, $\ell_1 - \ell_2$ only requires a relaxed variant of the null space property [79]. Furthermore, $\ell_1 - \alpha\ell_2$ is more robust against impulsive noise in yielding sparse, accurate solutions for inverse problems than is $\ell_1$ [44]. Besides compressed sensing, it has been utilized in image denoising and deblurring [53], image segmentation [71], image inpainting [63], and hyperspectral demixing [21]. In deep learning applications, $\ell_1 - \ell_2$ regularization was used to learn permutation matrices [59] for ShuffleNet [60,101].

FIGURE 1 | Comparison between lasso, group lasso, and sparse group lasso applied to a weight matrix. Entries in white are zeroed out or removed; entries in gray remain.

Frontiers in Applied Mathematics and Statistics | www.frontiersin.org | February 2021 | Volume 6 | Article 529564

Due to the advantages and recent successes of the aforementioned nonconvex regularizers, we propose to replace the $\ell_1$ norm in Eq. 5 with nonconvex penalty terms. Hence, we propose a family of group regularizers called nonconvex sparse group lasso. The family includes the following:

$$\mathcal{R}_{SGL_0}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_0, \tag{10}$$
$$\mathcal{R}_{SGSCAD(a)}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_{SCAD(a)}, \tag{11}$$
$$\mathcal{R}_{SGTL_1(a)}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_{TL1(a)}, \tag{12}$$
$$\mathcal{R}_{SGL_1 - L_2}(W_l) = \mathcal{R}_{GL}(W_l) + \|W_l\|_1 - \alpha\|W_l\|_2. \tag{13}$$

Using these regularizers, we expect to obtain a sparser and/or more accurate network than from using the original sparse group lasso. The $\ell_1$ norm can also be replaced with other nonconvex penalties not mentioned in this paper; refer to Refs. 3 and 85 for examples. However, we focus on the aforementioned nonconvex regularizers because they have closed-form proximal operators, which are required by our proposed algorithm described in the next section.
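The element-wise penalties discussed in this section can be evaluated directly. The sketch below is a minimal NumPy illustration, assuming the standard piecewise form of SCAD and the transformed $\ell_1$ formula $(a+1)|w|/(a+|w|)$; the function names and the default values of `a` and `alpha` are chosen here for illustration only.

```python
import numpy as np

def l0(w):
    """l0 penalty: number of nonzero entries."""
    return float(np.count_nonzero(w))

def scad(w, lam, a=3.7):
    """Sum over entries of lam * |w|_SCAD(a), using the standard piecewise form."""
    x = np.abs(w)
    small = lam * x                                          # |w| < lam
    mid = (2 * a * lam * x - x**2 - lam**2) / (2 * (a - 1))  # lam <= |w| < a*lam
    large = (a + 1) * lam**2 / 2                             # |w| >= a*lam
    return float(np.sum(np.where(x < lam, small,
                                 np.where(x < a * lam, mid, large))))

def tl1(w, a=1.0):
    """Transformed l1 penalty: sum of (a + 1)|w| / (a + |w|)."""
    x = np.abs(w)
    return float(np.sum((a + 1) * x / (a + x)))

def l1_minus_l2(w, alpha=1.0):
    """l1 - alpha * l2 penalty on the flattened weights."""
    flat = np.ravel(w)
    return float(np.sum(np.abs(flat)) - alpha * np.linalg.norm(flat))

w = np.array([0.0, 0.5, -2.0])
print("l0:", l0(w))
print("SCAD (lam=1):", scad(w, lam=1.0))
print("TL1 (a=1):", tl1(w))
print("l1 - l2:", l1_minus_l2(w))
```

Note how SCAD saturates at $(a+1)\lambda^2/2$ for large weights and transformed $\ell_1$ stays below $a + 1$ per entry, whereas the $\ell_1$ norm keeps growing linearly; this bounded growth is what the unbiasedness property refers to.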

Notations and Definitions
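Before discussing the algorithm, we summarize notations that we will use to save space. They are the following:
• $\tilde{\mathcal{L}}(W) := \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(h(x_i, W), y_i)$ is the average loss over the training set.
• $F_\beta(V, W) := \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda \mathcal{R}_{GL}(W_l) + \lambda r(V_l) + \frac{\beta}{2}\|V_l - W_l\|_2^2\right]$, where $r(\cdot)$ is the element-wise sparsity-inducing regularizer and $\beta > 0$.
• $(V^k, W^k)$ denotes the $k$-th iterate, with $V_l^k$ and $W_l^k$ its components for layer $l$.

In addition, we define the proximal operator for the regularization function $r(\cdot)$ as follows:

$$\mathrm{prox}_{\lambda r}(x) = \arg\min_{y}\ \lambda r(y) + \frac{1}{2}\|x - y\|_2^2$$

for $\lambda > 0$.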

Numerical Optimization
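We develop a general algorithm framework to solve

$$\min_{W} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(h(x_i, W), y_i) + \lambda\sum_{l=1}^{L}\mathcal{R}(W_l) = \tilde{\mathcal{L}}(W) + \lambda\sum_{l=1}^{L}\left[\mathcal{R}_{GL}(W_l) + r(W_l)\right], \tag{14}$$

where $W = \{W_l\}_{l=1}^{L}$, $\mathcal{R}$ is either $\mathcal{R}_{SGL_1}$ or one of the nonconvex regularizers Eqs. 10-13, and $r(\cdot)$ is the corresponding sparsity-inducing regularizer. Throughout the paper, our assumption on Eq. 14 is the following:

ASSUMPTION 1. The function $\tilde{\mathcal{L}}$ is continuously differentiable with respect to $W_l$ for each $l = 1, \ldots, L$.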
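By introducing an auxiliary variable $V = \{V_l\}_{l=1}^{L}$ for Eq. 14, we have a constrained optimization problem:

$$\min_{V, W}\ \tilde{\mathcal{L}}(W) + \lambda\sum_{l=1}^{L}\left[\mathcal{R}_{GL}(W_l) + r(V_l)\right] \quad \text{s.t.} \quad V_l = W_l \ \text{ for } l = 1, \ldots, L. \tag{15}$$

The constraints can be relaxed by adding quadratic penalty terms with $\beta > 0$ so that we have

$$\min_{V, W}\ F_\beta(V, W) := \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda\left(\mathcal{R}_{GL}(W_l) + r(V_l)\right) + \frac{\beta}{2}\|V_l - W_l\|_2^2\right]. \tag{16}$$

With $\beta$ fixed, Eq. 16 can be solved by alternating minimization:

$$W^{k+1} = \arg\min_{W} F_\beta(V^k, W), \tag{17a}$$
$$V^{k+1} = \arg\min_{V} F_\beta(V, W^{k+1}). \tag{17b}$$

To solve Eq. 17a, we simultaneously update $W_l$ for $l = 1, \ldots, L$ by gradient descent

$$W_l^{t+1} = W_l^{t} - \gamma\left[\nabla_{W_l}\tilde{\mathcal{L}}(W^t) + \lambda\,\partial_{W_l}\mathcal{R}_{GL}(W_l^t) - \beta\left(V_l^k - W_l^t\right)\right], \tag{18}$$

where $\gamma > 0$ is the learning rate and $\partial_{W_l}\mathcal{R}_{GL}$ is the subdifferential of $\mathcal{R}_{GL}$ with respect to $W_l$. In practice, Eq. 18 is performed using stochastic gradient descent (or one of its variants) with mini-batches because of the large amount of data and the number of weight parameters in a typical DNN.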
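To update $V$, we see that Eq. 17b can be rewritten as

$$V_l^{k+1} \in \arg\min_{V_l}\ \lambda r(V_l) + \frac{\beta}{2}\|V_l - W_l^{k+1}\|_2^2 = \mathrm{prox}_{\frac{\lambda}{\beta} r}\left(W_l^{k+1}\right), \quad l = 1, \ldots, L. \tag{19}$$

The proximal operators for the considered regularizers are thresholding functions with closed-form solutions, and as a result, the $V$ update simplifies to thresholding $W$. The regularization functions and their corresponding proximal operators are summarized in Table 1.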
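Two of these thresholding functions are well known and easy to state in code. The sketch below shows soft thresholding (the proximal operator of the $\ell_1$ norm) and hard thresholding (the proximal operator of the $\ell_0$ penalty), under the convention $\mathrm{prox}_{t r}(x) = \arg\min_y\, t\, r(y) + \frac{1}{2}\|x - y\|_2^2$. It assumes the $V$ update uses the threshold parameter $t = \lambda/\beta$; the helper `update_V` is a hypothetical name, and the operators for SCAD, transformed $\ell_1$, and $\ell_1 - \alpha\ell_2$ are not reproduced here.

```python
import numpy as np

def prox_l1(x, t):
    """Soft thresholding: proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l0(x, t):
    """Hard thresholding: proximal operator of t * ||.||_0.
    Entries with |x| <= sqrt(2t) are set to zero."""
    return np.where(np.abs(x) > np.sqrt(2.0 * t), x, 0.0)

def update_V(W, lam, beta, prox):
    """V step of the splitting scheme: threshold W with parameter lam / beta
    (assumed form of the update)."""
    return prox(W, lam / beta)

W = np.array([-3.0, 0.5, 2.0])
print("soft:", prox_l1(W, 1.0))
print("hard:", prox_l0(W, 1.0))
print("V update:", update_V(W, lam=1.0, beta=2.0, prox=prox_l1))
```

Because the threshold is $\lambda/\beta$, increasing $\beta$ shrinks the threshold, so the $V$ update removes fewer weights as the penalty parameter grows.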
Incorporating the algorithm that solves the quadratic penalty problem Eq. 16, we now develop a general algorithm to solve Eq. 14. We solve a sequence of quadratic penalty problems Eq. 16 with increasing $\beta$ in order to solve Eq. 14. This algorithm is based on the quadratic penalty method [69] and the penalty decomposition method [56]. The algorithm is summarized in Algorithm 1.
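As an illustration of the quadratic penalty scheme, the sketch below applies it with the $\ell_1$ penalty (SGL$_1$) to a toy linear layer with a least-squares loss. It is a simplified sketch under stated assumptions, not Algorithm 1 itself: full-batch gradient descent replaces stochastic gradient descent, fixed iteration counts replace stopping criteria, the $V$ step is taken to be soft thresholding with threshold $\lambda/\beta$, and all names and default values (`fit`, `lr`, `outer`, `alt`, `inner`) are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (element-wise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def group_lasso_subgrad(W):
    """One subgradient of sum_g sqrt(size_g) * ||W[g]||_2 with rows as groups.
    The zero vector is used for groups that are exactly zero."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    safe = np.where(norms > 0.0, norms, 1.0)
    return np.sqrt(W.shape[1]) * W / safe

def fit(X, Y, lam=0.05, beta=1.0, sigma=1.25, lr=0.05,
        outer=10, alt=5, inner=20, seed=0):
    """Alternate W and V updates for fixed beta, then increase beta by sigma."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = 0.1 * rng.standard_normal((Y.shape[1], X.shape[1]))
    V = W.copy()
    betas = []
    for _ in range(outer):
        for _ in range(alt):
            # W step: gradient descent on the penalized objective with V fixed.
            for _ in range(inner):
                grad_loss = (X @ W.T - Y).T @ X / n
                W = W - lr * (grad_loss + lam * group_lasso_subgrad(W)
                              - beta * (V - W))
            # V step: threshold W (proximal operator of (lam/beta) * l1).
            V = soft_threshold(W, lam / beta)
        betas.append(beta)
        beta *= sigma  # tighten the coupling between V and W
    return V, W, betas

# Toy data: three output neurons, the second of which is unused.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
W_true = np.array([[2.0, 0.0, -1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 1.5, 0.0, 0.0, 0.5]])
Y = X @ W_true.T

V, W, betas = fit(X, Y)
print("fraction of zero entries in V:", np.mean(V == 0.0))
print("max |V - W|:", np.max(np.abs(V - W)))
```

Since soft thresholding moves each entry by at most $\lambda/\beta$, the gap between $V$ and $W$ shrinks as $\beta$ grows, which mirrors how the penalty method drives the iterates toward the constraint $V = W$.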
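An alternative algorithm to solve Eq. 14 is proximal gradient descent [70]. By this method, the update for $W_l$, $l = 1, \ldots, L$, is

$$W_l^{t+1} = \mathrm{prox}_{\gamma\lambda r}\left(W_l^{t} - \gamma\left[\nabla_{W_l}\tilde{\mathcal{L}}(W^t) + \lambda\,\partial_{W_l}\mathcal{R}_{GL}(W_l^t)\right]\right). \tag{20}$$

Using this algorithm results in weight parameters with some already zeroed out. However, the advantage of our proposed algorithm lies in Eq. 17a, written more specifically as

$$W^{k+1} = \arg\min_{W}\ \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\|V_l^k - W_l\|_2^2\right] = \arg\min_{W}\ \tilde{\mathcal{L}}(W) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\sum_{i}\left(v_{l,i}^k - w_{l,i}\right)^2\right]. \tag{21}$$

We see that this step performs exact weight decay, or $\ell_2$ regularization, on a weight $w_{l,i}$ whenever $v_{l,i}^k = 0$. On the other hand, when $v_{l,i}^k \neq 0$, the effect of $\ell_2$ regularization on the corresponding weight $w_{l,i}$ is mitigated according to the absolute difference $|v_{l,i}^k - w_{l,i}|$. Using $\ell_2$ regularization was shown to give superior pruning results in terms of accuracy by Han et al. [26]. Our proposed algorithm can be perceived as an adaptive $\ell_2$ regularization method, where Eq. 17b identifies which weights to perform exact $\ell_2$ regularization on and Eq. 17a updates and regularizes the weights accordingly.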
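The adaptive weight decay interpretation can be checked with a short worked derivative. As a sketch, consider a single weight $w$ with auxiliary value $v$ in the quadratic penalty term (the scalar notation is used here only for illustration):

```latex
\frac{\partial}{\partial w}\,\frac{\beta}{2}(v - w)^2 = \beta (w - v)
=
\begin{cases}
\beta w, & v = 0 \quad \text{(the gradient of } \tfrac{\beta}{2} w^2 \text{, i.e., weight decay)},\\[4pt]
\beta (w - v), & v \neq 0 \quad \text{(pulls } w \text{ toward } v \text{ rather than toward } 0\text{)}.
\end{cases}
```

In the first case the weight is shrunk toward zero at a rate set by $\beta$; in the second, the pull vanishes as $w$ approaches the thresholded value $v$.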

Convergence Analysis
To establish convergence for the proposed algorithm, the results below state that an accumulation point of the sequence generated by Eqs. 17a and 17b is a block-coordinate minimizer, and that an accumulation point generated by Algorithm 1 is a sparse feasible solution to Eq. 15. Proofs are provided in Section 5. Unfortunately, the feasible solution generated may not be a local minimizer of Eq. 15 because the loss function $\mathcal{L}(\cdot, \cdot)$ is nonconvex. However, it was shown in Ref. 18 that an algorithm similar to Algorithm 1, but with fixed $\beta$ in a bounded interval, generates an approximate global solution with high probability for a one-layer CNN with ReLU activation function.
Remark: To safely ensure that $\{F_{\beta_k}(V^k, W^k)\}_{k=1}^{\infty}$ is uniformly bounded in practice, we can find a feasible solution $(V^{feas}, W^{feas})$ to Eq. 15 and impose a bound $M$ such that

Application to Deep Neural Networks
We compare the proposed nonconvex sparse group lasso against four other methods as baselines: group lasso, sparse group lasso (SGL$_1$), CGES proposed in Ref. 92, and the group variant of $\ell_0$ regularization (denoted as $\ell_0$ for simplicity) proposed in Ref. 54. SGL$_1$ is optimized using the same algorithm proposed for nonconvex sparse group lasso. For the group terms, the weights are grouped together based on the filters or output channels, which we will refer to as neurons. We trained various CNN architectures on MNIST [41] and CIFAR 10/100 [38]. The MNIST dataset consists of 60k training images and 10k test images. MNIST is trained on two simple CNN architectures: LeNet-5-Caffe [31,41] and a 4-layer CNN with two convolutional layers (32 and 64 channels, respectively) and an intermediate layer of 1000 fully connected neurons. CIFAR 10/100 is a dataset that has 10/100 classes split into 50k training images and 10k test images. It is trained on Resnets [28] and wide Resnets [95]. Throughout all of our experiments, for SGSCAD($a$), we set $a = 3.7$ as suggested in Ref. 22; for SGTL$_1$($a$), we set $a = 1.0$ as suggested in Ref. 99; and for SGL$_1$−L$_2$, we set $\alpha = 1.0$ as suggested in the literature [50-52,90]. For CGES, we have $\mu_l = l/L$.

Because the optimization algorithms do not drive most, if not all, of the weights and neurons exactly to zero, we have to set them to zero when their values are below a certain threshold. In our experiments, if the absolute value of a weight is below $10^{-5}$, we set it to zero. Weight sparsity is then defined as the percentage of zero weights with respect to the total number of weights trained in the network. If the normalized sum of the absolute values of the weights of a neuron is less than $10^{-5}$, then the weights of the neuron are set to zero. Neuron sparsity is defined as the percentage of neurons whose weights are all zero with respect to the total number of neurons in the network.
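The two sparsity measures can be sketched in a few lines of NumPy. The example below follows the order of the description above: weights are thresholded first and weight sparsity is counted, then neurons are tested and neuron sparsity is counted. It assumes that each layer is stored as a 2-D array with one row per neuron and that "normalized sum" means the mean absolute weight of the neuron; both are assumptions made for illustration, and `sparsity_report` is a hypothetical helper rather than a function from the authors' code.

```python
import numpy as np

def sparsity_report(layers, tol=1e-5):
    """Prune small weights and neurons, then return the pruned layers with
    weight sparsity and neuron sparsity as percentages."""
    weights_total = neurons_total = 0
    weights_zero = neurons_zero = 0
    pruned_layers = []
    for W in layers:
        # Step 1: zero out individual weights with magnitude below tol.
        W = np.where(np.abs(W) < tol, 0.0, W)
        weights_zero += int(np.sum(W == 0.0))
        weights_total += W.size
        # Step 2: zero out neurons whose mean absolute weight is below tol
        # (mean is the assumed normalization).
        dead = np.mean(np.abs(W), axis=1) < tol
        W = np.where(dead[:, None], 0.0, W)
        neurons_zero += int(np.sum(dead))
        neurons_total += W.shape[0]
        pruned_layers.append(W)
    weight_sparsity = 100.0 * weights_zero / weights_total
    neuron_sparsity = 100.0 * neurons_zero / neurons_total
    return pruned_layers, weight_sparsity, neuron_sparsity

layer = np.array([[0.5, 1e-7],
                  [1e-6, -1e-8],
                  [3e-5, 0.0]])
pruned, weight_sparsity, neuron_sparsity = sparsity_report([layer])
print("weight sparsity (%):", weight_sparsity)
print("neuron sparsity (%):", neuron_sparsity)
```

In this toy layer, four of the six weights fall below the threshold or are already zero, and only the second neuron is removed entirely.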

MNIST Classification
MNIST is trained on Lenet-5-Caffe, which has four layers with 1,370 total neurons and 431,080 total weight parameters. The same type of regularization is applied to all layers of the network. No other regularization methods (e.g., dropout and batch normalization) are used. The network is optimized using Adam [37] with initial learning rate 0.001. Every 40 epochs, the learning rate decays by a factor of 0.1. We set the regularization parameter to the following values: $\lambda = \alpha/60000$ for $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. For SGL$_1$ and nonconvex sparse group lasso, we set $\beta = 25\alpha/60000$, and every 40 epochs, $\beta$ increases by a factor of $\sigma = 1.25$. The network is trained for 200 epochs across 5 runs.

Table 2 reports the mean results for test error, weight sparsity, and neuron sparsity across five runs of Lenet-5-Caffe trained after 200 epochs. We see that although CGES has the lowest test errors at $\alpha \in \{0.1, 0.3, 0.4\}$ and the largest weight sparsity for all $\alpha \in \{0.1, 0.2, \ldots, 0.5\}$, nonconvex sparse group lasso's test errors and weight sparsity are comparable. Additionally, nonconvex sparse group lasso's neuron sparsity is nearly two times larger than the neuron sparsity attained by CGES. Across all parameters and methods, SGL$_0$ with $\alpha = 0.5$ attains the best average test error of 0.630 with average weight sparsity 95.7% and neuron sparsity 80.7%. Furthermore, its test error is lower than the test errors of the other nonconvex sparse group lasso regularization methods for all $\alpha$'s tested. Generally, SGL$_1$ and nonconvex sparse group lasso outperform the $\ell_0$ regularization proposed by Louizos et al. [54] and group lasso in average weight and neuron sparsity.

Table 3 reports the mean results for test error, weight sparsity, and neuron sparsity of the Lenet-5-Caffe models with the lowest test errors from the five runs. According to the results, the best test errors are attained by SGL$_0$ at $\alpha = 0.3, 0.5$; SGL$_1$−L$_2$ at $\alpha = 0.2$; and CGES at $\alpha = 0.1, 0.4$. For average weight sparsity, SGL$_0$ attains the largest weight sparsity at $\alpha \in \{0.2, 0.3, 0.4, 0.5\}$. For average neuron sparsity, the largest values are attained by SGTL$_1$ at $\alpha = 0.1, 0.2$; by SGL$_1$ at $\alpha = 0.3$; and by SGL$_0$ at $\alpha = 0.4, 0.5$. Although SGL$_0$ does not outperform all the other methods across the board, its results are still comparable to the best results. Overall, we see that nonconvex sparse group lasso outperforms $\ell_0$ in test error, weight sparsity, and neuron sparsity and outperforms group lasso in weight and neuron sparsity.
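The Lenet-5-Caffe hyperparameter schedule described above amounts to simple arithmetic, sketched below. The sketch assumes that epochs are counted from zero and that both the learning-rate decay and the increase in $\beta$ take effect at epochs 40, 80, 120, and 160; the function name and this indexing convention are assumptions for illustration, not details confirmed by the paper.

```python
def lenet_schedule(alpha, epoch, n_train=60000, lr0=1e-3,
                   beta_scale=25.0, sigma=1.25, period=40):
    """Return (lam, beta, lr) at a given epoch (0-indexed, assumed convention)."""
    stage = epoch // period                       # completed 40-epoch periods
    lam = alpha / n_train                         # lam = alpha / 60000
    beta = beta_scale * alpha / n_train * sigma ** stage
    lr = lr0 * 0.1 ** stage                       # decay by 0.1 every period
    return lam, beta, lr

for epoch in (0, 40, 199):
    lam, beta, lr = lenet_schedule(alpha=0.1, epoch=epoch)
    print(f"epoch {epoch}: lam={lam:.3e}, beta={beta:.3e}, lr={lr:.1e}")
```

Under these assumptions, $\beta$ grows by a total factor of $1.25^4 \approx 2.44$ over the 200 epochs, so the threshold $\lambda/\beta$ used in the $V$ update shrinks by the same factor.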
MNIST is also trained on a 4-layer CNN with two convolutional layers with 32 and 64 channels, respectively, and an intermediate layer with 1000 neurons. Each convolutional layer uses 5 × 5 convolutional filters. The 4-layer CNN has 2,120 total neurons and 1,087,010 total weight parameters. The same type of regularization is applied to all layers of the network. The network is optimized with the same settings as Lenet-5-Caffe. However, the regularization parameter is different: we have $\lambda = \alpha/60000$ for $\alpha \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$. For SGL$_1$ and nonconvex sparse group lasso, we set $\beta = 5\alpha/60000$, and every 40 epochs, $\beta$ increases by a factor of $\sigma = 1.25$. The network is trained for 200 epochs across 5 runs.

Table 4 reports the mean results for test error, weight sparsity, and neuron sparsity across five runs of the 4-layer CNN models trained after 200 epochs. Although CGES consistently has the highest weight sparsity, it does not yield the most accurate models until $\alpha \geq 0.8$. Moreover, its neuron sparsity is smaller than the neuron sparsity of group lasso, SGL$_1$, and nonconvex sparse group lasso when $\alpha \geq 0.6$. $\ell_0$ has the highest neuron sparsity for all $\alpha$'s given, but its test errors are much greater. When $\alpha \leq 0.6$, SGSCAD yields the most accurate models at $\alpha = 0.2, 0.6$ while SGL$_1$ yields one at $\alpha = 0.4$. Overall, we see that nonconvex sparse group lasso has weight sparsity and neuron sparsity comparable to those of group lasso and SGL$_1$.

Table 5 reports the mean results for test error, weight sparsity, and neuron sparsity of the 4-layer CNN models with the lowest test errors from the five runs. At $\alpha = 0.2$, SGL$_1$ and SGSCAD have the lowest test errors, but their weight sparsity is exceeded by CGES and their neuron sparsity is exceeded by $\ell_0$. At $\alpha = 0.4$, SGL$_1$−L$_2$ has the lowest test error, but its weight sparsity and neuron sparsity are exceeded by CGES and $\ell_0$, respectively. At $\alpha = 0.6$, SGL$_1$ has the lowest test error, but SGSCAD has the largest weight sparsity with comparable test error. At $\alpha \geq 0.8$, CGES has the lowest test error, but its weight sparsity is exceeded by group lasso, SGL$_1$, and the nonconvex sparse group lasso regularizers, which all have slightly higher test error. At $\alpha = 0.8$, the neuron sparsity of CGES is comparable to the neuron sparsity of group lasso, SGL$_1$, and the nonconvex sparse group lasso regularizers. At $\alpha = 1.0$, group lasso has the highest neuron sparsity, but nonconvex sparse group lasso has only slightly lower neuron sparsity. In general, the weight sparsity of nonconvex sparse group lasso is comparable to or larger than the weight sparsity of group lasso and SGL$_1$.

trained for 200 epochs across 5 runs. We excluded the $\ell_0$ regularization by Louizos et al. [54] because it was unstable for the provided $\alpha$'s. Furthermore, we only analyze the models with the lowest test errors since the test errors did not stabilize by the end of the 200 epochs in our experiments.

Table 6 reports mean test error, weight sparsity, and neuron sparsity across the Resnet-40 models trained on CIFAR 10 with the lowest test errors from the five runs. Group lasso has the lowest test errors for all $\alpha$'s provided, while the test errors of CGES, SGL$_1$, and nonconvex sparse group lasso are higher by at most 1.1%. When $\alpha \leq 1.5$, CGES has the largest weight sparsity, while SGSCAD, SGTL$_1$, and SGL$_1$−L$_2$ have larger weight sparsity than does group lasso. At $\alpha = 2.0, 2.5$, SGSCAD has the largest weight sparsity. At $\alpha = 3.0$, SGL$_1$ has the largest weight sparsity with test error comparable to that of the nonconvex sparse group lasso regularizers. For neuron sparsity, SGL$_1$−L$_2$ has the largest at $\alpha = 1.0$ while SGSCAD has the largest at $\alpha = 1.5, 2.0$. However, at $\alpha = 2.5, 3.0$, group lasso has the largest neuron sparsity. For all $\alpha$'s tested, SGSCAD has higher weight sparsity and neuron sparsity than does SGL$_1$ but with comparable test error.
Table 7 reports mean test error, weight sparsity, and neuron sparsity across the Resnet-40 models trained on CIFAR 100 with the lowest test errors from the five runs. Group lasso has the lowest test errors for $\alpha \leq 3.5$ while CGES has the lowest test error at $\alpha = 4.0$. However, the weight sparsity and the neuron sparsity of group lasso are lower than the sparsity of SGL$_1$ and some of the nonconvex sparse group lasso regularizers. CGES has the lowest neuron sparsity across all $\alpha$'s. Among the nonconvex sparse group lasso penalties, SGSCAD has the best test errors, which are lower than the test errors of SGL$_1$ for all $\alpha$'s except 2.5.

Table 8 reports mean test error, weight sparsity, and neuron sparsity across the WRN-28-10 models trained on CIFAR 10 with the lowest test errors from the five runs. The best test errors are attained by SGTL$_1$ at $\alpha = 0.05, 0.2, 0.5$; by CGES at $\alpha = 0.01$; and by SGL$_1$ at $\alpha = 0.1$. The weight sparsity of CGES outperforms the other methods only when $\alpha = 0.01, 0.05, 0.1$, but it underperforms when $\alpha \geq 0.2$. Weight sparsity levels of group lasso and nonconvex sparse group lasso are comparable across all $\alpha$. For neuron sparsity, SGL$_1$−L$_2$ attains the largest values at $\alpha = 0.02, 0.1, 0.2$. Nevertheless, the other nonconvex sparse group lasso methods have comparable neuron sparsity. Overall, SGL$_1$, SGL$_0$, SGSCAD, and SGTL$_1$ outperform group lasso in test error while having similar or higher weight and neuron sparsity.

Table 9 reports mean test error, weight sparsity, and neuron sparsity across the WRN-28-10 models trained on CIFAR 100 with the lowest test errors from the five runs. According to the results, the best test errors are attained by CGES when $\alpha = 0.01, 0.05$; by SGSCAD when $\alpha = 0.1, 0.5$; and by SGTL$_1$ when $\alpha = 0.2$. Although CGES has the largest weight sparsity for $\alpha = 0.01, 0.05, 0.1, 0.2$, we see that its test error increases as $\alpha$ increases. When $\alpha = 0.5$, the best weight sparsity is attained by SGSCAD, but the other methods have comparable weight sparsity. The best neuron sparsity is attained by CGES at $\alpha = 0.01, 0.02$; by SGL$_1$−L$_2$ at $\alpha = 0.1, 0.2$; and by SGSCAD at $\alpha = 0.5$. The neuron sparsity levels among the nonconvex sparse group lasso methods are comparable. For $\alpha \leq 0.2$, we see that SGL$_1$ and nonconvex sparse group lasso outperform group lasso in test error while having comparable weight and neuron sparsity.

Algorithm Comparison
We compare the proposed Algorithm 1 with direct stochastic gradient descent, where the gradient of the regularizer is approximated by backpropagation, and with proximal gradient descent, discussed in Section 2.4, by applying them to SGL$_1$ on Lenet-5 trained on MNIST. The parameter setting for this CNN is discussed in Section 3.1.1. Table 10 reports the mean results for test error, weight sparsity, and neuron sparsity across five models trained after 200 epochs, while Figure 2 provides visualizations.

the proposed algorithm are better than the sparsity by direct stochastic gradient descent. Therefore, our proposed algorithm generates the most accurate model with satisfactory sparsity among the three algorithms for sparse regularization.

CONCLUSION AND FUTURE WORK
In this work, we propose nonconvex sparse group lasso, a nonconvex extension of sparse group lasso. The ℓ 1 norm in sparse group lasso on the weight parameters is replaced with a nonconvex regularizer whose proximal operator is a thresholding function. Taking advantage of this property, we develop a new algorithm to optimize loss functions regularized with nonconvex sparse group lasso for CNNs in order to attain a sparse network with competitive accuracy. We compare the proposed family of regularizers with various baseline methods on MNIST and CIFAR 10/100 on different CNNs.
The experimental results demonstrate that in general, nonconvex sparse group lasso generates a more accurate and/or more compressed CNN than does group lasso. In addition, we compare our proposed algorithm to direct stochastic gradient descent and proximal gradient descent on Lenet-5 trained on MNIST. The results show that the proposed algorithm to solve SGL 1 yields a satisfactorily sparse network with lower test error than do the other two algorithms.
According to the numerical results, there is no single sparse regularizer that outperforms all others on every CNN trained on a given dataset. A regularizer may perform well in one case and worse in another. Due to the myriad of sparse regularizers to select from and the various parameters to tune, especially for one CNN trained on a given dataset, one direction is to develop an automatic machine learning framework that efficiently selects the right regularizer and parameters. In recent works, automatic machine learning has been represented as a matrix completion problem [88] and as a statistical learning problem [24]. These frameworks can be adapted for selecting the best sparse regularizer, thus saving time for users who are training sparse CNNs.

PROOFS
We provide proofs for the results discussed in Section 2.5.

Proof of Theorem 2
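By Eqs. 17a and 17b, for each $k \in \mathbb{N}$, we have

$$F_\beta(V^k, W^{k+1}) \leq F_\beta(V^k, W) \tag{22}$$

for all $W$, and

$$F_\beta(V^{k+1}, W^{k+1}) \leq F_\beta(V, W^{k+1}) \tag{23}$$

for all $V$. By Eq. 23, we have

$$F_\beta(V^{k+1}, W^{k+1}) \leq F_\beta(V^k, W^{k+1}) \tag{24}$$

for each $k \in \mathbb{N}$. Altogether, we have

$$F_\beta(V^{k+1}, W^{k+1}) \leq F_\beta(V^k, W^k) \tag{25}$$

for each $k \in \mathbb{N}$, so $\{F_\beta(V^k, W^k)\}_{k=1}^{\infty}$ is nonincreasing. Since $F_\beta(V^k, W^k) \geq 0$ for all $k \in \mathbb{N}$, its limit $\lim_{k \to \infty} F_\beta(V^k, W^k)$ exists. From Eqs. 22-24, we have

$$F_\beta(V^{k+1}, W^{k+1}) \leq F_\beta(V^k, W^{k+1}) \leq F_\beta(V^k, W^k).$$

Taking the limit gives us

$$\lim_{k \to \infty} F_\beta(V^k, W^{k+1}) = \lim_{k \to \infty} F_\beta(V^k, W^k). \tag{26}$$

Since $(V^*, W^*)$ is an accumulation point of $\{(V^k, W^k)\}_{k=1}^{\infty}$, there exists a subsequence $K$ such that

$$\lim_{k \in K \to \infty} (V^k, W^k) = (V^*, W^*). \tag{27}$$

Because $r(\cdot)$ is lower semicontinuous and $\lim_{k \in K \to \infty} V^k = V^*$, there exists $k' \in K$ such that $k \geq k'$ implies $r(V_l^k) \geq r(V_l^*)$ for each $l = 1, \ldots, L$. Using this result along with Eq. 23, we obtain

$$F_\beta(V, W^k) \geq F_\beta(V^k, W^k) \geq \tilde{\mathcal{L}}(W^k) + \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l^k) + \lambda r(V_l^*) + \frac{\beta}{2}\|V_l^k - W_l^k\|_2^2\right] \tag{28}$$

for $k \geq k'$. As $k \in K \to \infty$, we have

$$F_\beta(V, W^*) \geq F_\beta(V^*, W^*) \tag{29}$$

by continuity, so it follows that $V^* \in \arg\min_V F_\beta(V, W^*)$. For notational convenience, let

$$\tilde{\mathcal{R}}_{\lambda,\beta}(V, W) := \sum_{l=1}^{L}\left[\lambda\mathcal{R}_{GL}(W_l) + \frac{\beta}{2}\|V_l - W_l\|_2^2\right].$$

By Eq. 22, we have

$$\tilde{\mathcal{L}}(W) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W) \geq \tilde{\mathcal{L}}(W^{k+1}) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^{k+1}). \tag{30}$$

Because $\lim_{k \in K \to \infty} V^k$ exists, the sequence $\{V^k\}_{k \in K}$ is bounded. If $r(\cdot)$ is $\ell_0$, transformed $\ell_1$, or SCAD, then $\{r(V^k)\}_{k \in K}$ is bounded. If $r(\cdot)$ is $\ell_1$, then $r(\cdot)$ is coercive. If $r(\cdot)$ is $\ell_1 - \alpha\ell_2$, then $r(\cdot)$ is bounded above by $\ell_1$. Overall, it follows that $\{r(V^k)\}_{k \in K}$ is bounded as well. Hence, there exists a further subsequence $\bar{K} \subset K$ such that $\lim_{k \in \bar{K} \to \infty} r(V^k)$ exists. So, we obtain

$$\begin{aligned}
\lim_{k \in \bar{K} \to \infty} \tilde{\mathcal{L}}(W^{k+1}) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^{k+1})
&= \lim_{k \in \bar{K} \to \infty} \left[F_\beta(V^k, W^{k+1}) - \lambda\sum_{l=1}^{L} r(V_l^k)\right] \\
&= \lim_{k \in \bar{K} \to \infty} F_\beta(V^k, W^{k+1}) - \lim_{k \in \bar{K} \to \infty} \lambda\sum_{l=1}^{L} r(V_l^k) \\
&= \lim_{k \in \bar{K} \to \infty} F_\beta(V^k, W^{k}) - \lim_{k \in \bar{K} \to \infty} \lambda\sum_{l=1}^{L} r(V_l^k) \\
&= \lim_{k \in \bar{K} \to \infty} \tilde{\mathcal{L}}(W^{k}) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^k, W^{k}) \\
&= \tilde{\mathcal{L}}(W^*) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W^*)
\end{aligned} \tag{31}$$

after applying Eq. 26 in the third equality and by continuity in the last equality.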
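Taking the limit over the subsequence $\bar{K}$ in Eq. 30 and applying Eq. 31, we obtain

$$\tilde{\mathcal{L}}(W) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W) \geq \tilde{\mathcal{L}}(W^*) + \tilde{\mathcal{R}}_{\lambda,\beta}(V^*, W^*) \tag{32}$$

by continuity. Adding $\lambda\sum_{l=1}^{L} r(V_l^*)$ to both sides yields

$$F_\beta(V^*, W) \geq F_\beta(V^*, W^*),$$

from which it follows that $W^* \in \arg\min_W F_\beta(V^*, W)$. This completes the proof.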

Proof of Theorem 3
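Because $(V^*, W^*)$ is an accumulation point, there exists a subsequence $K$ such that $\lim_{k \in K \to \infty} (V^k, W^k) = (V^*, W^*)$. If $\{F_{\beta_k}(V^k, W^k)\}_{k=1}^{\infty}$ is uniformly bounded, there exists $M$ such that $F_{\beta_k}(V^k, W^k) \leq M$ for all $k \in \mathbb{N}$. Then we have

$$\frac{\beta_k}{2}\sum_{l=1}^{L}\|V_l^k - W_l^k\|_2^2 \leq F_{\beta_k}(V^k, W^k) \leq M.$$

As a result,

$$\sum_{l=1}^{L}\|V_l^k - W_l^k\|_2^2 \leq \frac{2M}{\beta_k}.$$

Taking the limit over $k \in K$, and because $\beta_k \to \infty$, we have

$$\sum_{l=1}^{L}\|V_l^* - W_l^*\|_2^2 = 0,$$

from which it follows that $V^* = W^*$. As a result, $(V^*, W^*)$ is a feasible solution to Eq. 15.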

DATA AVAILABILITY STATEMENT
The datasets MNIST and CIFAR 10/100 for this study are available through the PyTorch package in Python. Code for the numerical experiments in Section 3 is available at https://github.com/kbui1993/Official_Nonconvex_SGL.

AUTHOR CONTRIBUTIONS
KB and FP performed the experiments and analysis. All authors contributed to the design, evaluation, discussions and production of the manuscript.

FUNDING
The work was partially supported by NSF grants IIS-1632935, DMS-1854434, DMS-1924548, DMS-1952644 and the Qualcomm Faculty Award.