Adversarially Robust Learning via Entropic Regularization

In this paper we propose a new family of algorithms, ATENT, for training adversarially robust deep neural networks. We formulate a new loss function that is equipped with an additional entropic regularization. Our loss function considers the contribution of adversarial samples that are drawn from a specially designed distribution in the data space that assigns high probability to points with high loss and in the immediate neighborhood of training samples. Our proposed algorithms optimize this loss to seek adversarially robust valleys of the loss landscape. Our approach achieves competitive (or better) performance in terms of robust classification accuracy as compared to several state-of-the-art robust learning approaches on benchmark datasets such as MNIST and CIFAR-10.


A.1 Detailed training setup
Architectures: For MNIST-ℓ∞ experiments, we consider a CNN architecture with the following configuration (same as Zhang et al. (2019b)). Feature extraction consists of the following sequence of operations: two layers of 2-D convolutions with 32 channels, kernel size 3 and ReLU activation each, followed by maxpooling by a factor of 2, followed by two layers of 2-D convolutions with 64 channels, kernel size 3 and ReLU activation each, and finally another maxpool (by 2) operation. This is followed by the classification module, consisting of a fully connected layer of size 1024 × 200, ReLU activation, dropout, another fully connected layer of size 200 × 200, ReLU activation, and a final fully connected layer of size 200 × 10. Effectively, this network has 4 convolutional and 3 fully connected layers. We use a batch size of 128 with this configuration.
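For concreteness, the architecture above can be sketched in PyTorch as follows (a reconstruction from the description, not the authors' code; the class name and dropout rate are ours):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """7-layer CNN (4 conv + 3 FC) for 28x28 MNIST inputs, as described above."""
    def __init__(self, drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(),   # 28 -> 26
            nn.Conv2d(32, 32, 3), nn.ReLU(),  # 26 -> 24
            nn.MaxPool2d(2),                  # 24 -> 12
            nn.Conv2d(32, 64, 3), nn.ReLU(),  # 12 -> 10
            nn.Conv2d(64, 64, 3), nn.ReLU(),  # 10 -> 8
            nn.MaxPool2d(2),                  # 8 -> 4, so 64*4*4 = 1024 features
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 200), nn.ReLU(),
            nn.Dropout(drop),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

Note that the two pooling stages reduce the 28 × 28 input to a 4 × 4 × 64 tensor, which is exactly the 1024-dimensional input of the first fully connected layer quoted above.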
For MNIST-ℓ2 experiments, we consider the LeNet5 model from the Advertorch library (same as Ding et al. (2019)). Its feature extractor consists of two 2-D convolutional layers, the first with 32 and the second with 64 channels, each followed by ReLU activation and maxpooling by a factor of 2. The classifier consists of a fully connected layer of size 3136 × 1024 followed by ReLU activation, and a final fully connected layer of size 1024 × 10. We use a batch size of 50 with this configuration.
For CIFAR-ℓ∞ experiments, we consider a WideResNet with 34 layers and widening factor 10 (same as Zhang et al. (2019b) and Madry et al. (2018)). It consists of a 2-D convolution, followed by the 3 building blocks of the WideResNet, ReLU, 2-D average pooling and a fully connected layer. Each building block consists of 5 successive operations of batch normalization, ReLU, 2-D convolution, another batch normalization, ReLU, dropout, a 2-D convolution and a shortcut connection. We use a batch size of 128 with this configuration.
For CIFAR-ℓ2 experiments, we consider a ResNet with 20 layers. This ResNet consists of a 2-D convolution, followed by three groups of 3 basic blocks each, where each basic block contains 2 convolutional layers with batch normalization and ReLU. This is followed by average pooling and a fully connected layer. We use a batch size of 256 with this configuration.
Training SGD and Entropy-SGD models for MNIST experiments: For SGD, we trained the 7-layer convolutional network from Zhang et al. (2019b); Carlini and Wagner (2017) on the MNIST dataset with a batch size of 128, using the SGD optimizer with a learning rate of 0.1, for 50 epochs. For Entropy-SGD, we used 5 Langevin steps, γ = 10⁻³, a batch size of 128, a learning rate of 0.1, and 50 total epochs.

A.2 ℓ2-ATENT
ℓ2-PGD attacks on CIFAR-10: We explore the effectiveness of ℓ2-ATENT as a defense against ℓ2 perturbations. These results are tabulated in Table S1. We test 10-step PGD adversarial attacks at ε₂ = 0.5 and ε₂ = 1. For the purpose of this comparison, we use pretrained models of MMA and PGD-AT at ε₂ = 1. To train ATENT, we use γ = 0.08 for the ε₂ = 1, 10-step attack (with step size 2.5ε₂/10), K = 10 Langevin iterations, Langevin step size η′ = 2ε₂/K, and learning rate η = 0.1 for the weights. ATENT achieves better robust accuracy than all baselines in Table S1. Specifically, the first three rows compare PGD-AT, MMA and ℓ2-ATENT against PGD attacks of radii ε₂ = 0.5, 1. All three models are trained assuming an attack budget of ε₂ = 1. In this setting, all three models have similar performance under the weaker attack ε₂ = 0.5, whereas under the stronger attack ε₂ = 1, ATENT performs best. ATENT performs better even though we train a ResNet20, which has less expressive power than the WideResNet28 used by the baselines. This shows that even with a smaller network, ATENT can produce a model that is more robust than PGD-AT and MMA at high attack radii.
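The inner SGLD loop that ATENT uses to draw adversarial samples can be sketched as follows (a minimal reconstruction of the update implied by Eq. 7, not the authors' implementation; function and argument names are ours, and the defaults mirror the settings quoted above):

```python
import torch

def sgld_adv_samples(model, loss_fn, x, y, gamma=0.08, eta=0.02, K=10, eps=0.12):
    """Run a K-step SGLD chain per batch: ascend the loss L(x') while the
    quadratic term (gamma/2)*||x' - x||^2 keeps samples near the data,
    with Langevin noise of scale eps added at each step (sketch of Eq. 7)."""
    x_adv = x.clone().detach()
    for _ in range(K):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Descend F(x') = (gamma/2)||x' - x||^2 - L(x'):
            # i.e. ascend the loss while being pulled back toward x.
            x_adv = x_adv + eta * (grad - gamma * (x_adv - x))
            # Langevin diffusion term sqrt(2*eta) * eps * N(0, I).
            x_adv = x_adv + (2 * eta) ** 0.5 * eps * torch.randn_like(x_adv)
        x_adv = x_adv.detach()
    return x_adv
```

The returned samples then play the role of the adversarial minibatch on which the weights are updated.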
We also compare against models primarily trained to boost the certificate of randomized smoothing. If the (standard) robust test accuracy is measured by the indicator 1[y = f(x; w)], which takes value 1 if y = f(x; w) and 0 otherwise, then the randomized-smoothing certified robust accuracy is computed using the smoothed classifier g(x) = arg max_{y∈C} P(f(x + δ; w) = y), where C = {1, 2, . . . , m} is the set of possible labels and δ ∼ N(0, σ²I). The certified robust accuracy is then measured by 1[y = g(x; w)]. The smoothing variant of TRADES (Blum et al. (2020)) aims to maximize this certified robust accuracy.
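The smoothed prediction g(x) can be approximated by a Monte-Carlo majority vote over noisy copies of the input (a sketch under the definitions above; the function name and the defaults `sigma` and `n` are illustrative, not taken from the paper):

```python
import torch

def smoothed_predict(model, x, sigma=0.12, n=100):
    """Monte-Carlo estimate of g(x) = argmax_y P(f(x + delta; w) = y)
    with delta ~ N(0, sigma^2 I): vote over n noisy forward passes."""
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(x.shape[0], num_classes, dtype=torch.long)
        for _ in range(n):
            preds = model(x + sigma * torch.randn_like(x)).argmax(dim=1)
            counts += torch.nn.functional.one_hot(preds, num_classes)
        return counts.argmax(dim=1)
```

Certification procedures (e.g. Cohen et al.'s) additionally require a confidence bound on the vote counts; the sketch above only shows the prediction rule.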
Even though ATENT by design does not maximize the certified robust accuracy, we test the generalization capability of ATENT by comparing against the smoothing version of TRADES (Blum et al. (2020)). For this, we train a ResNet20 model both with the TRADES smoothing version at its default parameter setting and with ℓ2-ATENT, at η′ = 0.5ε₂/K, γ = 0.05, and noise scale ε = 0.577 such that the effective noise standard deviation is 0.12. These models are tested against PGD-10 attacks at radius ε₂ = 0.435. In all ℓ2-ATENT experiments, we choose γ = 0.05 such that the perturbation ‖X′_K − X‖_F ≈ ε₂ of the corresponding TRADES and PGD-AT models. For all ATENT experiments, we set α = 0.9. We see that ATENT does better in terms of standard robust accuracy.
Experiments on randomized smoothing: Since the formulation of ATENT is similar to a noisy PGD adversarial training algorithm, we test its efficacy for randomized smoothing and for producing a higher robustness certificate (Table S2). For this, we train a ResNet-20 on CIFAR-10, at γ = 0.05, η′ = 0.02, η = 0.1, K = 10, and tune the noise scale ε such that the effective noise √(2η′)·ε has standard deviation σ = 0.12. We compare the results of randomized smoothing to established benchmarks on ResNet-110 (results borrowed from Table 1 of Blum et al. (2020)), as well as to a smaller ResNet-20 model trained using TRADES at its default settings. We observe that, without any modification to its current form, ATENT is capable of producing a certificate competitive with state-of-the-art methods. Since ATENT does not solve the randomized smoothing objective, we cannot expect to see optimal certified robust accuracies; however, we still see competitive performance. In future work we aim to design modifications of ATENT that serve the objective of certification.

A.3 ℓ∞-ATENT
Training characteristics of ℓ∞-ATENT: In Figure S1 we display the training curves of ATENT. As shown, the robust accuracies spike sharply after the first learning-rate decay, followed by an immediate decrease. This behavior is similar to that observed in Rice et al. (2020), and is also the key intuition behind the design of the learning-rate scheduler for TRADES.

Figure S1: Benign training, test, and robust training/test accuracies of ATENT. The learning rate is decayed at epoch 76, where the robust test accuracy peaks; this is the accuracy reported.

ATENT as attack: For our ℓ∞-ATENT WideResNet-34-10 model, we also test ℓ∞-ATENT as an attack, keeping the same configuration as PGD-20. We compare the performance of our ℓ∞-ATENT trained model (specifically designed to defend against ℓ∞, ε = 8/255 attacks).
The values (Table S3) suggest that the adversarial perturbations generated by ATENT are similar in strength to those produced by PGD (treated as the worst-case attack).

Computational complexity: In terms of computational complexity, ATENT matches PGD and TRADES, as all three approaches are nested iterative optimizations. In Table S4 we tabulate the per-epoch running time of PGD (without random restarts), TRADES and ATENT, using default experiment settings for PGD and TRADES. Note that because we rely on an early-stopping criterion, there is no fair way of comparing the overall running time of all baselines: ATENT requires only 76 epochs overall, whereas TRADES is run for 100 epochs and PGD for 150 epochs. The per-epoch running time of ATENT lies between those of TRADES and PGD, making it competitive in terms of time complexity as well. We also probe the dependence of running time on the number of Langevin iterations K, and find that it is linear in K.
Due to the high computational complexity of all adversarial training algorithms, we also test a fine-tuning approach that trades accuracy for computation, as suggested in Jeddi et al. (2020). In this setting, we take a WideResNet-34-10 pretrained on benign CIFAR-10 samples only, and fine-tune it on adversarial training data via ℓ∞-ATENT using a low learning rate η = 0.0001 for only 20 epochs. The final robust accuracy at ε∞ = 8/255 is 52.1%. This marginally improves upon the robust accuracy (51.7%) reported for the fine-tuned WideResNet-28-10 PGD-AT model in Jeddi et al. (2020). This experiment suggests that ATENT is amenable to fine-tuning pretrained benign models with less computation, at the cost of slightly reduced robust accuracy relative to full adversarial training (roughly a 5% drop at the ε∞ = 8/255 benchmark).
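The fine-tuning recipe above can be sketched as a short training harness (hypothetical code, not from the paper; `make_adv` stands in for any adversarial-example generator, e.g. the inner SGLD sampler of ATENT, and all names are ours):

```python
import torch

def finetune_robust(model, loader, make_adv, epochs=20, lr=1e-4):
    """Fine-tune a benign-pretrained model on adversarial examples only,
    with a low learning rate and few epochs, per Jeddi et al. (2020)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = make_adv(model, x, y)  # e.g. SGLD/PGD perturbed batch
            opt.zero_grad()
            loss_fn(model(x_adv), y).backward()
            opt.step()
    return model
```

The point of the recipe is that only the short fine-tuning phase pays the per-step cost of adversarial example generation, rather than the full 76+ epochs of training from scratch.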

B.1 Theoretical properties of the augmented loss
We now state an informal theorem on the conditions required for convergence of SGLD in Eq. 7 for estimating adversarial samples X′. We restate Lemma 3.1 as follows:

LEMMA B.1. The effective loss F(X′; X, Y, w) := (γ/2)‖X′ − X‖²_F − L(X′; Y, w), which guides the Langevin sampling process in Eq. 7, is (i) smooth if the original loss L is smooth, and (ii) (m, b)-dissipative if L is Lipschitz continuous.

One can then use smoothness and dissipativity of F(X′; X, Y, w) to show convergence of SGLD for the optimization over X′ (Eq. 7) via Theorem 3.3 of Xu et al. (2017).
We first derive smoothness conditions for the effective loss. We use the abbreviations p(X′) := p(X′; X, Y, w), F(X′) := F(X′; X, Y, w), L(X′) := L(X′; Y, w) and L(X) := L(X; Y, w), and assume that X and X′ are vectorized. Unless specified otherwise, ‖·‖ refers to the vector 2-norm.
PROOF. Let us show that ‖∇_X′ F(X′₂) − ∇_X′ F(X′₁)‖ ≤ β′‖X′₂ − X′₁‖ for some β′ > 0. Since ∇_X′ F(X′) = γ(X′ − X) − ∇_X′ L(X′), if the original loss function is β-smooth, i.e. ‖∇L(X′₂) − ∇L(X′₁)‖ ≤ β‖X′₂ − X′₁‖, then

‖∇F(X′₂) − ∇F(X′₁)‖ = ‖γ(X′₂ − X′₁) − (∇L(X′₂) − ∇L(X′₁))‖ ≤ (γ + β)‖X′₂ − X′₁‖,

by application of the triangle inequality. Hence F is (γ + β)-smooth.
Next, we establish conditions required to show (m, b)-dissipativity of F(X′), i.e., ⟨∇_X′ F(X′), X′⟩ ≥ m‖X′‖² − b for positive constants m, b > 0 and all X′. The left side of the inequality expands as

⟨∇F(X′), X′⟩ = γ⟨X′ − X, X′⟩ − ⟨∇L(X′), X′⟩ = γ‖X′‖² − γ⟨X, X′⟩ − ⟨∇L(X′), X′⟩.

Assuming Lipschitz continuity of the original loss, i.e. ‖∇L(X′)‖ ≤ L̃, and applying the Cauchy–Schwarz inequality to the last two terms,

⟨∇F(X′), X′⟩ ≥ γ‖X′‖² − (γ‖X‖ + L̃)‖X′‖ ≥ (γ/2)‖X′‖² − (γ‖X‖ + L̃)²/(2γ),

so that F is (m, b)-dissipative with, e.g., m = γ/2 and b = (γ‖X‖ + L̃)²/(2γ).

With Lemma B.1 we can show convergence of the inner SGLD optimization loop. To minimize the overall loss function, the data-entropy loss L_DE is minimized w.r.t. w via stochastic gradient descent (SGD); the gradient update for the weights w is given by Eq. 6. A loose upper bound on the Lipschitz constant of L_DE is ‖∇_w L_DE(w; X, Y, γ)‖ ≤ L̃(R + 1), if the original loss is L̃-Lipschitz in w and L(X) ≤ R. Due to the complicated form of this expression, establishing β-smoothness will require extra rigor; we defer a more thorough evaluation of the convergence of the outer SGD loop to future work.

Chaudhari et al. (2019) claim that neural networks that favor wide local minima have better generalization properties, in terms of perturbations to data, weights, as well as activations. Mathematically, the formulation of Entropy-SGD can be summarized as follows. A basic way to model the distribution of the weights of the neural network is using a Gibbs distribution of the form

p(w′; w, X, Y) ∝ exp(−L(w′; X, Y) − (γ/2)‖w − w′‖²).
Here γ controls the width of the valley: if γ → ∞, the sampling is sharp, which corresponds to no smoothing effect, while γ → 0 corresponds to a uniform contribution from all points on the loss manifold. The standard objective, min_w L(w; X, Y), can be seen as a sharp sampling of the loss function. If one instead defines the local entropy as

L_ent(w; X, Y) = −log ∫ exp(−L(w′; X, Y) − (γ/2)‖w − w′‖²) dw′,

the new objective is to minimize this augmented objective function L_ent(w; X, Y), which resembles a version of the loss smoothed with a Gaussian kernel. Its gradient takes the form

∇_w L_ent(w; X, Y) = E_{w′∼p(w′)}[γ(w − w′)] = γ(w − E_{w′∼p(w′)}[w′]),

and, using this gradient, the SGD update for a given batch is

w ← w − η γ(w − E_{w′∼p(w′)}[w′]).

This gradient ideally requires computation over the entire training set at once, but can be extended to a batch-wise update rule by borrowing key findings from Welling and Teh (2011). The expectation under p(w′) is computationally intractable; however, via an Euler discretization of the Langevin stochastic differential equation, it can be approximated fairly well by the iteration

w′ₜ₊₁ = w′ₜ + ηₜ ∇_{w′} log p(w′ₜ) + √(2ηₜ) N(0, I),

such that after a large enough number of iterations, w′ₜ converges in distribution to w′_∞ ∼ p(w′). One can estimate E_{w′∼p(w′)}[γ(w − w′)] by averaging over many such iterates. This result is stated as-is from Chaudhari et al. (2019):

E_{w′∼p(w′)}[g(w′)] ≈ (Σₜ ηₜ g(w′ₜ)) / (Σₜ ηₜ).

One can further use exponentially decaying weighted averaging of g(w′ₜ) to estimate E_{w′∼p(w′)}[g(w′)]. The entire procedure is described in Algorithm 1.
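One outer step of this procedure can be sketched as follows (a reconstruction of the Entropy-SGD step from the description above, using exponential averaging for E_{w′}[w′]; the function name, hyperparameter names and defaults are ours):

```python
import torch

def entropy_sgd_step(model, loss_fn, x, y, gamma=1e-3, eta_w=0.1,
                     eta_l=0.1, K=5, alpha=0.75, noise=1e-4):
    """One Entropy-SGD outer step: an inner K-step SGLD chain samples
    w' ~ p(w'; w), an exponential average mu estimates E[w'], and the
    outer update descends grad L_ent = gamma * (w - E[w'])."""
    w0 = [p.detach().clone() for p in model.parameters()]   # current w
    mu = [p.detach().clone() for p in model.parameters()]   # running E[w']
    for _ in range(K):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        with torch.no_grad():
            for p, g, w, m in zip(model.parameters(), grads, w0, mu):
                # SGLD on -log p(w'; w): descend L(w') + (gamma/2)||w - w'||^2
                p -= eta_l * (g + gamma * (p - w))
                p += noise * (2 * eta_l) ** 0.5 * torch.randn_like(p)
                # exponentially decaying average of the iterates w'
                m.mul_(alpha).add_(p, alpha=1 - alpha)
    with torch.no_grad():
        for p, w, m in zip(model.parameters(), w0, mu):
            # outer SGD update: w <- w - eta_w * gamma * (w - E[w'])
            p.copy_(w - eta_w * gamma * (w - m))
    return model
```

Calling this once per minibatch reproduces the nested loop structure of Algorithm 1; the `noise` scale is kept small, as in practice the Langevin noise is heavily damped.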
By design, this algorithm is then further guaranteed to find wide minima in the neighborhood of w, as sketched out in the proofs of Chaudhari et al. (2019).
While this update rule in itself suffices, if the parameters are conditioned on a training sample set X, which is typically large, the gradient term in Eq. S3 is expensive to compute. Welling and Teh (2011) show that the following batch-wise update rule suffices:

θₜ₊₁ = θₜ + η ∇_θ log p(θₜ; X_{B_j}) + √(2η) ε,

where X_{B_j} denotes the j-th mini-batch and ε ∼ N(0, I).