Improving Adversarial Robustness via Attention and Adversarial Logit Pairing

Though deep neural networks have achieved state-of-the-art performance in visual classification, recent studies have shown that they are vulnerable to adversarial examples. In this paper, we develop improved techniques for defending against adversarial examples. First, we propose an enhanced defense technique denoted Attention and Adversarial Logit Pairing (AT + ALP), which encourages both the attention maps and the logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, AT + ALP improves accuracy on adversarial examples over adversarial training. We show that AT + ALP effectively increases the average activations of adversarial examples in the key area and demonstrate that it focuses on discriminative features to improve the robustness of the model. Finally, we conduct extensive experiments on a wide range of datasets, and the results show that AT + ALP achieves state-of-the-art defense performance. For example, on the 17 Flower Category Database, under strong 200-iteration Projected Gradient Descent (PGD) gray-box and black-box attacks where prior art has 34 and 39% accuracy, our method achieves 50 and 51%. Compared with previous work, our work is evaluated under a highly challenging PGD attack: the maximum perturbation is ϵ ∈ {0.25, 0.5} under the L∞ norm, with 10–200 attack iterations. To the best of our knowledge, such a strong attack has not been previously explored on a wide range of datasets.


INTRODUCTION
In recent years, deep neural networks have been extensively deployed for computer vision tasks, particularly for visual classification problems, where new algorithms have been reported to achieve even better performance than human beings Krizhevsky et al. (2012), He et al. (2015), Li et al. (2019a). The success of deep neural networks has led to an explosion in demand. However, recent studies have shown that they are all vulnerable to the attack of adversarial examples Szegedy et al. (2013); Carlini and Wagner (2016); Moosavi-Dezfooli et al. (2016); Bose and Aarabi (2018). Small and often imperceptible perturbations to the input images are sufficient to fool the most powerful deep neural networks.
In Figure 1, we visualize the spatial attention map of a flower and its corresponding adversarial image on ResNet-50 He et al. (2015) pretrained on ImageNet Russakovsky et al. (2015). The figure suggests that adversarial perturbations, while small in the pixel space, lead to very substantial "noise" in the attention map of the network. Whereas the features for the clean image appear to focus primarily on semantically informative content in the image, the attention map for the adversarial image is activated across semantically irrelevant regions as well. The state-of-the-art adversarial training methods only encourage hard labels Madry et al. (2017); Tramèr et al. (2017) or logits Kannan et al. (2018) for pairs of clean examples and adversarial counterparts to be similar. In our opinion, it is not enough to align clean examples and their adversarial counterparts only at the end of the network, i.e., at the hard labels or logits; the attention maps at important intermediate parts of the network should be aligned as well. Motivated by this observation, we explore Attention and Adversarial Logit Pairing (AT + ALP), a method that encourages both the attention maps and the logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, AT + ALP improves accuracy on adversarial examples over adversarial training.
The contributions of this paper are summarized as follows:
FIGURE 1 | … Russakovsky et al. (2015), which shows where the network focuses in order to classify the given image. (C) is the adversarial image of (A); (D) is the corresponding spatial attention map.
FIGURE 2 | Schematic representation of Attention and Adversarial Logit Pairing (AT + ALP): a baseline model is trained not only to produce similar logits for the original image and the adversarial image, but also to produce similar spatial attention maps for them.
The rest of the paper is organized as follows: in Section 2, we present the related works; in Section 3, we introduce definitions and threat models; in Section 4, we propose our Attention and Adversarial Logit Pairing (AT + ALP) method; in Section 5, we show extensive experimental results; and Section 6 concludes.
TABLE 1 | Defense against white-box attack on CIFAR-10. The adversarial perturbations were produced using Fast Gradient Sign (FGS) Goodfellow et al. (2015), Projected Gradient Descent (PGD) Madry et al. (2017), AutoAttack (AA) Croce and Hein (2020) and RayS Chen and Gu (2020).

RELATED WORK
… images that are generated on-the-fly during training. For adversarial training, the most relevant work to our study is Kannan et al. (2018), which introduces a technique called Adversarial Logit Pairing (ALP). This method encourages the logits for pairs of examples to be similar. Our AT + ALP encourages both the attention maps and the logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, AT + ALP improves accuracy on adversarial examples over adversarial training. Araujo et al. (2019) adds random noise at training and inference time, and other work adds denoising blocks to the model to increase adversarial robustness; neither of these approaches focuses on the attention map.
In terms of methodology, our work is also related to deep transfer learning and knowledge distillation. The most relevant works are Zagoruyko and Komodakis (2016); Li et al. (2019b), which constrain the L2-norm of the difference between the behaviors of two networks (i.e., the feature maps of outer layer outputs in the source/target networks). Our AT + ALP constrains the attention maps and the logits for pairs of clean examples and their adversarial counterparts to be similar.

DEFINITIONS AND THREAT MODELS
In this paper, we always assume the attacker is capable of forming attacks that consist of perturbations of limited L∞-norm. This is a simplified task chosen because it is more amenable to benchmark evaluations. We consider two different threat models characterizing the amount of information the adversary can have:
• Gray-box Attack: We focus on defense against gray-box attacks in this paper. In a gray-box attack, the attacker knows both the original network and the defense algorithm. Only the parameters of the defense model are hidden from the attacker. This is also a standard setting assumed in many security systems and applications Pfleeger and Pfleeger (2004).
• Black-box Attack: The attacker has no information about the model's architecture or parameters, and no ability to send queries to the model to gather more information.

Adversarial Training
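We use adversarial training with Projected Gradient Descent (PGD) Madry et al. (2017) as the underlying basis for our methods:
$$\arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim \hat{p}_{\mathrm{data}}} \left[ \max_{\|\delta\|_\infty \le \epsilon} L(\theta, x + \delta, y) \right]$$
where $\hat{p}_{\mathrm{data}}$ is the underlying training data distribution, $L(\theta, x + \delta, y)$ is a loss function at data point $x$ which has true class $y$ for a model with parameters $\theta$, and the maximization with respect to $\delta$ is approximated using PGD. In this paper, the loss is defined as:
$$L = L_{CE} + \alpha L_{ALP} + \beta L_{AT}$$
where $L_{CE}$ is cross entropy, $L_{ALP}$ and $L_{AT}$ are the logit pairing and attention map terms described in the following subsections, and $\alpha$ and $\beta$ are hyperparameters.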
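To make the inner maximization concrete, the following is a minimal sketch of an L∞ PGD attack. It is not the authors' implementation: the linear softmax classifier, the helper names (`ce_loss_and_input_grad`, `pgd_linf`), the step size of 2/255, and the absence of a random start are all assumptions chosen so the example runs without a deep learning framework.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss_and_input_grad(W, b, x, y):
    """Mean cross entropy of a linear softmax classifier and its gradient w.r.t. x."""
    p = softmax(x @ W + b)
    n = x.shape[0]
    loss = -np.log(p[np.arange(n), y] + 1e-12).mean()
    p[np.arange(n), y] -= 1.0           # p - onehot(y): gradient w.r.t. the logits (times n)
    grad_x = (p @ W.T) / n              # chain rule back to the input
    return loss, grad_x

def pgd_linf(W, b, x, y, eps, step, iters):
    """Approximate the max over ||delta||_inf <= eps by projected sign-gradient ascent."""
    x_adv = x.copy()                    # assumption: start at the clean input (no random start)
    for _ in range(iters):
        _, g = ce_loss_and_input_grad(W, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)          # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep pixel values in a valid range
    return x_adv

# Toy data standing in for images and a trained model (hypothetical values).
rng = np.random.default_rng(0)
W = rng.normal(size=(12, 3))
b = np.zeros(3)
x = rng.uniform(size=(5, 12))
y = rng.integers(0, 3, size=5)

x_adv = pgd_linf(W, b, x, y, eps=8 / 255, step=2 / 255, iters=7)
clean_loss, _ = ce_loss_and_input_grad(W, b, x, y)
adv_loss, _ = ce_loss_and_input_grad(W, b, x_adv, y)
```

In adversarial training, the outer minimization would then update the model parameters on `x_adv` rather than on `x`.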

Adversarial Logit Pairing
We also use Adversarial Logit Pairing (ALP) to encourage the logits from clean examples and their adversarial counterparts to be similar to each other. For a model that takes inputs $x$ and computes a vector of logits $z = f(x)$, logit pairing adds a loss:
$$L_{ALP} = L_a\big(f(x), f(x + \delta)\big)$$
In this paper we use the L2 loss for $L_a$.
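The logit pairing term can be sketched in a few lines. The text does not say whether the L2 loss is squared or how it is reduced over the batch, so the version below (squared L2 distance, averaged over the batch) is an assumption, and `logit_pairing_loss` is a hypothetical helper name.

```python
import numpy as np

def logit_pairing_loss(logits_clean, logits_adv):
    """Mean over the batch of the squared L2 distance between paired logit vectors."""
    diff = np.asarray(logits_clean, dtype=float) - np.asarray(logits_adv, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Two examples with three classes each; the second pair already agrees.
z_clean = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
z_adv = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.0]])
alp = logit_pairing_loss(z_clean, z_adv)
```

The loss is zero exactly when the clean and adversarial logits coincide, so minimizing it pulls the two outputs together regardless of the true label.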

Attention Map
We use Attention Map (AT) to encourage the attention maps from clean examples and their adversarial counterparts to be similar to each other. Let I denote the indices of all pairs of activation layers whose attention maps we want to match. Then, we can define the following total loss:
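As a rough illustration of such an attention pairing term, the sketch below follows the activation-based attention transfer of Zagoruyko and Komodakis (2016), which this paper cites. The specific choices here are assumptions rather than the authors' stated definition: the attention map is the channel-wise sum of squared activations, each flattened map is L2-normalised, and the per-layer distances are summed with an L2 norm. The function names are hypothetical.

```python
import numpy as np

def spatial_attention(activations):
    """Collapse a (C, H, W) activation tensor into a flattened, L2-normalised attention map."""
    amap = np.sum(np.asarray(activations, dtype=float) ** 2, axis=0)  # sum of squared channels
    vec = amap.reshape(-1)
    return vec / (np.linalg.norm(vec) + 1e-12)   # small constant avoids division by zero

def attention_pairing_loss(acts_clean, acts_adv):
    """Sum over the selected layers of the L2 distance between paired attention maps."""
    return float(sum(
        np.linalg.norm(spatial_attention(a_adv) - spatial_attention(a_clean))
        for a_clean, a_adv in zip(acts_clean, acts_adv)
    ))

# Random activations for two layers stand in for real network features.
rng = np.random.default_rng(0)
clean = [rng.normal(size=(8, 4, 4)), rng.normal(size=(16, 2, 2))]
adv = [a + 0.1 * rng.normal(size=a.shape) for a in clean]

at_same = attention_pairing_loss(clean, clean)
at_diff = attention_pairing_loss(clean, adv)
```

Because each map is normalised before comparison, the term compares where the activation mass is located rather than its overall magnitude.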

Experiments: White-Box Settings
White-box attack is the most challenging task for evaluating a model's adversarial robustness. In white-box settings, attackers are assumed to know all details about the model, including its architecture and parameters. We conduct white-box experiments following common practices Madry et al. (2017); Kannan et al. (2018). Specifically, we use ResNet-18 He et al. (2015) trained on CIFAR-10 Krizhevsky and Hinton (2009). We use FGS Goodfellow et al. (2015), PGD Madry et al. (2017), AutoAttack Croce and Hein (2020) and RayS Chen and Gu (2020) to perform white-box attacks on the evaluated models. We consider untargeted attacks, which are more challenging to defend against than targeted attacks. Adversarial perturbations are measured by the L∞ norm (i.e., the maximum perturbation for each pixel), with an allowed maximum value of ϵ = 8/255.

Image Database
The CIFAR-10 Krizhevsky and Hinton (2009) dataset contains 50,000 training samples and 10,000 test samples, uniformly distributed across 10 classes. Each sample is a 32 × 32 color image. Despite its low image resolution, CIFAR-10 is a popular benchmark for evaluating the adversarial robustness of a model. For adversarial attacks, we adopt the 1-step FGS attack Goodfellow et al. (2015), the 7-iteration PGD attack Madry et al. (2017) and AutoAttack Croce and Hein (2020) with the commonly used perturbation magnitude of ϵ = 8/255 under the L∞ norm. We also evaluate the models with RayS Chen and Gu (2020), which is a gradient-free adversarial attack requiring only the target model's hard-label output. We run each experiment three times and report the average top-1 accuracy. We also report the training time of each method for a more comprehensive comparison. Our experiments are run on Nvidia Tesla V100-SXM2 GPUs.

RESULTS AND DISCUSSION
We present results of the white-box experiment in Table 1. We compare the proposed Attention adversarial training (AT) against relevant methods including PAT Madry et al. (2017), ALP Kannan et al. (2018) and TRADES Zhang et al. (2019). As seen in Table 1, all of these methods show a certain degree of robustness, even under advanced adversarial attacks such as AutoAttack. Specifically, our AT is superior to the baseline methods PAT and ALP, with higher clean accuracy and higher robust accuracy under FGS, PGD and AutoAttack. TRADES Zhang et al. (2019) improves ALP by using an inner maximization to generate the most different counterpart of each clean example. Therefore, TRADES achieves higher adversarial accuracy than the other methods. However, its drawback lies in its efficiency: TRADES is slower than the other adversarial training methods by about 46%. This is because TRADES needs 10 adversarial steps per batch to achieve good performance, while seven steps are enough for ALP and AT. Moreover, the proposed AT achieves the highest clean accuracy among all these adversarial training methods.
RayS Chen and Gu (2020) performs adversarial attacks from a different perspective. As RayS is gradient-free and independent of specific adversarial losses, it can be used to detect possible falsely robust models, especially those that may overfit to specific types of gradient-based attacks and adversarial losses. As seen in Table 1, all advanced adversarial training methods, including ALP, AT and TRADES, show higher robustness under the RayS attack. Our results are consistent with those reported in RayS Chen and Gu (2020): when evaluated on truly robust models, the robust accuracy under RayS is usually higher than that under standard PGD.

Experiments: Gray and Black-Box Settings
To evaluate the effectiveness of our defense strategy, we performed a series of image-classification experiments on the 17 Flower Category Database Nilsback and Zisserman (2006), Part of ImageNet Database and Dogs-vs.-Cats Database. Following Athalye et al. (2018), we assume an adversary that uses the state-of-the-art PGD adversarial attack method.
We consider untargeted attacks when evaluating under the gray- and black-box settings; untargeted attacks are also used in our adversarial training. We evaluate top-1 classification accuracy on validation images that are adversarially perturbed by the attacker. In this paper, adversarial perturbation is considered under the L∞ norm. The value of ϵ is relative to the pixel intensity scale of 256: we use ϵ = 64/256 = 0.25 and ϵ = 128/256 = 0.5. The PGD attacker uses 10–200 attack iterations and a step size of α = 1.0/256 ≈ 0.0039. Our baselines are ResNet-101/152. There are four groups of convolutional structures in the baseline model: group-0 extracts low-level features, group-1 and group-2 extract mid-level features, and group-3 extracts high-level features Zagoruyko and Komodakis (2016). These groups are described as conv2_x, conv3_x, conv4_x and conv5_x in He et al. (2015).
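The attack budget above can be worked through explicitly. The following short derivation assumes that PGD starts from the clean image with no random initialization (the text does not state the initialization), so that each iteration moves any pixel by at most the step size; $\delta_k$ is introduced here to denote the perturbation after $k$ iterations.

```latex
\begin{align*}
\epsilon &\in \left\{ \tfrac{64}{256},\ \tfrac{128}{256} \right\} = \{0.25,\ 0.5\},
  \qquad \alpha = \tfrac{1}{256} \approx 0.0039, \\
\|\delta_k\|_\infty &\le \min(k\alpha,\ \epsilon)
  \quad \text{after } k \text{ iterations started at } \delta_0 = 0, \\
k\alpha \ge \epsilon &\iff k \ge \tfrac{\epsilon}{\alpha}
  = 64 \ \ (\epsilon = 0.25) \quad \text{or} \quad 128 \ \ (\epsilon = 0.5).
\end{align*}
```

Under this assumption, a 10-iteration attack can move each pixel by at most 10/256 ≈ 0.039, and only runs of at least 64 or 128 iterations can exhaust the two budgets. With a random start inside the ϵ-ball this bound does not apply.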

Image Database
We performed a series of image-classification experiments on a wide range of datasets.
• 17 Flower Category Database Nilsback and Zisserman (2006) contains images of flowers belonging to 17 different categories. The images were acquired by searching the web and taking pictures. There are 80 images for each category.
• Part of ImageNet Database contains images of four objects. These four objects are randomly selected from the ImageNet Database Russakovsky et al. (2015). In this experiment, they are tench, goldfish, white shark and dog. Each object has 1,300 training images and 50 test images.
• Dogs-vs.-Cats Database contains 8,000 images of dogs and cats in the training set and 2,000 in the test (validation) set.
FIGURE 6 | Loss landscapes. We see that ALP and AT sometimes induce decreased loss near the input locally and give a "bumpier" optimization landscape, whereas our AT + ALP has better robustness. The z axis represents the loss. If x is the original input, then we plot the loss varying along the space determined by two vectors: r1 = sign(∇_x f(x)) and r2 ∼ Rademacher(0.5). We thus plot the following function: z = loss(x · r1 + y · r2).
Frontiers in Artificial Intelligence | www.frontiersin.org January 2022 | Volume 4 | Article 752831
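The loss-landscape plot described for the two directions r1 and r2 can be sketched as follows. This is an illustration under assumptions: the surface is evaluated at the original input plus a grid of offsets along r1 and r2, a toy quadratic stands in for the network loss, and `loss_surface` is a hypothetical helper.

```python
import numpy as np

def loss_surface(loss_fn, grad_fn, x0, radius, steps, seed=0):
    """Evaluate loss_fn on a 2-D grid around x0 spanned by r1 = sign(grad) and a random sign vector r2."""
    rng = np.random.default_rng(seed)
    r1 = np.sign(grad_fn(x0))                        # gradient-sign direction
    r2 = rng.choice([-1.0, 1.0], size=x0.shape)      # Rademacher(0.5) direction
    coords = np.linspace(-radius, radius, steps)
    z = np.empty((steps, steps))
    for i, a in enumerate(coords):
        for j, c in enumerate(coords):
            z[i, j] = loss_fn(x0 + a * r1 + c * r2)  # loss at the offset point
    return coords, z

# Toy quadratic loss standing in for a network loss (hypothetical).
target = np.array([0.2, -0.1, 0.4])
loss_fn = lambda x: float(np.sum((x - target) ** 2))
grad_fn = lambda x: 2.0 * (x - target)

coords, z = loss_surface(loss_fn, grad_fn, x0=np.zeros(3), radius=0.5, steps=5)
```

The resulting `z` array can be passed to any 3-D surface plotting routine; a smoother surface around the center of the grid corresponds to the behavior the caption attributes to AT + ALP.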

Experimental Setup
To

Results and Discussion
Here, we first present results with AT + ALP on the 17 Flower Category Database. Whereas Kannan et al. (2018) was evaluated under a 10-iteration PGD attack with ϵ = 0.0625, our work is evaluated under a highly challenging PGD attack: the maximum perturbation is ϵ ∈ {0.25, 0.5} under the L∞ norm, with 10–200 attack iterations. The larger the value of ϵ, the larger the perturbation and the stronger the effect of the adversarial image. To the best of our knowledge, such a strong attack has not been previously explored on a wide range of datasets. As shown in Figure 3, our AT + ALP outperforms the state of the art in adversarial robustness against highly challenging gray-box and black-box PGD attacks. For example, under strong 200-iteration PGD gray-box and black-box attacks where prior art has 34 and 39% accuracy, our method achieves 50 and 51%. Table 2 shows the main result of our work: under strong 200-iteration PGD gray-box and black-box attacks, our AT + ALP outperforms the state of the art in adversarial robustness on all these databases.
We visualized activation attention maps for defense against PGD attacks. The baseline model is ResNet-101 He et al. (2015), which is pretrained on ImageNet Russakovsky et al. (2015) and fine-tuned on the 17 Flower Category Database Nilsback and Zisserman (2006). Group-0 to group-3 represent the activation attention maps of the four groups of convolutional structures in the baseline model, i.e., conv2_x, conv3_x, conv4_x and conv5_x of ResNet-101: group-0 extracts low-level features, group-1 and group-2 extract mid-level features, and group-3 extracts high-level features Zagoruyko and Komodakis (2016). Figure 4 shows that group-0 of AT + ALP extracts the outline and texture of flowers more accurately, and group-3 has a higher level of activation on the whole flower. Compared with the other defense methods, only AT + ALP makes an accurate prediction.
We compared average activations on discriminative parts of the 17 Flower Category Database for different defense methods. The 17 Flower Category Database defines discriminative parts of flowers; see Figure 5 for an illustrative example. These discriminative parts are annotated by humans according to their contribution to recognizing a target. In other words, they are crucial features for the classification. For example, the head and feathers should be discriminative parts for recognizing a species of bird. Using all testing examples of the 17 Flower Category Database, we calculated normalized activations on these key regions for the different defense methods. As shown in Table 3, AT + ALP obtained the highest average activations on those key regions, demonstrating that AT + ALP focuses on more discriminative features for flower recognition. We also demonstrate in Figure 6 that AT + ALP shows smoother loss landscapes, which further verifies its effectiveness.
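One way to compute a normalized activation on an annotated key region is sketched below. The normalization is not specified in the text, so the definition used here (the share of total attention mass that falls inside the human-annotated mask) is an assumption, and `key_region_activation` is a hypothetical helper.

```python
import numpy as np

def key_region_activation(attention_map, part_mask):
    """Share of the total attention mass that falls inside the annotated region."""
    amap = np.abs(np.asarray(attention_map, dtype=float))
    mask = np.asarray(part_mask, dtype=bool)
    total = amap.sum()
    return float(amap[mask].sum() / total) if total > 0 else 0.0

# A 2 x 2 attention map whose strongest response lies inside the annotated part.
amap = np.array([[0.0, 1.0], [3.0, 0.0]])
mask = np.array([[False, False], [True, False]])
share = key_region_activation(amap, mask)
```

Averaging this value over all test images gives one number per defense method, which is the kind of comparison Table 3 reports.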

CONCLUSION
In this paper, we introduced an enhanced defense technique called Attention and Adversarial Logit Pairing (AT + ALP), a method that encourages both the attention maps and the logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, AT + ALP improves accuracy on adversarial examples over adversarial training. AT + ALP achieves state-of-the-art defense on a wide range of datasets against PGD gray-box and black-box attacks. Compared with other defense methods, AT + ALP is simple and effective: it neither modifies the model structure nor adds image preprocessing steps.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.