Defense Against Explanation Manipulation

Explainable machine learning attracts increasing attention as it improves the transparency of models, which helps machine learning to be trusted in real applications. However, explanation methods have recently been shown to be vulnerable to manipulation, where a model's explanation can easily be changed while its prediction is kept constant. To tackle this problem, some efforts have been made to use more stable explanation methods or to change model configurations. In this work, we tackle the problem from the training perspective, and propose a new training scheme called Adversarial Training on EXplanations (ATEX) to improve the internal explanation stability of a model regardless of the specific explanation method being applied. Instead of directly specifying explanation values over data instances, ATEX only puts constraints on model predictions, which avoids involving second-order derivatives in optimization. As a further discussion, we also find that explanation stability is closely related to another property of the model, i.e., the risk of being exposed to adversarial attack. Through experiments, we show that ATEX improves model robustness against manipulation targeting explanation, and that it also brings additional benefits, including smoothing explanations and improving the efficacy of adversarial training if applied to the model.


Introduction
Despite the significant improvements over traditional approaches in many tasks, deep models are usually criticized as being black-boxes [9,19,26]. To tackle this problem, explanation methods have attracted increasing attention as they provide a tool for understanding how predictions are made by complex models. Methods that produce feature importance maps [30,31,33] are commonly used as their explanation results are visually intuitive. Furthermore, explanation methods are expected by model developers to diagnose the defects in models [14,20,26] or abnormalities in data instances [10].
Nevertheless, recent work discovered that explanation methods, when applied to deep models, are easily manipulated [11]. That is, we are able to change explanation results without changing model predictions. To tackle this challenge, some efforts [37] have been made to improve the stability of explanation methods by using SmoothGrad [31]. In addition, [8] proposes to replace ReLU activation with the smoothed softplus function to obtain explanations similar to SmoothGrad. However, in the original work [11], the ReLU activation has already been changed to the softplus function, while explanations could still be easily manipulated. It thus implies that more effective techniques, besides smoothing explanations or activation functions, are needed in order to stabilize explanation results.
In this work, we try to modify the training process of neural models to improve their inherent robustness against manipulation targeting explanations. We call our approach Adversarial Training on EXplanations (ATEX). Different from existing efforts which try to select or design a specific explainer that is more stable [18,37], ATEX could benefit various existing explanation methods. Different from the method in [8], we do not need to change the model architecture. More precisely, through training with augmented data, ATEX regularizes model explanations around data samples. However, explicitly controlling explanation results is computationally prohibitive as it requires a significant amount of computation for second-order gradients. Therefore, ATEX implicitly regularizes explanation, and it only requires information of model predictions (zero-order) and gradients (first-order).
Besides stabilizing model explanation, ATEX also brings two additional advantages. First, ATEX helps smooth the feature importance maps of models, even when we only use raw gradient instead of SmoothGrad to compute feature importance. Second, ATEX could improve the efficacy of adversarial training on predictions [13,22], which defends against adversarial samples that cause the model to make wrong predictions. Specifically, traditional adversarial training [13] suffers from the problem that models easily overfit to adversarial examples [22], and an adversarially trained model turns out to be less robust against adversarial examples crafted with different perturbation directions. In this work, we show that the ineffectiveness of adversarial training stems from the same source as model interpretation instability. As a result, applying ATEX will increase the efficacy of adversarial training.
The key contributions of this work are summarized as below:
• We propose a novel adversarial training method called ATEX to increase the stability of explanation of models, so that explanation results are less sensitive to malicious manipulation.
• Models trained with ATEX will produce visually smoothed feature importance maps with one-shot gradient, without applying sophisticated approaches such as SmoothGrad.
• We discuss the positive correlation between interpretation stability and adversarial training efficacy. Through experiments, we show that the efficacy of adversarial training is improved when applied on models fine-tuned with ATEX.
To avoid confusion, we use "manipulation" to refer to attack on explanation, while "adversarial attack" still means attack on model prediction. Correspondingly, we use "ATEX" to mean adversarial training on explanation, while "adversarial training" alone still means the defense method to improve prediction robustness.
2 Algorithm Design for Defense Against Manipulation

Explanation Manipulation
We consider the target neural network model $f: \mathbb{R}^D \to \mathbb{R}^C$ with softplus non-linearities, where an input instance $x \in \mathbb{R}^D$ is predicted as belonging to class $c^* = \arg\max_c f_c(x)$. Given an instance $x$ of interest, the explanation for prediction $f_c(x)$ is $\phi(f_c, x)$, where $\phi: \mathcal{F} \times \mathbb{R}^D \to \mathbb{R}^D$ denotes the explanation function. To facilitate discussion, during the development of ATEX, we assume $\phi$ is based on vanilla gradient [30], i.e., $\phi(f_c, x) = \nabla_x f_c(x)$. The relative importance score of the t-th feature is computed as , which is commonly used in feature importance maps. We will further discuss the scenarios of using other explanation methods in experiments.
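As a minimal sketch of the vanilla-gradient explanation defined above, the following plain-Python snippet computes $\nabla_x f_c(x)$ in closed form for a tiny one-hidden-layer softplus network. The architecture, the random weights, and the function names (`forward`, `vanilla_gradient`) are illustrative assumptions and not the model used in this paper.

```python
import math
import random

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    # Derivative of softplus.
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2):
    # Hypothetical model f: R^D -> R^C with one softplus hidden layer.
    hidden = [softplus(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def vanilla_gradient(x, c, W1, b1, W2):
    # phi(f_c, x) = grad_x f_c(x), obtained with the chain rule.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    scale = [W2[c][h] * sigmoid(pre[h]) for h in range(len(pre))]
    return [sum(scale[h] * W1[h][d] for h in range(len(pre)))
            for d in range(len(x))]

random.seed(0)
D, H, C = 5, 7, 3  # toy sizes chosen for illustration
W1 = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]
b1 = [random.gauss(0, 1) for _ in range(H)]
W2 = [[random.gauss(0, 1) for _ in range(H)] for _ in range(C)]
x = [random.gauss(0, 1) for _ in range(D)]

scores = forward(x, W1, b1, W2)
c_star = scores.index(max(scores))            # predicted class c*
phi = vanilla_gradient(x, c_star, W1, b1, W2)  # one importance value per feature
```

Because softplus is smooth, this gradient varies continuously with $x$, which is what makes second-order (Hessian-based) manipulation possible in the first place.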
The problem of manipulating explanation could be formulated as below [12]: where $d(\cdot, \cdot)$ is the manipulation objective, the first constraint limits perturbation range, and the second constraint preserves prediction. Some typical objectives include:
• Targeted Attack controls explanation to be close to certain predefined patterns, where $T$ is the set of features that the manipulator wants to highlight.
• Untargeted Attack suppresses the contribution of features that were considered as important in clean samples, where
It is worth noting that $T$ contains different elements between the targeted and untargeted attack scenarios.
Figure 1: Illustration of explanation stability and ATEX idea. (a): One perspective of why the explanation is prone to be manipulated, i.e., moving an instance along $\phi^\perp$ will change its explanation as well as prediction. (b): Illustration of ATEX training process (overhead view from y-axis), where each augmented data instance goes through two rounds of sampling. In the first round, $x_i$ is sampled along explanation direction. In the second round, $x_p$ is sampled perpendicularly to explanation. (c) and (d): A prediction function (an ideal case) that is robust to explanation manipulation.

A Naïve Solution
Assume $g$ is the new model to train; a straightforward design for adversarial training is to explicitly require explanations to be constant within the neighborhood of each training sample: where $L(\cdot, \cdot)$ denotes the instance-level training loss between a prediction and the true label. $N(x, \epsilon)$ denotes the neighborhood around $x$ within distance $\epsilon$. The last term in the inner summation explicitly controls the variation of explanation around training samples, while the other terms preserve model prediction performance. Such a design closely mimics the paradigm of traditional adversarial training over model predictions [13].
Nevertheless, there are two problems with the formulation in Equation 2. First, since $\phi$ usually relies on first-order partial derivative information, optimization over explanation maps requires computing and propagating second-order partial derivatives, which could be costly to iterate over all training samples. Second, the first term in Equation 2 assumes that $\phi(g_y, x)$ is the ground-truth explanation. However, there could be defects (e.g., noise) in $\phi(g_y, x)$, which makes it a poor target to fit. In addition, since we mainly care about the stability of explanation, specifying a concrete ground-truth may not be necessary.

Adversarial Training on Explanations (ATEX)
, where $H$ is the Hessian matrix and $H_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. If $f$ is simply a linear model, then $\phi$ is robust to any manipulation since the Hessian matrix is all-zero. However, a hard requirement to eliminate non-linearity in a deep model would reduce its prediction accuracy. The requirement could be relaxed as long as the explanation is stable according to the below definition.
Definition 1. We define the stability of explanation around an instance $x$ as: Different from the proposition in [12], we assume a positive scaling does not change explanation, as the relative importance between features is not changed. This is why a coefficient $\gamma$ is introduced here. The definition is compatible with the common metrics for explanation similarity such as Spearman correlation and top elements intersection [8,12]. One form of $f$ that has stable explanation locally around $x$ could be written as $f(x) = \sigma(\phi^\top x)$, where the weights are defined by the explanation vector and $\sigma: \mathbb{R} \to \mathbb{R}$ is a monotonically increasing non-linear function. We have $\phi(f, x) = \sigma'(\phi^\top x) \cdot \phi$.
Since $\sigma'(\phi^\top x)$ is a scalar, perturbing the input with $\Delta x$ only re-scales $\phi$, thus satisfying the definition above if we let $\gamma = \sigma'(\phi^\top x)$.
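The scaling argument can be checked numerically. The sketch below assumes the logistic function as the monotonically increasing $\sigma$ and uses finite differences, so it does not rely on the closed-form gradient; the vectors and the helper names are hypothetical.

```python
import math
import random

def f(x, phi):
    # f(x) = sigma(phi^T x), with sigma chosen as the logistic function (an assumption).
    return 1.0 / (1.0 + math.exp(-sum(p * xi for p, xi in zip(phi, x))))

def numerical_gradient(x, phi, h=1e-6):
    # Central finite differences of f with respect to each input feature.
    grad = []
    for d in range(len(x)):
        xp, xm = list(x), list(x)
        xp[d] += h
        xm[d] -= h
        grad.append((f(xp, phi) - f(xm, phi)) / (2 * h))
    return grad

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

random.seed(1)
phi = [random.uniform(-1, 1) for _ in range(4)]
x = [random.uniform(-1, 1) for _ in range(4)]
x_shifted = [xi + random.uniform(-0.3, 0.3) for xi in x]  # x + delta_x

# Both gradients point along phi; only their length differs.
cos_clean = cosine(numerical_gradient(x, phi), phi)
cos_shifted = cosine(numerical_gradient(x_shifted, phi), phi)
```

Both cosine similarities come out at 1 up to numerical error, so any rank-based or top-k comparison of the two explanations is unaffected by the perturbation.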
Considering the definition above, there are two factors to consider in algorithm design: (i) how to set the nonlinear function $\sigma$; (ii) how to regularize $f$ for stable explanation. The high-level idea of ATEX is illustrated in Figure 1. ATEX is a fine-tuning process given the target model $f$. The formal loss function of ATEX is: $\min_g \sum_{x \in X} J(g, f, x)$, where The first term is the distillation loss, and the second term could be seen as a regularizer. Given a seed instance $x \in X$ from the dataset, two additional sampling processes are conducted. In Equation 4, the outer summation generates a set of samples, denoted as $I(x)$, along the explanation direction of $x$.
That is, where $\delta_1$ denotes the shift distance, and $\Delta_1$ is a hyperparameter. To guarantee that we are sampling along a representative explanation direction on the prediction function surface, here we use SmoothGrad [31] to compute $\phi$ in order to remove noise. The inner summation generates samples, denoted as $P(x_i)$, along the perpendicular direction of explanation $\phi(f, x)$. Specifically, where $\phi^\perp$ denotes the perpendicular direction to $\phi$. To compute $\phi^\perp$, we first generate a random perturbation $u \sim U(0, \Delta_2)$, and Here $U$ denotes uniform distribution. The rationale behind moving samples along $\phi^\perp$ is that restricting these samples to have the same prediction as $f(x_i)$ implicitly requires the local explanation to be fixed at $\phi$, as shown in the right half of Figure 1.
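The two sampling rounds can be sketched as follows. This is only an illustration under stated assumptions: the shift distance is drawn uniformly from $[-\Delta_1, \Delta_1]$, and $\phi^\perp$ is obtained by removing from $u$ its component along $\phi$ (a Gram-Schmidt step). Neither choice is confirmed by the text above, and the helper names and the example vectors are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sample_along(x, phi, delta_max, rng):
    # First round: shift x along the normalized explanation direction.
    # Assumption: shift distance drawn uniformly from [-delta_max, delta_max].
    norm = math.sqrt(dot(phi, phi))
    delta = rng.uniform(-delta_max, delta_max)
    return [xi + delta * p / norm for xi, p in zip(x, phi)]

def sample_perpendicular(x_i, phi, delta_max, rng):
    # Second round: draw u ~ U(0, delta_max) per feature, then remove its
    # component along phi so the remaining step is orthogonal to phi
    # (an assumed construction of phi_perp).
    u = [rng.uniform(0.0, delta_max) for _ in x_i]
    coef = dot(u, phi) / dot(phi, phi)
    phi_perp = [ui - coef * p for ui, p in zip(u, phi)]
    return [xi + pp for xi, pp in zip(x_i, phi_perp)]

rng = random.Random(0)
x = [0.2, 0.5, 0.1, 0.9]
phi = [0.4, -0.3, 0.8, 0.1]   # stand-in for a SmoothGrad explanation of x
x_i = sample_along(x, phi, 0.1, rng)
x_p = sample_perpendicular(x_i, phi, 0.1, rng)
```

During fine-tuning, the new model $g$ would then be asked to predict the same value at `x_p` as the original model does at `x_i`, which discourages the prediction from changing along directions orthogonal to $\phi$.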

Explanation Stability vs Adversarial Training Efficacy
One of the best known adversarial training methods is robust optimization [22]. The goal is to approximately solve: $\min_f \mathbb{E}[\max_{x' \in N(x, \epsilon)} L(f(x'), y)]$. The inner maximization problem is usually solved through attacking algorithms such as FGSM [13] and PGD [17], where $x'$ can be seen as the most threatening adversarial sample as it maximizes the loss. The outer problem trains model parameters to minimize the loss.
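To make the min-max structure concrete, the sketch below runs one round of it on a binary logistic model: a single FGSM step approximates the inner maximization, and a single gradient-descent step performs the outer minimization. The model, the weights, the step sizes, and the function names are illustrative assumptions rather than the setup used in the experiments.

```python
import math

def predict(w, x):
    # Binary logistic model p(y = 1 | x).
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def loss(w, x, y):
    # Cross-entropy loss for a label y in {0, 1}.
    p = predict(w, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, x, y, eps):
    # Inner maximization (one step): x' = x + eps * sign(grad_x L),
    # where grad_x L = (p - y) * w for this model.
    g = [(predict(w, x) - y) * wi for wi in w]
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]

def adversarial_step(w, x, y, eps, lr):
    # Outer minimization (one step): gradient descent on the loss at x',
    # where grad_w L = (p' - y) * x'.
    x_adv = fgsm(w, x, y, eps)
    p = predict(w, x_adv)
    w_new = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    return w_new, x_adv

w = [0.8, -0.5, 0.3]
x, y = [0.6, 0.1, 0.4], 1
w_new, x_adv = adversarial_step(w, x, y, eps=0.1, lr=0.1)
```

The update only lowers the loss at the single point `x_adv`; nothing constrains the loss at other points of $N(x, \epsilon)$, which is the gap discussed next.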

One issue with the above method is that simply defending against the most threatening adversarial sample is not enough to guarantee prediction robustness. First, other adversarial samples, although leading to smaller losses, could still exist. Second, more adversarial samples could be discovered by using different attacking algorithms. An illustration of such a risk is shown in the lower part of the right figure. Suppose $x'$ is the adversarial sample obtained by perturbing $x$. A new decision boundary is learned via a certain defense method, so that $x'$ can no longer fool model prediction. However, it is still possible to perturb $x$ towards other directions (e.g., to $x''$). This prediction is also under the risk of having its explanation manipulated, as shown in the upper part of the figure. A relation between explanation and adversarial perturbation can be proven as below:
Theorem 1. Given a data instance $x_0$, let explanation $\phi(f_c, x_0)$ be defined using vanilla gradient [30], and adversarial perturbation $\delta$ be crafted using FGSM [13] without the additional sign() operation, then we have $\phi(f_c, x_0) \propto -\delta$. The proof can be found in supplementary material.
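Proof. According to [30], $f_c(x_0)$ is explained via linear approximation by computing its first-order Taylor expansion: $f_c(x) \approx f_c(x_0) + w_c^\top (x - x_0)$, where $\phi(f_c, x) = w_c = \nabla_x f_c(x_0)$. On the other hand, in FGSM [13], let $L(f(x_0), y)$ be the cross entropy loss, and the target label be $c$, then $\delta = \epsilon \nabla_x L(f(x_0), y) = -\epsilon \frac{1}{f_c(x_0)} \nabla_x f_c(x_0)$, where $\frac{1}{f_c(x_0)}$ is a scalar. Therefore, we have $\phi(f_c, x) \propto -\delta$.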
Therefore, if a prediction $f_c(x)$ does not have a stable explanation, then this prediction could potentially be attacked towards multiple directions, thus requiring more iterations of adversarial training. In experiments, we will show that ATEX could improve the efficacy of adversarial training in each iteration.
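Theorem 1 can be sanity-checked numerically. The sketch below assumes a linear softmax classifier with hand-picked weights (hypothetical values), takes $f_c$ to be the softmax probability of class $c$, and uses the cross-entropy loss $-\log f_c(x)$. Gradients are estimated with finite differences so that the check does not presuppose the result.

```python
import math

def softmax_model(W, x):
    # f(x) = softmax(W x): class probabilities of a hypothetical linear classifier.
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad(fn, x, h=1e-6):
    # Central finite-difference gradient of a scalar function fn at x.
    out = []
    for d in range(len(x)):
        xp, xm = list(x), list(x)
        xp[d] += h
        xm[d] -= h
        out.append((fn(xp) - fn(xm)) / (2 * h))
    return out

W = [[0.5, -1.0, 0.3], [-0.2, 0.4, 0.9], [1.1, 0.2, -0.6]]
x0, c = [0.3, -0.7, 0.5], 0

# Explanation: gradient of the class-c probability.
phi = grad(lambda x: softmax_model(W, x)[c], x0)
# FGSM direction without sign(): gradient of the cross-entropy loss.
delta = grad(lambda x: -math.log(softmax_model(W, x)[c]), x0)

dot = sum(p * d for p, d in zip(phi, delta))
cosine = dot / (math.sqrt(sum(p * p for p in phi)) * math.sqrt(sum(d * d for d in delta)))
ratio = [d / p for p, d in zip(phi, delta)]  # each entry approximates -1 / f_c(x0)
```

The cosine similarity is -1 up to numerical error, and every entry of `ratio` matches $-1/f_c(x_0)$, in line with the proportionality stated in the theorem.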

Experiments
The experimental results here demonstrate the efficacy of ATEX on several aspects. Specifically, in Section 4.
• Metrics for Interpretation Similarity. Following the settings in [12], we consider three metrics for quantifying the similarity between two feature importance maps. To measure statistical similarity, we use Spearman's rank order correlation, which compares the rankings of feature importance in the two maps, and Top-k intersection, which measures the size of the intersection of the k most important features. For visual similarity, we adopt the Structural Similarity Index (SSIM), which measures the perceptual difference between two similar images.
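The two statistical metrics can be sketched in a few lines of plain Python. This is an illustrative implementation, not the evaluation code used here: ties are broken by index rather than averaged, the top-k overlap is normalized by k (an assumption), and the two toy importance maps are made up. SSIM is omitted because it needs an image-processing library.

```python
def ranks(values):
    # Rank positions (0 = smallest). Ties are broken by index, a simplification.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    # Pearson correlation of the rank vectors of two flattened importance maps.
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2.0
    cov = sum((u - mean) * (v - mean) for u, v in zip(ra, rb))
    var = sum((u - mean) ** 2 for u in ra)  # identical for ra and rb
    return cov / var

def top_k_intersection(a, b, k):
    # Fraction of the k most important features shared by both maps.
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / k

clean = [0.9, 0.1, 0.4, 0.7, 0.2, 0.05]       # toy explanation of a clean input
manipulated = [0.3, 0.8, 0.5, 0.6, 0.1, 0.0]  # toy explanation after manipulation
rho = spearman(clean, manipulated)
overlap = top_k_intersection(clean, manipulated, k=3)
```

For these toy maps the rank correlation is about 0.257 and two of the three top features are shared; a successful defense keeps both numbers close to 1 under attack.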

Defense Performance Against Explanation Manipulation Attack
In this section, we conduct experiments to measure the interpretation stability of models after applying ATEX. To manipulate explanations, we adopt the two explanation attack approaches introduced in Section 2.2. For the targeted attack, we increase the model's attention in a predefined region with a size of 5×5 pixels, whose location is determined randomly at runtime. For the untargeted attack, we suppress the contribution of the 50 most important pixels in original samples. Due to the piecewise-linear property [12] of deep models that use ReLU as the activation function, attacking methods that rely on Hessian matrices will not work since second-order gradients are zero. Hence, in this work, we replace ReLU activation with smoothed softplus activation when training models, so [8] can be seen as the baseline method. Subsequent steps, such as generating explanations, generating manipulation samples, and applying defense, are all implemented on softplus-activated models.
Results are summarized in Tables 1 to 4. Compared with the baseline method, we see that ATEX improves the stability of interpretation, in terms of both Rank Correlation and Top-k Intersection metrics. The relative improvement is more significant as the attack magnitude $\epsilon_1$ increases. A larger $\epsilon_1$ means a greater manipulation range ($\Delta_1$ and $\Delta_2$ are set to be equal to $\epsilon_1$). The model prediction accuracy is slightly affected on FashionMNIST, but remains consistent on MNIST.

Qualitative Assessment of Explanation
In this part, we show that ATEX helps reduce noise in interpretation feature maps, even when we only use vanilla gradient [30] as the interpretation method. We choose SmoothGrad [31] as the reference method, because SmoothGrad can reduce the noise in sensitivity maps, and we use SmoothGrad to provide the direction to generate $x_i$ in ATEX. In our experiment, we run SmoothGrad on normally trained models without applying ATEX. Specifically, we add pixel-wise Gaussian noise to 100 copies of each test image and compute the average of vanilla gradients to get feature maps. In comparison, after running ATEX for 5 iterations, we use vanilla gradient to produce feature importance maps directly for test images. The baseline feature maps are obtained by vanilla gradient on normally trained models. We expect the interpretation results of ATEX to be more similar to SmoothGrad than baseline results. This is validated in Figure 2, as ATEX achieves higher SSIM scores than the baseline results. We also show the explanation results in Figure 3. We observe that the noise level is significantly reduced in the feature maps after applying ATEX training to models, even though we only use vanilla gradient to generate feature maps. It thus indicates that models trained with ATEX are more focused on the objects in the input.
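The SmoothGrad reference procedure described above (average the vanilla gradient over 100 noisy copies of an input) can be sketched as follows. The noise level `sigma`, the `grad_fn` interface, and the toy gradient are assumptions made for illustration; a real model's gradient function would be passed in instead.

```python
import random

def smoothgrad(grad_fn, x, n_samples=100, sigma=0.1, seed=0):
    # Average the vanilla gradient over noisy copies of x.
    # sigma is a hypothetical noise level; the text only fixes n_samples = 100.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total = [t + g for t, g in zip(total, grad_fn(noisy))]
    return [t / n_samples for t in total]

def toy_grad(x):
    # Toy gradient used only to exercise the function:
    # f(x) = sum(x_i ** 2), so grad f(x) = 2 * x.
    return [2.0 * xi for xi in x]

x = [0.5, -0.2, 0.8]
smooth_map = smoothgrad(toy_grad, x)
```

Each SmoothGrad map costs `n_samples` gradient evaluations per input, whereas a model fine-tuned with ATEX is evaluated with a single vanilla-gradient pass.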

Efficacy of Adversarial Training After Applying ATEX
We now investigate the correlation between explanation stability and adversarial training efficacy. Our analysis in Section 3 suggests that stability in explanation can potentially improve the efficacy of adversarial training. In this experiment, given a pretrained classifier, we run ATEX for several iterations. After each iteration, to evaluate the efficacy of adversarial training, we further fine-tune the classifier with adversarial training and then evaluate the robustness of the resultant model against a new round of attack. We adopt FGSM to generate adversarial samples for both adversarial training and the subsequent attack. The attack step length is $\epsilon = 0.1$. For the adversarial training, we generate 50,000 FGSM attack samples from training data and combine them with original training data to fine-tune the model. Results are shown in Figure 4. The x-axis denotes the number of iterations of ATEX, where iteration = 0 means pure adversarial training without using ATEX. From the figures, we observe that as we run more iterations of ATEX, the performance of adversarial training also increases. It indicates that ATEX reduces the potential weakness contained in models.

Related Work
Model explanations can generally be defined as the information that helps people understand model behaviors. Typically, such useful information consists of significant features that contribute a lot to model predictions. To effectively extract explanations from models, there are two major methodologies, where the first category is based on instance perturbation [25] and the second is based on gradient information [3]. As for the first category, LIME [25] is a representative method, utilizing shallow linear models to approximate the local model behaviors with feature importance scores. Further, SHAP [21] unifies and generalizes the perturbation-based method with the aid of cooperative game theory, where each feature is assigned a Shapley value for explanation purposes. Some other important methods within this category can also be found in [4,7,27]. As for the second category of methods, explanations are mainly extracted and calculated according to the model gradients. Representative methods can be found in [28,6,34,29,32], where gradients are used as an indicator for feature sensitivity towards model predictions. In this work, we specifically focus on the second category of methods for generating explanations, and aim to make the gradient-based explanations more robust and stable.
Although model explanations are useful, they can be fragile and easily manipulated under certain circumstances. In [11], the authors showed that gradient-based explanations can be sensitive to imperceptible perturbations of images, which could lead to unstructured changes in the generated salience maps. One of the approaches proposed in [16] utilized a constant shift on the target instance to manipulate the explanation salience map, where the biases of the neural network are also changed to fit the original prediction. Besides, parameter randomization [1] and network fine-tuning [15] are also effective approaches in manipulating explanations. To effectively handle such issues, robust and stable explanations are preferred for model interpretability. In [37], the authors rigorously define two concepts for generating smooth explanations (i.e., fidelity and sensitivity), and further propose to optimize these metrics for robust explanation generation. Also, the authors in [8,12] replace the common ReLU activation function with the softplus function, aiming to smooth the explanations during the model training process. Moreover, utilizing the Lipschitz constant of the explanations to locally lower the sensitivity to small perturbations is another valid methodology to improve the explanation robustness [2,23]. Our work specifically focuses on the model training perspective for explanation stability under a relatively general setting.
Besides manipulation over interpretation, a more well-studied domain of machine learning security is adversarial attack and defense on model prediction. Adversarial attack on model prediction refers to perturbing the input in order to change the model's prediction, even though most of the attacks cannot be perceived by humans [13,35]. Adversarial attacks can be grouped into different categories according to the threat model, including untargeted attack vs targeted attack [5], one-shot attack vs iterative attack [17], data dependent vs universal attack [24], and perturbation attack vs replacement attack [36]. Considering such a relation between model explanation and adversarial attack, our work also discusses the potential benefit that explanation stability brings to the target model.

Conclusion
Despite their unique role in improving transparency for neural networks, interpretation methodologies have recently been shown to be vulnerable to manipulation. That is, malevolent users could slightly perturb the input to change its interpretation result while maintaining the prediction output. In this work, we propose a new training method called ATEX, which tries to improve model interpretation robustness against manipulation on input. ATEX does not explicitly control interpretation, but implicitly regularizes it by controlling the predictions around training samples. We also show that interpretation stability is closely related to the potential efficacy of adversarial training, since adversarial attack direction has a strong relation to interpretation. Through experiments, we show that ATEX could stabilize interpretation of model predictions. ATEX also reduces noise in feature importance maps, similar to SmoothGrad, even when the maps are obtained with vanilla gradient. In addition, ATEX boosts the efficacy of adversarial training.
Future work could investigate how to detect manipulated inputs instead of retraining models, which would be more efficient, especially on large datasets. Another interesting direction is how to improve training with augmented data so that the prediction accuracy on clean samples will not decrease.

Figure 2: Quantitative evaluation of interpretation smoothness effect.

Figure 3: Gradient explanation maps produced from the original network and the network trained with ATEX. Three images form a case, which consists of an input, a gradient explanation from the original network, and a gradient explanation from the ATEX-trained network.

Figure 4: Efficacy of adversarial training after applying ATEX.

Table 2: Defense against targeted explanation manipulation on FashionMNIST.

Table 4: Defense against targeted explanation manipulation on MNIST.
4.1 Experiment Settings
• Datasets. We conduct our experiments on the Fashion-MNIST dataset and the MNIST dataset. Fashion-MNIST consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28 × 28 gray-scale image with a label from 10 categories. Image pixels of all examples are normalized to the [0, 1] range. The classification model has two convolutional layers and