METHODS article
Front. Comput. Sci.
Sec. Computer Security
Volume 7 - 2025 | doi: 10.3389/fcomp.2025.1631561
This article is part of the Research Topic: Enhancing AI Robustness in Cybersecurity: Challenges and Strategies
MeetSafe: Enhancing Robustness against White-Box Adversarial Examples
Provisionally accepted
- 1 Delft University of Technology, Delft, Netherlands
- 2 University of Padua, Padua, Italy
- 3 University of Greenwich, London, United Kingdom
Convolutional neural networks (CNNs) are vulnerable to adversarial attacks in computer vision tasks. Current adversarial detection methods are ineffective against white-box attacks and inefficient when deep CNNs generate high-dimensional hidden features. This paper proposes MeetSafe, an effective and scalable adversarial example (AE) detection method against white-box attacks. MeetSafe identifies AEs using critical hidden features rather than the entire feature space. We observe a non-uniform distribution of Z-scores between clean samples and AEs among hidden features, and propose two utility functions to select the features most relevant to AEs. We process the critical hidden features with three feature engineering methods: local outlier factor (LOF), feature squeezing, and whitening, which respectively estimate a feature's density relative to its k-neighbors, reduce redundancy, and normalize features. To cope with the curse of dimensionality and to smooth statistical fluctuations in high-dimensional features, we propose local reachability density (LRD). LRD iteratively selects a bag of engineered features of random cardinality and quantifies their average density via k-nearest neighbors. Finally, MeetSafe fits a Gaussian Mixture Model (GMM) to the processed features and flags a sample as an AE if it appears as a local outlier, indicated by a low density under the GMM. Experimental results show that MeetSafe achieves 74%, 96%, and 79% detection accuracy against adaptive, classic, and white-box attacks, respectively (see Tab. 2-4), and runs at least 2.3× faster than comparison methods (see Fig. 2(b)).
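To make the detection pipeline concrete, the following is a minimal, hypothetical sketch of the general idea described in the abstract: select "critical" hidden features via a Z-score gap between clean and adversarial batches, whiten them, fit a GMM on clean features, and flag samples whose GMM log-density falls below a clean-calibrated threshold. The synthetic data, the top-k selection rule, and the percentile threshold are all illustrative assumptions, not the paper's actual utility functions, LRD procedure, or evaluation setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical hidden features: clean samples cluster near the origin,
# adversarial ones drift away (stand-in for real CNN activations).
clean = rng.normal(0.0, 1.0, size=(500, 8))
adv = rng.normal(3.0, 1.0, size=(50, 8))

# Step 1 (simplified feature selection): keep the dimensions where the
# mean |Z-score| of the adversarial batch diverges most from clean stats.
mu, sigma = clean.mean(axis=0), clean.std(axis=0)
z_gap = np.abs((adv - mu) / sigma).mean(axis=0)
critical = np.argsort(z_gap)[-4:]  # top-4 "critical" features (assumed k)

# Step 2: whiten the selected clean features and fit a GMM density model.
sel = (clean[:, critical] - mu[critical]) / sigma[critical]
gmm = GaussianMixture(n_components=2, random_state=0).fit(sel)

# Step 3: calibrate a density threshold on clean data (1st percentile,
# an illustrative choice) and flag low-density samples as adversarial.
threshold = np.percentile(gmm.score_samples(sel), 1)

def is_adversarial(x):
    """Return True if x's GMM log-density is below the clean threshold."""
    xs = (x[critical] - mu[critical]) / sigma[critical]
    return gmm.score_samples(xs.reshape(1, -1))[0] < threshold

adv_rate = np.mean([is_adversarial(x) for x in adv])
fp_rate = np.mean([is_adversarial(x) for x in clean])
print(f"detected AEs: {adv_rate:.0%}, clean false positives: {fp_rate:.0%}")
```

On this toy data the detector separates the two populations almost perfectly; the paper's full method additionally applies LOF, feature squeezing, and LRD over random feature bags, which this sketch omits.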
Keywords: Adversarial attack, Convolutional Neural Network, Gaussian mixture model, Adversarial example, Local reachability density
Received: 19 May 2025; Accepted: 16 Jul 2025.
Copyright: © 2025 Stenhuis, Liu, Qiao, Conti, Panaousis and Liang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Dazhuang Liu, Delft University of Technology, Delft, Netherlands
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.