Abstract
Real-time weapon detection in video surveillance is a critical capability for artificial intelligence assisted security systems, particularly in scenarios constrained by low latency, limited computational resources, and the strict power efficiency requirements typical of edge artificial intelligence deployments. This work presents a comparative analysis of lightweight YOLO based object detectors, namely YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, against the recently introduced YOLOv26s model. In contrast to conventional benchmarking studies, the evaluation is extended to real-world edge deployment conditions on an NVIDIA Jetson Nano device, explicitly measuring end-to-end latency across the preprocessing, inference, and post-processing stages. While earlier YOLO variants relied primarily on convolutional neural network architectures, with intermediate explorations such as attention centered designs aimed at improving detection accuracy, YOLOv26 represents a paradigm shift: it is designed from the ground up for low power edge devices, emphasizing architectural simplicity and deployment efficiency. To ensure a fair and reproducible evaluation, all models are trained on the same weapon detection dataset under a unified experimental protocol using small scale variants. The experimental results reveal that, despite exhibiting comparable inference times, the models differ markedly in real-time performance due to variations in post-processing complexity. Specifically, YOLOv8s, YOLOv9s, and YOLOv11s incur substantial post-processing overhead, whereas YOLOv10s and YOLOv26s produce compact output representations that drastically reduce post-processing cost. This leads to a clear separation in deployment behavior, with end-to-end latency reduced from approximately 300 ms to 125–130 ms, effectively doubling the achievable frame rate on embedded hardware. Rather than proposing a universal ranking, the study analyzes the trade offs introduced by architectural evolution and optimization strategies, providing technical criteria to support model selection under resource constrained deployment scenarios and demonstrating that post-processing efficiency, rather than inference speed alone, is the dominant factor in real-time edge performance.
1 Introduction
Video surveillance has become an essential component of modern security systems in both public and private environments, supporting crime prevention, situational awareness, and post event analysis. Nevertheless, conventional closed circuit television systems still rely heavily on continuous human monitoring, which introduces limitations related to operator fatigue, subjective interpretation of scenes, and delayed responses to critical events (Yadav and Yadav, 2025). In this context, the automatic detection of bladed weapons and firearms from video streams is of particular relevance, as early identification can support timely intervention and risk mitigation in high threat scenarios (Keerthana and Yadav, 2024).
Early attempts to automate crime and weapon detection in surveillance footage demonstrated the feasibility of detecting guns and knives in real-time video streams, but these systems typically relied on computationally demanding backbones and multi stage pipelines, limiting scalability for practical deployments (Navalgund and Priyadharshini, 2018). More recently, surveys and systematic reviews have consolidated evidence that convolutional neural networks consistently outperform classic hand crafted feature based methods for weapon identification in images and video sequences, even under adverse conditions such as partial occlusions, illumination changes, and complex backgrounds (Debnath and Debnath, 2021; Sandhu and Bhatia, 2025; Santos et al., 2024).
Despite these accuracy improvements, real world deployment remains challenging due to hardware limitations and the need to balance detection performance with computational cost. Beyond YOLO based solutions, lightweight detectors such as single shot detector MobileNet variants have been explored to reduce inference overhead, highlighting the importance of compact architectures for resource constrained surveillance scenarios (Salim and Che, 2023). In parallel, edge computing has emerged as a key paradigm for next generation surveillance systems by shifting inference closer to the data source, thereby reducing latency and dependence on centralized infrastructures (Rokade et al., 2025; Burnayev et al., 2023).
Within this context, the YOLO (You Only Look Once) family of object detectors has gained widespread adoption in video surveillance applications due to its favorable balance between detection accuracy and inference speed. Early YOLO based weapon detection systems demonstrated feasibility in surveillance environments (Warsi et al., 2019), while subsequent studies emphasized the importance of dataset composition and domain specific data collection strategies for improving robustness in real-time security applications (Bazan et al., 2024). More recent works have extended weapon detection pipelines to newer compact YOLO generations, including YOLOv9 and YOLOv11, demonstrating their applicability in surveillance scenarios under practical computational constraints (Dey and Dey, 2024). In parallel, application oriented systems have validated the effectiveness of modern YOLO based frameworks for weapon detection in operational security contexts (Hsueh and Yang, 2025; Golande et al., 2025).
Complementary benchmark studies have investigated YOLO performance under edge artificial intelligence constraints, showing that compact model configurations provide a more favorable balance between detection effectiveness and computational cost in real-time video surveillance deployments (Intagorn et al., 2025; Berardini et al., 2024). Motivated by these observations, this paper presents a comparative analysis of compact YOLO architectures, focusing on YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, evaluated against the recently introduced YOLOv26s model under a unified and reproducible experimental protocol for weapon detection. Unlike prior studies that prioritize larger scale models or heterogeneous configurations, the present work deliberately concentrates on small variants to reflect realistic deployment conditions on resource constrained platforms. Rather than identifying a universally best performing detector, the analysis aims to characterize the trade offs between detection accuracy, architectural complexity, and computational efficiency, providing technical criteria to support informed model selection for edge based intelligent video surveillance systems.
However, most existing comparative studies focus primarily on model level metrics such as accuracy, precision, recall, or inference time, while overlooking the impact of the complete detection pipeline on real-world deployment performance. In practical edge scenarios, object detection systems operate as multi-stage pipelines, where preprocessing, inference, and post-processing jointly determine the overall latency and responsiveness of the system. In particular, the role of post-processing operations, such as candidate filtering and suppression mechanisms, remains underexplored in the literature, despite their significant computational cost in real-time applications. As a result, models with similar inference performance may exhibit substantially different end-to-end behavior when deployed on resource-constrained hardware.
To address this gap, this work extends the conventional benchmarking perspective by incorporating a deployment-oriented evaluation on an NVIDIA Jetson Nano device. The study explicitly measures end-to-end latency and analyzes the contribution of each stage of the detection pipeline, enabling a more realistic assessment of model performance in edge-based surveillance systems. Furthermore, the experimental analysis reveals the existence of two distinct categories of object detectors based on their output representation and post-processing requirements: (i) models relying on dense predictions with costly post-processing, and (ii) models producing compact detection outputs that significantly reduce post-processing overhead. This distinction provides new insights into the factors that govern real-time performance beyond traditional accuracy-based comparisons.
2 State of the art
2.1 Weapon detection using deep learning
Automatic weapon detection in video surveillance systems has attracted increasing attention due to the growing demand for intelligent security solutions capable of operating in complex and uncontrolled environments. Comprehensive surveys and systematic reviews consistently report that deep learning based approaches, particularly convolutional neural networks, significantly outperform traditional hand crafted feature based methods for weapon detection in both images and video streams (Bhandari et al., 2025; Chaware et al., 2025). These advantages are especially evident under challenging conditions such as partial occlusions, illumination variations, cluttered backgrounds, and viewpoint changes.
Despite the promising results reported on curated image datasets, multiple studies indicate that detection performance degrades when models are deployed on real closed circuit television footage. Experimental analyses and systematic reviews show that surveillance videos are affected by intrinsic visual degradations, including low spatial resolution, motion blur, compression artifacts, and suboptimal camera viewpoints, which collectively hinder the reliable detection of small or partially occluded objects such as weapons (Murugan et al., 2025; Beca et al., 2026). Furthermore, evidence from real world surveillance applications highlights that these factors introduce domain shifts that are not adequately captured by web based imagery or laboratory controlled datasets (Duggimpudi et al., 2026). These findings emphasize the importance of evaluating detection models under realistic surveillance conditions to obtain deployment relevant performance estimates.
Recent research also highlights the decisive role of dataset composition and annotation quality in the reliability of weapon detection systems. Empirical studies demonstrate that class imbalance and background bias can significantly degrade detection performance, particularly for rare or visually ambiguous classes, leading to increased false positive rates across application domains (Khanam et al., 2025). In parallel, advances in hard negative mining show that visually similar but semantically distinct objects represent a major source of false detections if not explicitly addressed during training (Li et al., 2025). Consequently, careful dataset curation, consistent annotation protocols, and explicit treatment of hard negatives are widely recognized as essential practices for improving robustness in safety critical detection scenarios.
2.2 YOLO architectures in video surveillance
Among one stage object detectors, YOLO (You Only Look Once) architectures have been widely adopted in video surveillance and security related applications due to their favorable balance between detection accuracy and real-time inference speed. Early studies focusing on weapon detection in surveillance scenarios demonstrated that YOLO based models can achieve competitive accuracy while enabling real-time operation (Mukto et al., 2024). Subsequent evaluations under challenging surveillance conditions, such as low illumination and visually degraded scenes, further highlighted the robustness and computational efficiency of YOLO architectures compared to heavier two stage approaches (Al-Refai et al., 2025; Liang et al., 2025).
The introduction of YOLOv8 marked a notable evolution within the YOLO family, as multiple studies reported improvements in training stability, architectural modularity, and computational efficiency, facilitating deployment in practical computer vision applications and resource constrained environments (Farhan et al., 2025; Krishna and Poonkodi, 2026). Building upon this foundation, YOLOv9 and YOLOv10 introduced additional architectural refinements aimed at improving feature representation and optimization efficiency, and have been predominantly evaluated in domains such as traffic monitoring, intelligent transportation systems, and industrial inspection (Chaman et al., 2026). More recently, YOLOv11 has been included in homogeneous benchmarking studies alongside previous YOLO generations, demonstrating competitive trade offs between detection accuracy, inference speed, and computational cost across real-time detection tasks (da Luz et al., 2026).
However, despite these advances, systematic and domain specific evaluations of YOLOv8, YOLOv9, YOLOv10, and YOLOv11 for weapon detection in real world video surveillance scenarios remain limited. Moreover, most existing studies focus on a single YOLO version, and direct comparisons across multiple YOLO generations under identical datasets and training protocols are still scarce in the weapon detection literature.
2.3 Comparative studies and performance analysis
Recent advances in weapon detection have increasingly focused on optimizing both detection accuracy and computational efficiency under real-world surveillance constraints. In this context, several studies have conducted comparative evaluations using specialized weapon detection datasets and application-oriented scenarios.
A notable contribution is the YOLO-GTWDNet model, a lightweight YOLOv8-based architecture incorporating a GhostNet backbone and a transformer-based neck for enhanced feature representation (Zhang et al., 2019). This model was evaluated on the Weapon7 dataset, which includes multiple weapon classes such as guns, knives, and blunt weapons collected from heterogeneous sources. Experimental results demonstrate that the proposed architecture outperforms conventional YOLO-based models in both quantitative metrics and qualitative detection robustness, particularly under challenging conditions such as small object size, occlusion, and illumination variability. These findings highlight the impact of integrating lightweight feature extractors and attention mechanisms to improve detection performance while maintaining real-time capability.
Complementary research has explored weapon detection in non-conventional imaging modalities. For instance, thermal imaging-based approaches have been proposed to detect concealed firearms in low-visibility environments (Kang et al., 2017). In this work, a hybrid deep learning framework combining convolutional neural networks and YOLOv3 achieved an F1-score of 0.84 and a mean Average Precision (mAP) of 0.95, with inference times of approximately 10 ms per frame. This demonstrates that multimodal sensing can significantly enhance detection reliability in scenarios where traditional RGB-based surveillance may fail.
In addition to architectural innovations, comparative studies have evaluated different object detection frameworks for security applications. Howard et al. (2019) conducted a systematic comparison of YOLO, SSD, and Faster R-CNN for crime-related object detection, including firearms and bladed weapons. Their results indicate that YOLO-based models provide the best trade-off between detection accuracy and inference speed, making them particularly suitable for real-time surveillance systems. Specifically, YOLOv5 was selected due to its ability to maintain high mAP while significantly reducing computational latency compared to two-stage detectors.
From a performance evaluation perspective, these studies consistently emphasize the importance of multiple metrics beyond mean Average Precision. Precision, recall, F1-score, and inference latency are critical indicators in safety-critical environments. In particular, minimizing false positives is essential to avoid alarm fatigue and ensure system reliability in operational deployments. Reported results across weapon detection systems typically achieve precision values above 90% and recall values between 80% and 95%, depending on the dataset complexity and environmental conditions.
In this context, datasets specifically designed for weapon detection, such as those proposed by Pérez-Hernández et al. (2020), play a crucial role by incorporating hard negative samples that include objects visually similar to weapons (e.g., mobile phones, tools, or everyday items). These hard negatives are essential for improving model robustness and reducing false positive rates, particularly in real-world surveillance scenarios where visual ambiguity is common.
Despite these advances, the literature still lacks standardized benchmarking protocols for weapon detection. Variations in dataset composition, annotation quality, and evaluation procedures limit the direct comparability of results across studies. This highlights the need for unified evaluation frameworks that incorporate both accuracy and efficiency metrics under realistic surveillance conditions.
2.4 Deployment considerations
While detection accuracy remains a primary objective in most studies, relatively few works explicitly address deployment constraints associated with edge artificial intelligence platforms. Factors such as model size, memory footprint, power consumption, and inference latency are frequently overlooked, despite their decisive importance for real-time video surveillance systems deployed at the edge (Zhang et al., 2019; Kang et al., 2017).
Recent edge-oriented investigations demonstrate that compact detector variants often achieve more favorable accuracy efficiency trade offs compared to larger models when deployed on resource constrained hardware. These findings highlight the necessity of jointly considering detection performance and computational efficiency when selecting models for edge based surveillance applications (Howard et al., 2019; Liu et al., 2020).
2.5 Limitations of existing work and motivation
In summary, the existing literature exhibits several limitations. Many studies rely on heterogeneous datasets, inconsistent training configurations, or non uniform evaluation metrics, which hinder objective comparison between detection models (Liu et al., 2020). Furthermore, only a limited number of works provide systematic comparisons across multiple YOLO generations under strictly controlled experimental conditions, particularly in the context of weapon detection in video surveillance environments (Murugan et al., 2025).
These limitations motivate a comprehensive comparative analysis of compact YOLO architectures trained and evaluated under homogeneous conditions. In particular, the absence of dedicated comparisons between established compact models, such as YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, and newer edge-oriented designs such as YOLOv26s represents a clear gap in the literature. Addressing this gap through a unified experimental protocol enables a more objective assessment of performance trade offs and provides practical guidance for selecting detection models suitable for real world, edge based surveillance deployments.
3 Materials and methods
The proposed methodology is designed to enable a fair, transparent, and reproducible comparative evaluation of compact YOLO architectures for weapon detection, with particular emphasis on their suitability for edge-oriented deployment. The overall experimental workflow, encompassing dataset preparation, model training, and performance evaluation, is illustrated in Figure 1. All evaluated models, namely YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and the recently introduced YOLOv26s, were trained and evaluated using the same dataset, identical data partitions, and a unified experimental protocol. This design ensures that observed performance differences can be attributed primarily to architectural characteristics rather than variations in training conditions or computational resources.
Figure 1
To ensure consistency and reproducibility, all models were trained under identical conditions using a single NVIDIA Quadro RTX 4000 graphics processing unit. However, in order to evaluate real-world deployment performance, additional inference experiments were conducted on an NVIDIA Jetson Nano device, representing a resource-constrained edge computing platform. This dual setup allows separating training efficiency from deployment behavior and enables a realistic assessment of model performance in embedded environments.
3.1 Dataset
3.1.1 Dataset description
The experimental evaluation was conducted using the Sohas Weapon Detection Dataset, obtained from the OD WeaponDetection repository (Pérez-Hernández et al., 2020). This dataset contains annotated images depicting firearms and bladed weapons under diverse conditions representative of real world video surveillance scenarios, including variations in illumination, background complexity, object scale, and partial occlusions.
The dataset comprises a total of 5,859 annotated images, partitioned into training, validation, and test subsets. Specifically, 4,252 images were used for training, 750 images for validation, and 857 images for testing. Each image is associated with a corresponding annotation file in YOLO format, ensuring a one to one correspondence between images and labels across all subsets and enabling seamless integration with modern YOLO training pipelines. Table 1 presents the distribution of instances across all classes in the training dataset.
Table 1
| Class ID | Class name | Number of instances |
|---|---|---|
| 0 | Pistol | 1,510 |
| 1 | Smartphone | 715 |
| 2 | Knife | 2,277 |
| 3 | Monedero (wallet) | 601 |
| 4 | Billete (banknote) | 477 |
| 5 | Tarjeta (card) | 279 |
Class distribution in the training dataset.
As observed, the dataset presents an imbalanced distribution across classes. This imbalance is intentional and reflects realistic surveillance conditions, where non-weapon objects are included as hard negative samples to improve the robustness of the detector against false positives.
In addition to weapon classes, namely pistol and knife, the dataset explicitly includes several non-weapon categories corresponding to visually similar handheld objects such as smartphones, wallets, banknotes, and cards, as well as a background class. These non-weapon categories act as hard negatives, deliberately introducing visually ambiguous samples that are known to increase false positive rates in surveillance based weapon detection systems.
The inclusion of hard negative classes enables a more realistic and safety oriented evaluation by assessing a model's ability to discriminate weapons from everyday objects that may resemble them under low spatial resolution, partial occlusion, or unfavorable viewing conditions. This characteristic is particularly relevant for real world deployments, where minimizing false alarms is often more critical than maximizing raw detection recall.
The original dataset was provided in an earlier YOLO format and did not include an explicit validation split. Therefore, a dedicated validation subset was created, along with the corresponding annotation files, to ensure compatibility with recent YOLO implementations and to support a fair and reproducible experimental protocol. The same fixed data partitions were used consistently across all experiments to prevent data leakage and to guarantee direct comparability among all evaluated YOLO architectures, including YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv26s.
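A minimal sketch of this partitioning step is shown below, assuming a conventional YOLO directory layout. The paths, image extension, and random seed are illustrative rather than the exact values used in this study; only the validation size of 750 images is taken from the reported split.

```python
# Minimal sketch of the validation-split construction described above,
# assuming a conventional YOLO directory layout. Paths, the image
# extension, and the random seed are illustrative assumptions.
import random
import shutil
from pathlib import Path

random.seed(0)  # fixed seed keeps the partition reproducible across runs

src_img, src_lbl = Path("sohas/train/images"), Path("sohas/train/labels")
val_img, val_lbl = Path("sohas/val/images"), Path("sohas/val/labels")
val_img.mkdir(parents=True, exist_ok=True)
val_lbl.mkdir(parents=True, exist_ok=True)

images = sorted(src_img.glob("*.jpg"))
random.shuffle(images)

for img in images[:750]:                 # validation size from Section 3.1.1
    lbl = src_lbl / f"{img.stem}.txt"    # one-to-one YOLO annotation file
    shutil.move(str(img), val_img / img.name)
    shutil.move(str(lbl), val_lbl / lbl.name)
```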
Given the safety critical nature of weapon detection, particular attention was paid to annotation consistency and to the presence of visually ambiguous samples, such as handheld objects that may resemble weapons under low resolution or partial occlusion. Any data augmentation techniques applied during training were used uniformly across all model variants to avoid introducing model specific biases. Representative annotated examples from the dataset are shown in Figure 2.
Figure 2
3.1.2 Data preprocessing
Prior to training, all images were resized to a fixed input resolution of 640 × 640 pixels while preserving the original aspect ratio through letterboxing, following the default configuration adopted by the evaluated YOLO architectures. This input resolution was selected as a balanced compromise between detection accuracy and computational efficiency, and was applied consistently across all experiments. Pixel values were normalized internally by the training framework, and all annotations were maintained in the standard YOLO format using normalized bounding box coordinates.
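The letterboxing operation can be sketched as follows. This mirrors, but does not reproduce, the internal Ultralytics preprocessing routine; the gray padding value of 114 is the commonly used framework convention.

```python
# Letterboxing sketch: scale to fit 640 x 640 while preserving aspect
# ratio, then pad the remainder. The gray value 114 follows the common
# Ultralytics convention for padding.
import cv2
import numpy as np

def letterbox(img: np.ndarray, size: int = 640) -> np.ndarray:
    h, w = img.shape[:2]
    scale = size / max(h, w)                        # fit the longer side
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh))             # cv2 expects (width, height)
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized  # center the resized image
    return canvas
```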
During training, data augmentation was applied exclusively to the training subset using the default augmentation strategies provided by the Ultralytics YOLO framework. These built in augmentations include random brightness and contrast adjustments, hue saturation value color space perturbations, horizontal flipping, and moderate geometric transformations, which are commonly employed to improve generalization and reduce overfitting in object detection tasks.
The intensity of all augmentation operations was kept at the framework default levels, deliberately avoiding aggressive transformations that could distort small objects or alter visually ambiguous patterns. This design choice is particularly relevant given the presence of hard negative classes in the dataset, where excessive augmentation could artificially modify the visual similarity between weapons and non-weapon objects, potentially biasing the learning process.
All preprocessing and augmentation strategies were applied uniformly across all evaluated models, including YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and the edge-oriented YOLOv26s. The validation and test subsets were left completely unmodified to ensure objective and consistent performance evaluation. By enforcing identical preprocessing pipelines for all model variants, the experimental design isolates architectural and optimization differences as the primary sources of performance variation, enabling a fair and reproducible comparison between established compact YOLO models and the newer edge optimized YOLOv26s.
3.2 Training configuration
All YOLO models were trained under a unified and controlled experimental setup to ensure a fair and reproducible comparison across different architectural generations. The same dataset splits, input resolution, and optimization strategy were applied consistently to all evaluated architectures, including YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and the edge-oriented YOLOv26s. This design isolates architectural and optimization differences as the primary sources of performance variation.
Each model was initialized from publicly available pretrained weights and fine tuned on the Sohas Weapon Detection Dataset for 50 epochs. The input image resolution was fixed at 640 × 640 pixels, and a batch size of 16 was used uniformly across all experiments. All training runs employed the default optimization strategy and loss configurations provided by the Ultralytics YOLO framework, without applying any model specific hyperparameter tuning or manual optimization.
To accelerate experimentation without introducing training bias, each model was trained independently using a single graphics processing unit. Although multiple GPUs were available on the server, no model utilized multi GPU or distributed training. Instead, different GPUs were assigned to different YOLO variants to enable parallel execution while preserving identical computational conditions per model. This approach ensures experimental fairness, reproducibility, and consistency across all evaluated architectures. The unified training configuration adopted for all models is summarized in Table 2.
Table 2
| Parameter | Value |
|---|---|
| Input resolution | 640 × 640 |
| Batch size | 16 |
| Epochs | 50 |
| Optimizer | Ultralytics default |
| Pretrained weights | Yes |
| Data augmentation | Ultralytics default |
| Hardware | NVIDIA Quadro RTX 4000 |
| Training strategy | Single GPU per model |
Unified training configuration used for all evaluated models.
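Under these settings, each training run reduces to a single framework call. The sketch below illustrates the unified configuration of Table 2; the dataset YAML path is a hypothetical placeholder, and the weight identifiers follow the public Ultralytics naming, which may differ for YOLOv26s.

```python
# Sketch of the unified training runs summarized in Table 2. The dataset
# YAML path is a hypothetical placeholder; the YOLOv26s run uses the same
# call with its corresponding pretrained weights.
from ultralytics import YOLO

for weights in ["yolov8s.pt", "yolov9s.pt", "yolov10s.pt", "yolo11s.pt"]:
    model = YOLO(weights)              # publicly available pretrained weights
    model.train(
        data="sohas_weapons.yaml",     # hypothetical dataset configuration file
        epochs=50,                     # Table 2: training duration
        imgsz=640,                     # Table 2: input resolution
        batch=16,                      # Table 2: batch size
        device=0,                      # single GPU per model
    )
```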
3.3 Evaluation protocol
Model performance was evaluated exclusively on a held out test subset that was not used during training or validation. This evaluation strategy provides an unbiased assessment of detection performance under consistent and reproducible conditions. All evaluations were conducted using a fixed input resolution of 640 × 640 pixels and identical inference settings across all YOLO variants, including YOLOv8s through YOLOv26s.
Inference was performed using a single GPU per model to maintain consistent computational conditions and to avoid variability introduced by multi GPU execution. No test time augmentation or post-processing beyond the default non maximum suppression implemented in the Ultralytics YOLO framework was applied. This ensures that observed performance differences reflect intrinsic model behavior rather than evaluation specific optimizations.
Given the safety critical nature of weapon detection, evaluation focused not only on overall detection accuracy but also on class wise behavior and false positive tendencies. Particular attention was paid to misclassifications involving visually similar non-weapon objects, referred to as hard negatives, such as smartphones, wallets, and cards, as these errors are especially detrimental in real world surveillance deployments.
All models were evaluated using the same test images and annotation files, ensuring that performance differences across YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv26s are directly attributable to architectural differences rather than dataset composition or evaluation protocol variations.
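A sketch of the test-set evaluation is given below. The checkpoint path is illustrative, split="test" assumes the dataset YAML declares the fixed test partition, and the metric attributes follow the Ultralytics validation API.

```python
# Sketch of the held-out test evaluation. The checkpoint path is a
# hypothetical placeholder; metric attributes follow the Ultralytics API.
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical checkpoint
metrics = model.val(data="sohas_weapons.yaml", split="test", imgsz=640)

print(metrics.box.mp, metrics.box.mr)  # mean precision and recall
print(metrics.box.map50)               # mAP@0.5
print(metrics.box.map)                 # mAP@0.5:0.95
```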
3.3.1 Edge deployment evaluation
In addition to conventional evaluation metrics, a deployment-oriented analysis was conducted using an NVIDIA Jetson Nano device to assess real-time performance under embedded conditions. Unlike traditional benchmarking approaches that focus solely on inference time, this evaluation measures the complete detection pipeline, including preprocessing, inference, and post-processing stages. This allows a more realistic characterization of system latency in practical surveillance scenarios.
To ensure consistency and reproducibility, all models were executed under identical runtime conditions, processing live video streams using a fixed input resolution of 640 × 640. A scenario-based evaluation protocol was adopted instead of a class-exhaustive approach. Three representative scenarios were defined: (i) weapon detection scenario (pistol), (ii) common object scenario prone to false positives (smartphone), and (iii) visually similar object scenario (wallet).
These scenarios capture the most critical behaviors of surveillance systems, including detection reliability, false alarm reduction, and robustness to ambiguous objects. For each scenario, the following metrics were recorded:
Preprocessing time (ms).
Inference time (ms).
Post-processing time (ms).
Total end-to-end latency (ms).
Frames per second (FPS).
This evaluation enables the identification of performance bottlenecks across different YOLO architectures and provides insights into their suitability for real-time edge deployment.
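The per-stage measurement can be sketched as follows. The per-frame speed dictionary (preprocessing, inference, and post-processing times in milliseconds) is exposed by the Ultralytics Results object; the camera index, checkpoint path, and 300-frame measurement window are illustrative assumptions.

```python
# Sketch of the per-stage latency measurement on the Jetson Nano. The
# camera index, checkpoint path, and measurement window are illustrative;
# the per-stage speed dictionary is part of the Ultralytics Results API.
import cv2
from ultralytics import YOLO

model = YOLO("best.pt")                  # hypothetical deployed checkpoint
cap = cv2.VideoCapture(0)                # live video stream
stages = {"preprocess": [], "inference": [], "postprocess": []}

while len(stages["inference"]) < 300:    # fixed measurement window
    ok, frame = cap.read()
    if not ok:
        break
    result = model.predict(frame, imgsz=640, verbose=False)[0]
    for stage in stages:
        stages[stage].append(result.speed[stage])  # per-stage time in ms

means = {stage: sum(t) / len(t) for stage, t in stages.items()}
total = sum(means.values())              # end-to-end latency per frame
print(means, f"total = {total:.1f} ms", f"FPS = {1000 / total:.2f}")
```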
3.4 Evaluation metrics
Model performance was assessed using a set of complementary metrics designed to capture both detection accuracy and operational reliability in safety critical video surveillance scenarios. Standard object detection metrics were combined with class wise and efficiency oriented indicators to enable a comprehensive comparison across YOLO architectures.
Precision, recall, and F1 score were computed to analyze the trade off between correct detections and false alarms. Precision is particularly relevant for weapon detection tasks, as false positives may trigger unnecessary alerts and reduce system trustworthiness in real world deployments. Recall measures the ability of a model to correctly identify weapon instances, while the F1 score provides a balanced assessment of both aspects.
Mean Average Precision was reported using standard intersection over union (IoU) thresholds, following widely adopted evaluation protocols in object detection benchmarks such as MS COCO (Lin et al., 2014). Specifically, mAP@0.5 was used to evaluate detection performance under a relaxed localization criterion, while mAP@0.5:0.95 reflects more stringent localization requirements. Although mean Average Precision is widely adopted for benchmarking, it does not fully capture the operational impact of false positives and is therefore interpreted alongside precision oriented metrics.
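For reference, the standard formulations underlying these metrics are reproduced below, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, and $AP_c$ is the average precision of class $c$ over the $C$ dataset classes:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} AP_c
```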
To further analyze model behavior, class wise performance and confusion matrices were examined, with particular attention to misclassifications involving hard negative classes. This analysis provides insight into the tendency of each architecture, including YOLOv26s, to confuse weapons with visually similar everyday objects.
Finally, inference efficiency was considered as an indirect indicator of suitability for edge-oriented deployment. While absolute latency depends on target hardware, relative inference behavior measured under identical conditions provides a consistent proxy for comparing computational efficiency across YOLO generations.
3.5 Weapon detection models
This study evaluates a set of compact one stage object detection architectures from the YOLO family, selected for their suitability in real-time video surveillance and edge-oriented deployments. All considered models follow the one stage detection paradigm, directly predicting bounding boxes and class probabilities in a single forward pass, which makes them particularly appropriate for time critical scenarios and resource constrained environments.
The experimental comparison focuses on four established compact YOLO generations, namely YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, which are treated as representative baselines of convolution based YOLO evolution. These models are evaluated against the recently introduced YOLOv26s architecture, which has been explicitly designed for low power edge devices and optimized deployment scenarios. This design enables a direct and controlled comparison between prior compact YOLO variants and a newer edge-oriented architecture under identical experimental conditions.
3.5.1 YOLO architecture overview
YOLO (You Only Look Once) is a one stage object detection framework that formulates object detection as a single regression problem, jointly predicting object locations and class probabilities from the entire image. By avoiding explicit region proposal mechanisms, YOLO based models achieve low inference latency and simplified deployment pipelines compared to two stage detectors.
Across successive generations, YOLO architectures have incorporated architectural refinements aimed at improving feature representation, training stability, and optimization efficiency, while maintaining a strong emphasis on real-time inference. These refinements include improved backbone designs, enhanced feature aggregation strategies, and decoupled detection heads, which collectively contribute to improved performance efficiency trade offs across different deployment scenarios.
3.5.2 Evaluated model variants
The experimental evaluation includes compact variants of five YOLO architectures, namely YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv26s. All models were initialized using publicly available pretrained weights provided by the Ultralytics framework and subsequently fine tuned on the same weapon detection dataset under identical training and evaluation conditions.
The selection of small and compact variants was a deliberate design choice aimed at reflecting realistic deployment constraints encountered in edge based video surveillance systems. By restricting the comparison to models with comparable parameter scales and computational complexity, the analysis isolates architectural and optimization differences as the primary factors influencing detection performance.
Table 3 summarizes the main architectural characteristics of the evaluated YOLO variants. Parameter counts and computational costs correspond to official Ultralytics documentation evaluated on the MS COCO benchmark with an input resolution of 640 × 640 and are reported for reference purposes only (Ultralytics, 2023, 2024a,b, 2025a,b). These values are provided to contextualize relative model complexity and are not used as performance indicators for the weapon detection task addressed in this study.
Table 3
| Model | Parameters (M) | FLOPs (B) |
|---|---|---|
| YOLOv8s | 11.2 | 28.6 |
| YOLOv9s | 7.2 | 26.7 |
| YOLOv10s | 7.2 | 21.6 |
| YOLOv11s | 9.4 | 21.5 |
| YOLOv26s | 9.5 | 18.9 |
General characteristics of the evaluated YOLO model variants with an input resolution of 640 × 640.
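For reference, comparable complexity figures can be queried directly from the framework, as sketched below. The returned tuple layout follows the Ultralytics model.info() convention of layer, parameter, gradient, and GFLOP counts; the weight identifiers are the public names and may differ for YOLOv26s.

```python
# Sketch for querying the complexity figures of Table 3 from the framework
# itself; the tuple layout assumes the Ultralytics model.info() convention.
from ultralytics import YOLO

for weights in ["yolov8s.pt", "yolov9s.pt", "yolov10s.pt", "yolo11s.pt"]:
    n_layers, n_params, n_grads, gflops = YOLO(weights).info(verbose=False)
    print(f"{weights}: {n_params / 1e6:.1f} M parameters, {gflops:.1f} GFLOPs")
```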
4 Results
4.1 Overall detection performance
This section presents a comparative evaluation of compact YOLO based object detection architectures, including YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and the recently introduced YOLOv26s. All models were trained and evaluated under identical experimental conditions, and all results reported in this section were obtained on the held out test subset following the evaluation protocol described in Section 3.4. This unified setup ensures a fair and reproducible comparison across model generations and architectural design paradigms.
Table 4 summarizes the overall detection performance in terms of precision, recall, F1-score, and mean Average Precision (mAP). While mAP metrics provide a general indication of detection and localization accuracy, greater emphasis is placed on precision and F1-score due to the safety critical nature of weapon detection, where false positives may trigger unnecessary alerts and significantly degrade system reliability in operational surveillance environments.
Table 4
| Model | Precision | Recall | F1-score | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|
| YOLOv8s | 0.899 | 0.892 | 0.895 | 0.936 | 0.811 |
| YOLOv9s | 0.876 | 0.907 | 0.891 | 0.953 | 0.837 |
| YOLOv10s | 0.919 | 0.869 | 0.893 | 0.953 | 0.825 |
| YOLOv11s | 0.908 | 0.897 | 0.903 | 0.944 | 0.827 |
| YOLOv26s | 0.913 | 0.920 | 0.916 | 0.954 | 0.847 |
Overall detection performance on the test set.
Overall, all evaluated YOLO variants achieve competitive detection performance, albeit with noticeable trade offs between precision and recall. YOLOv10s and YOLOv11s exhibit higher precision values, indicating more conservative detection behavior and improved false positive control, whereas YOLOv9s achieves higher recall at the expense of increased false alarms. These trends reflect architectural evolution within convolution based YOLO generations and differing optimization priorities.
Notably, YOLOv26s achieves detection performance comparable to, and in some metrics exceeding, that of the established small YOLO variants, particularly in terms of mAP. At epoch 50, YOLOv26s attains a mAP@0.5 of 0.954 and a mAP@0.5:0.95 of 0.847, demonstrating strong localization accuracy despite its edge-oriented design. Unlike previous YOLO “small” variants, YOLOv26s is not strictly derived from a scaled down convolutional backbone, but rather represents a design explicitly optimized for edge deployment. This distinction makes direct one to one comparisons non trivial, yet the observed results indicate that YOLOv26s preserves competitive detection effectiveness while targeting improved computational efficiency.
Figure 3 presents the confusion matrices obtained on the test set for all evaluated YOLO variants, providing a detailed view of class wise detection behavior under realistic surveillance conditions. Unlike aggregate metrics, confusion matrices allow direct inspection of misclassification patterns, particularly those involving visually similar non-weapon objects that act as hard negatives.
Figure 3
Across all models, weapon classes (pistol and knife) exhibit high true positive rates, confirming the ability of compact YOLO architectures to capture discriminative weapon related visual features. However, non negligible confusion persists between weapons and certain everyday handheld objects, such as smartphones, wallets, and banknotes, which share similar geometric and appearance cues in low resolution or partially occluded scenes.
A comparative analysis reveals that newer architectures, especially YOLOv10s, YOLOv11s, and YOLOv26s, demonstrate improved discrimination between weapon and non-weapon classes, reflected by reduced false positive activations on hard negative categories. In contrast, earlier variants show a more permissive detection behavior, leading to higher rates of false alarms. These observations highlight that architectural evolution not only impacts overall accuracy but also significantly affects operational reliability, which is critical for deployment in real world surveillance systems.
4.2 Edge deployment performance on Jetson Nano
Following the protocol described in Section 3.3.1, the deployment-oriented analysis was conducted on an NVIDIA Jetson Nano device to assess real-time performance under embedded conditions. Measuring the complete detection pipeline, including preprocessing, inference, and post-processing stages, enables a realistic characterization of model behavior in operational surveillance scenarios, where total latency rather than isolated inference time determines deployment viability.
Table 5 summarizes the end-to-end latency and throughput for all evaluated models across three representative scenarios: smartphone, pistol, and wallet. A clear separation emerges between two groups of detectors. YOLOv8s, YOLOv9s, and YOLOv11s exhibit inference times comparable to those of the other models but suffer from substantial post-processing overhead, which increases total latency to approximately 296–315 ms and limits throughput to around 3 FPS. In contrast, YOLOv10s and YOLOv26s maintain post-processing costs close to 2 ms, resulting in total latencies of around 124–130 ms and throughput above 7.4 FPS.
Table 5
| Model | Scenario | Pre (ms) | Infer (ms) | Post (ms) | Total (ms) | FPS |
|---|---|---|---|---|---|---|
| YOLOv26s | Smartphone | 15.37 | 107.40 | 2.24 | 125.02 | 7.71 |
| YOLOv26s | Pistol | 16.16 | 107.35 | 2.16 | 125.67 | 7.70 |
| YOLOv26s | Wallet | 15.08 | 106.86 | 2.22 | 124.17 | 7.76 |
| YOLOv10s | Smartphone | 15.28 | 112.87 | 2.21 | 130.37 | 7.41 |
| YOLOv10s | Pistol | 15.05 | 112.72 | 2.27 | 130.04 | 7.44 |
| YOLOv10s | Wallet | 15.37 | 112.91 | 2.25 | 130.53 | 7.41 |
| YOLOv8s | Smartphone | 14.84 | 105.23 | 175.46 | 295.53 | 3.34 |
| YOLOv8s | Pistol | 14.82 | 105.26 | 181.28 | 301.37 | 3.27 |
| YOLOv8s | Wallet | 14.98 | 105.24 | 175.45 | 295.68 | 3.33 |
| YOLOv9s | Smartphone | 15.20 | 117.55 | 177.43 | 310.18 | 3.18 |
| YOLOv9s | Pistol | 15.36 | 116.44 | 183.56 | 315.36 | 3.13 |
| YOLOv9s | Wallet | 15.02 | 115.78 | 176.22 | 307.02 | 3.21 |
| YOLOv11s | Smartphone | 14.90 | 107.02 | 176.65 | 298.59 | 3.30 |
| YOLOv11s | Pistol | 14.99 | 106.90 | 180.09 | 301.98 | 3.27 |
| YOLOv11s | Wallet | 15.02 | 107.06 | 176.61 | 298.70 | 3.30 |
End-to-end latency comparison on Jetson Nano across representative scenarios.
These results indicate that end-to-end deployment performance is dominated by post-processing complexity rather than inference alone. The Jetson Nano experiments therefore reveal a deployment behavior that cannot be inferred from conventional benchmark metrics on desktop GPUs.
As shown in Figure 4, the total latency on the Jetson Nano can be decomposed into preprocessing, inference, and post-processing stages for each evaluated model. The figure shows that post-processing dominates the overall latency in YOLOv8s, YOLOv9s, and YOLOv11s, accounting for the largest portion of execution time. In contrast, YOLOv10s and YOLOv26s exhibit minimal post-processing overhead, resulting in significantly lower end-to-end latency. This behavior explains the substantial differences in real-time performance despite similar inference times across models.
Figure 4
4.3 Scenario-based semantic analysis
To complement the latency analysis, a scenario-based evaluation was conducted to assess the semantic behavior of the models under representative surveillance conditions. Rather than performing an exhaustive per-class evaluation, three critical scenarios were selected: (i) weapon detection (pistol), (ii) common object scenario prone to false positives (smartphone), and (iii) visually similar object scenario (wallet). This design captures the most relevant operational challenges in surveillance systems, namely threat detection, false alarm reduction, and discrimination against hard negatives.
As shown in Figure 5, YOLOv10s and YOLOv26s achieve more than twice the throughput, measured in frames per second (FPS), compared with YOLOv8s, YOLOv9s, and YOLOv11s. This improvement is directly associated with the reduced post-processing cost, which enables more efficient real-time performance on embedded hardware.
Figure 5
As shown in Figure 6, post-processing accounts for more than 50% of the total execution time in YOLOv8s, YOLOv9s, and YOLOv11s, whereas its contribution is negligible in YOLOv10s and YOLOv26s. This result highlights that post-processing, rather than inference, is the primary bottleneck in real-time edge deployment.
Figure 6
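These proportions follow directly from the Table 5 measurements; taking the smartphone scenario as a worked example:

```latex
\underbrace{\frac{175.46}{295.53} \approx 0.59}_{\text{YOLOv8s post-processing share}}
\qquad
\underbrace{\frac{2.24}{125.02} \approx 0.02}_{\text{YOLOv26s post-processing share}}
\qquad
\underbrace{\frac{7.41}{3.34} \approx 2.2}_{\text{YOLOv10s vs. YOLOv8s throughput}}
```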
Table 6 summarizes the scenario-based semantic behavior on Jetson Nano. All evaluated models successfully detected the pistol class, confirming their capability to identify weapon instances under controlled conditions. However, relevant differences emerge in the non-weapon scenarios. YOLOv8s produces false weapon detections in the smartphone scenario, whereas YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv26s avoid false alarms in that setting. In the wallet scenario, YOLOv26s shows a small number of false positives, while the remaining models maintain zero false weapon activations.
Table 6
| Model | Smartphone (false positives) | Pistol (detections) | Wallet (false positives) |
|---|---|---|---|
| YOLOv26s | 0 | 166 | 2 |
| YOLOv10s | 0 | 191 | 0 |
| YOLOv8s | 3 | 181 | 0 |
| YOLOv9s | 0 | 193 | 0 |
| YOLOv11s | 0 | 178 | 0 |
Scenario-based detection behavior on Jetson Nano.
These results indicate that the deployment suitability of a detector is not determined exclusively by its throughput, but also by its robustness against false alarms in semantically ambiguous situations. In practical surveillance systems, this property is critical because false positives can trigger unnecessary interventions and reduce operator trust in the system.
4.4 Comparison of training dynamics and optimization behavior
The following results describe the evolution of training and validation losses, together with precision, recall, and mean Average Precision metrics for all evaluated compact YOLO architectures. Although each model follows the same experimental protocol, notable differences can be observed in convergence speed, loss smoothness, and metric stabilization, reflecting architectural and optimization related characteristics.
4.4.1 Loss convergence analysis
As shown in Figure 7, the training losses (train/box_loss, train/cls_loss, and train/dfl_loss) exhibit a consistent decreasing trend across all models, confirming stable optimization under the unified training configuration. Bounding box loss decreases progressively throughout training, with a clear two phase behavior: a rapid reduction during the first 10–15 epochs, followed by a slower refinement phase. This pattern is consistently observed across YOLOv8s, YOLOv9s, YOLOv10s, YOLOv11s, and YOLOv26s, indicating comparable localization learning dynamics.
Figure 7
Classification loss shows the steepest initial decrease for all models, suggesting fast adaptation to class level discrimination. However, earlier architectures such as YOLOv8s and YOLOv9s display slightly higher variance during the first epochs, whereas newer models, particularly YOLOv10s, YOLOv11s, and YOLOv26s, demonstrate smoother and more regular loss trajectories. This behavior indicates improved optimization stability and better conditioned learning dynamics in later generations.
The distribution focal loss (DFL), which directly influences bounding box quality, converges more gradually. YOLOv26s exhibits one of the smoothest DFL decay patterns, with reduced oscillations in both training and validation curves, suggesting enhanced robustness in localization refinement under edge-oriented design constraints.
4.4.2 Training vs. validation consistency
Validation losses closely follow their training counterparts for all models, with no significant divergence observed across epochs, as shown in Figure 8. This close alignment between training and validation curves indicates limited overfitting and confirms that the selected training duration of 50 epochs is sufficient for convergence under the adopted experimental configuration.
Figure 8
Minor fluctuations in the validation distribution focal loss (DFL) can be observed across all models, particularly during intermediate epochs. These variations are primarily attributed to the presence of small objects and visually ambiguous hard negative classes in the dataset, which introduce additional localization uncertainty during training.
Notably, YOLOv26s maintains one of the smallest gaps between training and validation losses throughout the optimization process, reflecting improved generalization behavior despite its reduced computational footprint. This characteristic is particularly relevant for edge deployment scenarios, where robustness and stability are often prioritized over marginal gains in absolute accuracy.
4.4.3 Precision and recall evolution
Precision and recall metrics increase rapidly during the early stages of training and stabilize after approximately 15–20 epochs for all evaluated architectures, as illustrated in Figure 9. Precision curves exhibit a gradual and steady increase, eventually converging near or above 0.90 for the most recent models, indicating increasingly confident decision boundaries as training progresses.
Figure 9
YOLOv10s and YOLOv11s show smoother precision trajectories with fewer abrupt oscillations compared to earlier architectures, reflecting improved stability during optimization. This behavior is consistent with their more conservative detection profiles and reduced false positive tendencies observed in the class wise confusion matrix analysis.
In the case of YOLOv26s, precision rises sharply during the initial epochs and stabilizes early, maintaining consistently high values with minimal variance throughout the remainder of training. Recall for YOLOv26s follows a similar trend, rapidly increasing during early optimization and reaching a stable plateau without significant oscillations. This balanced and stable evolution of precision and recall indicates that YOLOv26s achieves robust detection performance while maintaining controlled false positive rates, despite its edge-oriented and computationally efficient design.
4.4.4 Mean average precision trends
The evolution of metrics/mAP@0.5 and metrics/mAP@0.5:0.95 further confirms progressive improvements in both detection confidence and localization accuracy across training epochs, as shown in Figure 10. All models achieve rapid mAP gains during the initial training phase, followed by incremental improvements that stabilize toward the final epochs.
Figure 10
As expected, the stricter localization metric mAP@0.5:0.95 converges more slowly than mAP@0.5, highlighting the increasing difficulty of precise bounding box alignment under more demanding overlap criteria.
YOLOv26s reaches competitive mAP levels while maintaining smooth and stable convergence behavior, demonstrating that its edge-oriented architectural design does not compromise detection effectiveness. On the contrary, the reduced variance observed in its training curves suggests improved optimization efficiency and robustness, which are desirable properties for deployment in resource constrained edge environments.
4.4.5 Comparative interpretation
Overall, the comparative analysis of training dynamics indicates that all evaluated YOLO models converge reliably under identical conditions. Architectural evolution primarily affects convergence smoothness, variance reduction, and generalization stability rather than fundamental learnability. Newer models, especially YOLOv10s, YOLOv11s, and YOLOv26s, exhibit more regular loss decay and smoother metric stabilization, which aligns with their improved discrimination behavior observed in confusion matrix analysis.
These findings confirm that differences in final detection performance are not driven by training instability but by architectural design choices and optimization strategies, reinforcing the fairness and validity of the comparative evaluation.
5 Discussion
5.1 Model comparability and architectural interpretation
Despite the extensive evaluation of compact YOLO architectures, a direct one-to-one comparison between YOLOv26s and previous small variants is inherently constrained by architectural and design differences. Unlike earlier YOLO generations, which follow a well-defined scale hierarchy, YOLOv26 introduces an edge-oriented design paradigm that departs from traditional scale-based categorization. The edge deployment experiments further clarify this distinction. While all evaluated models exhibit relatively comparable inference times, their real-time behavior differs substantially once post-processing is considered. This indicates that the practical suitability of a detector for embedded surveillance depends not only on backbone complexity or FLOPs, but also on the output representation and the cost of candidate filtering at runtime.
From this perspective, the models evaluated in this study can be grouped into two deployment categories: (i) architectures with dense predictions and costly post-processing, including YOLOv8s, YOLOv9s, and YOLOv11s, and (ii) architectures with compact detection outputs and minimal post-processing, including YOLOv10s and YOLOv26s. This distinction constitutes one of the main findings of this study. Furthermore, the experimental evidence suggests that conventional indicators such as FLOPs or backbone complexity are insufficient to explain real-time deployment performance. Although all evaluated models present comparable inference times, their end-to-end latency differs substantially due to variations in post-processing cost. This finding highlights that architectural design choices related to output representation and candidate filtering mechanisms have a direct impact on deployment efficiency. Consequently, evaluating object detectors solely based on inference speed or accuracy may lead to misleading conclusions when targeting embedded surveillance systems.
5.2 Qualitative evaluation on test set
In addition to quantitative metrics, qualitative evaluation was conducted on images from the held-out test subset to visually assess detection behavior under realistic surveillance conditions. These images were not used during training or validation and therefore provide an unbiased illustration of model generalization performance. Figure 11 presents representative detection outputs obtained during inference on test images. The examples include correctly detected weapons, challenging cases involving partial occlusions, and scenes containing visually similar non-weapon objects acting as hard negatives.
Figure 11
Overall, the qualitative results are consistent with the quantitative findings. Newer architectures, particularly YOLOv10s, YOLOv11s, and YOLOv26s, exhibit more precise bounding box localization and reduced false positive activations on ambiguous handheld objects. In contrast, earlier models show a higher tendency to trigger detections on visually similar non-weapon items, supporting the class-wise error patterns observed in the confusion matrix analysis.
5.3 Dataset design and class imbalance considerations
An important aspect of the experimental design concerns the class imbalance present in the dataset. This imbalance is not incidental but originates from the original formulation of the Sohas weapon detection problem (Pérez-Hernández et al., 2020), where the primary objective is not only to detect weapons but to discriminate between objects that are visually similar and handled in a similar manner.
In particular, the dataset explicitly includes non-weapon classes such as smartphones, wallets, banknotes, and cards, which act as hard negative samples. These objects are intentionally incorporated because they are frequently confused with weapons in real surveillance scenarios. As reported in prior work, the detection of small objects handled similarly represents a challenging task, where models tend to produce false positives due to similarities in shape, pose, and context.
Furthermore, previous studies have shown that class imbalance is inherent to this problem, as certain classes such as pistols and knives are more consistently represented and easier to learn, while visually similar objects introduce higher ambiguity. This imbalance contributes directly to the complexity of the classification task and reflects realistic operational conditions rather than artificially balanced benchmarks.
From a deployment perspective, maintaining this imbalance is essential to properly evaluate model robustness in safety-critical environments. Artificial balancing techniques, such as oversampling or class-weighted loss adjustments, may lead to optimistic performance estimates but would reduce the realism of the evaluation. In contrast, the adopted dataset configuration enables a more stringent assessment of false positive behavior, which is a critical factor in real-time weapon detection systems.
Therefore, the experimental setup intentionally preserves the original class distribution to ensure that the evaluation reflects real-world surveillance conditions, particularly in scenarios where minimizing false alarms is more critical than achieving uniform class-wise performance.
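For reproducibility, the preserved class distribution can be inspected directly from the annotation files. The sketch below assumes YOLO-format labels (one "class x y w h" row per object); the directory path and class-name ordering are illustrative placeholders rather than the exact layout of the released dataset.

```python
from collections import Counter
from pathlib import Path

# Count per-class object instances across YOLO-format label files.
LABEL_DIR = Path("datasets/weapons/labels/train")  # assumed location
CLASS_NAMES = ["pistol", "knife", "smartphone", "wallet", "banknote", "card"]  # assumed order

counts = Counter()
for label_file in LABEL_DIR.glob("*.txt"):
    for line in label_file.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1  # first field is the class id

total = sum(counts.values())
for cls_id, n in sorted(counts.items()):
    print(f"{CLASS_NAMES[cls_id]:>10}: {n:5d} ({100 * n / total:.1f}%)")
```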
5.4 Impact of post-processing on real-time performance
The edge deployment results reveal that post-processing is the dominant factor affecting real-time performance on embedded devices. While inference time remains relatively stable across all evaluated models (approximately 105–115 ms), post-processing introduces a significant overhead in certain architectures. Models such as YOLOv8s, YOLOv9s, and YOLOv11s rely on dense prediction outputs followed by filtering and suppression operations, which increase post-processing time to approximately 175–183 ms. As a result, total latency approaches 300 ms, limiting throughput to around 3 FPS.
In contrast, YOLOv10s and YOLOv26s produce compact detection outputs, effectively reducing post-processing time to approximately 2 ms. This reduction leads to total latencies close to 125–130 ms and enables real-time performance above 7 FPS on Jetson Nano. This distinction demonstrates that the main performance bottleneck in real-world deployments is not the inference stage, but the cost of post-processing operations. Therefore, optimizing output representation is critical for achieving efficient edge-based surveillance systems.
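A minimal measurement sketch, assuming the Ultralytics Python API, shows how this stage-wise breakdown can be reproduced: each prediction result exposes per-stage times in milliseconds through its speed attribute. The weights file and video source below are placeholders, not the exact artifacts used in the experiments.

```python
from statistics import mean
from ultralytics import YOLO

# Stage-wise latency measurement in the spirit of the Jetson Nano protocol.
model = YOLO("yolov10s.pt")  # assumed local weights file
stages = {"preprocess": [], "inference": [], "postprocess": []}

# stream=True yields one Results object per frame without buffering the video.
for result in model.predict("surveillance.mp4", stream=True, verbose=False):
    for stage, ms in result.speed.items():
        stages[stage].append(ms)

means = {s: mean(v) for s, v in stages.items()}
end_to_end = sum(means.values())
print({s: f"{t:.1f} ms" for s, t in means.items()})
print(f"end-to-end: {end_to_end:.1f} ms -> {1000 / end_to_end:.1f} FPS")
```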
5.5 Beyond benchmarking: deployment-oriented insights
Unlike traditional comparative studies that focus primarily on accuracy and inference speed, this work provides a deployment-oriented perspective by analyzing the complete detection pipeline under real-world edge conditions. The results demonstrate that models with similar accuracy and inference performance can behave drastically differently when deployed on resource-constrained hardware. This finding introduces a practical evaluation criterion that complements conventional benchmarks, emphasizing the importance of end-to-end latency over isolated metrics.
From a systems perspective, this study highlights that real-time surveillance applications require a holistic evaluation approach, where preprocessing, inference, and post-processing are jointly considered. This contribution extends beyond standard benchmarking and provides actionable insights for selecting object detection models in edge environments.
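The throughput implication follows directly from the stage decomposition; as a worked check using the end-to-end latencies reported above:

\[
\mathrm{FPS} = \frac{1000\ \mathrm{ms/s}}{t_{\mathrm{pre}} + t_{\mathrm{inf}} + t_{\mathrm{post}}}
\]

so dense-output models reach roughly \(1000 / 300 \approx 3.3\) FPS, while compact-output models reach \(1000 / 130 \approx 7.7\) to \(1000 / 125 = 8.0\) FPS, consistent with the measured doubling of achievable frame rate.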
5.6 Limitations and future work
Despite the comprehensive evaluation, this study presents certain limitations. First, the analysis focuses on a single hardware platform (Jetson Nano), and performance may vary on edge devices with different computational capabilities.
Second, although three representative scenarios were selected, the evaluation does not cover all possible real-world conditions, such as extreme illumination changes, severe occlusions, or multi-object interactions in crowded scenes.
Future work will extend this analysis to additional edge platforms and incorporate more complex surveillance scenarios, including multi-camera systems and real-time tracking. Furthermore, integrating energy consumption and power efficiency metrics would provide a more complete understanding of deployment trade-offs in edge artificial intelligence systems.
6 Conclusions
This work presented a homogeneous and systematic evaluation of compact YOLO based object detection architectures for real-time weapon detection in video surveillance, with a particular emphasis on assessing the capabilities of the recently introduced YOLOv26s model in edge-oriented deployment scenarios. Established compact YOLO variants (YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s) were used as reference baselines to contextualize the performance and behavioral characteristics of YOLOv26s under identical experimental conditions.
All models were trained and evaluated using the same dataset, data partitions, training configuration, and computational resources, with final performance assessed exclusively on a held out test set. This unified experimental design ensured that observed differences in detection accuracy, convergence behavior, and class wise performance could be attributed primarily to architectural design choices rather than experimental bias.
The experimental results indicate that YOLOv26s achieves detection performance comparable to, and in some metrics exceeding, that of established compact YOLO variants, despite not being a conventional “small” scaled version derived from earlier convolution based architectures. In particular, YOLOv26s demonstrates competitive mAP values while maintaining smooth training convergence, reduced metric oscillations, and stable optimization behavior. These findings suggest that an edge first architectural design can preserve high detection effectiveness without relying on increased model complexity.
Class wise analysis supported by confusion matrices revealed that YOLOv26s exhibits improved discrimination between weapon classes and visually similar non-weapon objects acting as hard negatives. Most residual errors across all models arise from object level visual similarity rather than background confusion; however, YOLOv26s shows a reduced tendency toward false positive activations on ambiguous handheld objects. This behavior is particularly relevant for safety critical surveillance systems, where minimizing false alarms is often prioritized over marginal gains in recall.
Beyond conventional accuracy metrics, this study demonstrates that real-time deployment performance is strongly influenced by the computational cost of post-processing. Although all evaluated models present similar inference times, the results show that architectures relying on dense prediction outputs incur a substantial post-processing overhead, which significantly increases total latency and limits throughput on embedded devices.
In contrast, YOLOv10s and YOLOv26s achieve near-negligible post-processing times, enabling end-to-end latencies close to 125–130 ms and real-time performance above 7 FPS on the Jetson Nano platform. This finding reveals that post-processing, rather than inference, constitutes the dominant bottleneck in practical edge deployments, and highlights the importance of output representation design for achieving efficient real-time performance.
From an edge computing perspective, the results highlight that YOLOv26s achieves a favorable balance between detection accuracy and computational efficiency. Unlike earlier compact YOLO variants that rely primarily on scaled down convolutional designs, YOLOv26s represents a distinct architectural paradigm explicitly optimized for edge deployment. The competitive performance observed in this study indicates that such design choices can yield practical advantages for real-time surveillance applications operating under resource constraints.
These findings extend beyond standard benchmarking by providing a deployment-oriented evaluation framework that considers the complete detection pipeline. The results emphasize that selecting object detection models for real-world applications requires a holistic analysis that jointly evaluates preprocessing, inference, and post-processing stages.
Despite these contributions, this study has several limitations. The evaluation was conducted on a single weapon detection dataset and focused on image based inference without explicit temporal modeling across video frames. Additionally, the edge deployment analysis was performed on a single hardware platform (Jetson Nano), and performance may vary across different embedded systems. Future work will investigate temporal integration strategies, multi camera fusion, and cross domain adaptation to diverse surveillance environments, as well as real world deployment of YOLOv26s on embedded edge hardware to evaluate latency, energy consumption, and long term operational stability.
Overall, this work provides empirical evidence that YOLOv26s constitutes a viable and competitive alternative to established compact YOLO models for real-time weapon detection in edge based video surveillance systems. More importantly, it demonstrates that post-processing efficiency is a critical factor for real-time performance, and that edge-oriented detector design represents a promising direction for next generation intelligent surveillance systems.
Statements
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/ari-dasci/OD-WeaponDetection/tree/master/Weapons%20and%20similar%20handled%20objects.
Author contributions
CF: Software, Formal analysis, Writing – original draft, Data curation, Investigation, Resources, Methodology, Visualization, Validation. CD-V-S: Conceptualization, Validation, Data curation, Investigation, Methodology, Writing – original draft, Visualization. CB: Methodology, Validation, Writing – review & editing, Formal analysis, Supervision, Visualization. JV-A: Methodology, Conceptualization, Project administration, Funding acquisition, Software, Investigation, Resources, Writing – review & editing, Formal analysis.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Acknowledgments
The results of this work are part of the project “Tecnologías de la Industria 4.0 en Educación, Salud, Empresa e Industria” developed by Universidad Tecnológica Indoamérica.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Al-Refai, G., Al-Refai, M., El-Moaqet, H., Ryalat, M., and Almtireen, N. (2025). “Performance evaluation of YOLOv7 for object detection in dark environments,” in Proceedings of the IEEE International Conference on Electro Information Technology (eIT). doi: 10.1109/eIT64391.2025.11103626
Bazan, R. C., Casanova, R., and Ugarte, W. (2024). “Use of a custom videogame dataset and YOLO model for accurate handgun detection in real time video security applications,” in Proceedings of the International Conference on Enterprise Information Systems (ICEIS). doi: 10.5220/0012716500003690
Beca, V., Zamora, B. A., Paredes, C. M., Dinas, S., and Llanos-Neuta, N. (2026). “Anomaly detection system based on three dimensional convolutional neural networks and YOLO on surveillance videos,” in Communications in Computer and Information Science (Springer). doi: 10.1007/978-3-032-08203-9_27
Berardini, D., Migliorelli, L., Cardoni, L., Parente, C., Rongoni, A., Sergiacomi, D., et al. (2024). “Benchmark analysis of YOLOv8 for edge artificial intelligence video surveillance applications,” in Proceedings of the 20th IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications (MESA), 1–6. doi: 10.1109/MESA61532.2024.10704889
Bhandari, M., Naidu, V., Lawhale, M., and Kakade, M. (2025). “Weapon detection in video surveillance systems,” in Lecture Notes in Networks and Systems (Singapore: Springer). doi: 10.1007/978-981-96-5604-2_24
Burnayev, Z. (2023). Weapons detection system based on edge computing and computer vision. Int. J. Adv. Comput. Sci. Applic. 14, 812–820. doi: 10.14569/IJACSA.2023.0140586
Chaman, M., El Maliki, A., Dahou, H., and Hadjoudja, A. (2026). Benchmarking YOLO based deep learning models for real time object detection in hybrid advanced driver assistance systems and intelligent transportation systems. Results Eng. 72:108942. doi: 10.1016/j.rineng.2025.108942
Chaware, P., Dhamdhere, V., Sawane, M., Patil, S., Sarode, P., et al. (2025). Development of intelligent video surveillance system using deep learning and convolutional neural networks: a proactive security solution. Eng. Proc. 114:11. doi: 10.3390/engproc2025114011
da Luz, G. P. C. P., Sato, G. M., Gonzalez, L. F. G., and Borin, J. F. (2026). Smart parking with pixel wise region of interest selection for vehicle detection using YOLOv8, YOLOv9, YOLOv10, and YOLOv11. Internet Things 36:101858. doi: 10.1016/j.iot.2025.101858
Debnath, M. K., and Debnath, R. (2021). A comprehensive survey on computer vision based concepts, methodologies, analysis and applications for automatic gun and knife detection. J. Vis. Commun. Image Represent. 78:103165. doi: 10.1016/j.jvcir.2021.103165
Dey, L. S., and Dey, S. (2024). “A deep learning approach to gun detection in surveillance systems using YOLOv9,” in Proceedings of the 2024 Control Instrumentation Systems Conference (CISCON), 1–5. doi: 10.1109/CISCON62171.2024.10696152
Duggimpudi, S. R., Solipuram, R. R., Kottur, M., Rout, S. K., and Mishra, N. (2026). “Enhanced road safety through video based helmet violation detection,” in Lecture Notes in Electrical Engineering (Springer). doi: 10.1007/978-981-95-0269-1_218
Farhan, M., Akhtar, M. N., and Bakar, E. A. (2025). Efficient real time palm oil tree detection and counting using YOLOv8 deployed on edge devices. J. Umm Al-Qura Univ. Eng. Archit. 16, 1293–1308. doi: 10.1007/s43995-025-00164-7
Golande, R., Bhapkar, R., and Nalawade, A. (2025). Weapon detection system: real time object recognition for threat detection. Int. J. Res. Appl. Sci. Eng. Technol. 13, 100–108. doi: 10.22214/ijraset.2025.68105
Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., et al. (2019). “Searching for MobileNetV3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). doi: 10.1109/ICCV.2019.00140
Hsueh, J., and Yang, C. T. (2025). Using a high precision YOLO surveillance system for gun detection to prevent mass shootings. AI 6:198. doi: 10.3390/ai6090198
Intagorn, S., Panmuang, M., Rodmorn, C., and Pinitkan, S. (2025). Evaluating the performance of YOLO architectures for effective gun and knife detection. Eng. Access 11, 170–177. doi: 10.14456/mijet.2025.16
Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., et al. (2017). Neurosurgeon: collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Comput. Archit. News 45, 615–629. doi: 10.1145/3093337.3037698
Keerthana, S. M. R. S., and Yadav, P. (2024). “Weapon detection for security using the YOLO algorithm with email alert notification,” in Proceedings of the 2024 International Conference on Innovations and Challenges in Emerging Technologies (ICICET). doi: 10.1109/ICICET59348.2024.10616365
Khanam, R., Asghar, T., and Hussain, M. (2025). Comparative performance evaluation of YOLOv5, YOLOv8, and YOLOv11 for solar panel defect detection. Solar 5:6. doi: 10.3390/solar5010006
Krishna, A., and Poonkodi, M. (2026). “Low light weapon detection using YOLOv8,” in Lecture Notes in Networks and Systems (Springer).
Li, Y., Cheng, Y., Pan, Y., He, W., and Wang, Q. (2025). “Semantic aware hard negative mining for medical vision language contrastive pretraining,” in Proceedings of the 33rd ACM International Conference on Multimedia (MM). doi: 10.1145/3746027.3754991
Liang, S., Feng, X., Xie, M., Tang, Q., and Zhu, H. (2025). Lightweight YOLO-SR: a method for small object detection in unmanned aerial vehicle aerial images. Appl. Sci. 15:13063. doi: 10.3390/app152413063
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). “Microsoft COCO: common objects in context,” in European Conference on Computer Vision (ECCV), 740–755. doi: 10.1007/978-3-319-10602-1_48
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., et al. (2020). Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128, 261–318. doi: 10.1007/s11263-019-01247-4
Mukto, M. M., Hasan, M., Al Mahmud, M. M., Haque, I., and Ahmed, M. A. (2024). Design of a real time crime monitoring system using deep learning techniques. Intell. Syst. Applic. 21:200311. doi: 10.1016/j.iswa.2023.200311
Murugan, T., Badusha, N. A. N. M., Semaihi, A. R. O. A., Alkindi, M. M. R., and Alnaqbi, E. M. R. (2025). Artificial intelligence based weapon detection for security surveillance: recent research advances, 2016 to 2025. Electronics 14:4609. doi: 10.3390/electronics14234609
Navalgund, U. V., and Priyadharshini, K. (2018). “Crime intention detection system using deep learning,” in Proceedings of the IEEE International Conference on Computing, Communication and Automation. doi: 10.1109/ICCSDET.2018.8821168
Pérez-Hernández, F., Tabik, S., Lamas, A., Olmos, R., Fujita, H., and Herrera, F. (2020). Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: application in video surveillance. Knowl.-Based Syst. 194:105590. doi: 10.1016/j.knosys.2020.105590
Rokade, S. A., Jadhav, S. A., and Hinge, V. U. (2025). Sentinel AI: AI powered real time surveillance for intelligent threat detection. Int. J. Recent Adv. Eng. Technol. 13, 45–52.
Salim, R. M. W., and Che, Y. C. (2023). “Weapon detection using SSD MobileNet V2 and SSD ResNet 50,” in AIP Conference Proceedings, 2680. doi: 10.1063/5.0128081
Sandhu, S., and Bhatia, R. (2025). A review of visible and concealed weapon detection techniques using machine learning and deep learning. SN Comput. Sci. 6:171. doi: 10.1007/s42979-025-04173-0
Santos, T., Oliveira, H., and Cunha, A. (2024). A systematic review on weapon detection in surveillance footage using deep learning. Comput. Sci. Rev. 51:100612. doi: 10.1016/j.cosrev.2023.100612
Ultralytics (2023). YOLOv8: model architecture and performance. Software documentation and model repository.
Ultralytics (2024a). YOLOv10: model architecture and performance. Software documentation and model repository.
Ultralytics (2024b). YOLOv9: model architecture and performance. Software documentation and model repository.
Ultralytics (2025a). YOLOv11: model architecture and performance. Software documentation and model repository.
Ultralytics (2025b). YOLOv26: edge-oriented model architecture and performance metrics. Software documentation and model repository.
Warsi, A., Abdullah, M., Husen, M. N., Yahya, M., Khan, S., and Jawaid, N. (2019). “Gun detection system using YOLOv3,” in Proceedings of the IEEE International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), 1–4. doi: 10.1109/ICSIMA47653.2019.9057329
Yadav, P. G. N., and Yadav, S. P. (2025). WeaponVision AI: a software for strengthening surveillance through deep learning in real time automated weapon detection. Int. J. Inf. Technol. 17, 1717–1727. doi: 10.1007/s41870-024-02375-y
Zhang, X., Wang, Y., Lu, S., and Shi, W. (2019). Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proc. IEEE 107, 1738–1762. doi: 10.1109/JPROC.2019.2918951
Keywords
comparative analysis, deep learning, edge computing, video surveillance, weapon detection, YOLO
Citation
Fierro Silva CJ, Del-Valle-Soto C, Bran C and Varela-Aldás J (2026) Comparative analysis of previous YOLO detectors and YOLOv26s for real-time weapon detection in video surveillance. Front. Comput. Sci. 8:1789702. doi: 10.3389/fcomp.2026.1789702
Received
16 January 2026
Revised
29 March 2026
Accepted
02 April 2026
Published
28 April 2026
Volume
8 - 2026
Edited by
Ismail Elezi, Huawei Technologies, China
Reviewed by
Kelvii Wei Guo, City University of Hong Kong, Hong Kong, SAR China
Tejashree Tejpal Moharekar, Shivaji University, India
Harshad Lokhande, MIT Art Design and Technology University, India
Copyright
© 2026 Fierro Silva, Del-Valle-Soto, Bran and Varela-Aldás.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: José Varela-Aldás, josevarela@uti.edu.ec