Power China Chengdu Engineering Corporation Limited, Chengdu, China
Rock block detection plays a crucial role in geological exploration, mine safety monitoring, and landslide early warning systems. Traditional rock block detection methods face challenges of insufficient accuracy and poor real-time performance in complex geological environments. This paper proposes an enhanced rock block detection method based on an improved YOLOv11, termed YOLOv11-SNS, which integrates a lightweight Slim-Neck module into the YOLOv11 neck and replaces the original CIoU loss with SIoU loss, achieving an optimized balance between detection accuracy and computational efficiency. Experimental results show that YOLOv11-SNS achieves an mAP@0.5 of 0.684 at the same computational complexity as the baseline YOLOv11 (6.4 GFLOPs), a 1.03% improvement, with the F1 score increasing to 0.66, validating the effectiveness of the proposed method. This study provides an efficient and reliable technical solution for automatic rock block detection in complex geological environments.
1 Introduction
Rock block detection, as a key technology in geological engineering and mine safety, plays an important role in landslide monitoring, safe operations in open-pit mining, and risk assessment for tunnel construction (Gu and Cao, 2023; Nikolaidis and Saroglou, 2017). Accurately and rapidly identifying and localizing rock blocks not only provides data support for geological disaster early warning, but also effectively safeguards personnel and property (Xu et al., 2021; Zhao et al., 2025). However, due to the diverse shapes, complex textures, and highly variable environmental backgrounds of rock blocks, traditional image processing methods face numerous challenges in real-world applications (Napoli et al., 2021).
Traditional rock block detection methods primarily rely on manual visual inspection or semi-automated solutions based on classical image processing techniques. While manual inspection can be highly reliable, it suffers from low efficiency, strong subjectivity, and difficulty in achieving continuous monitoring. Methods based on classical image processing—such as edge detection, texture analysis, and morphological operations—can perform reasonably well in simple scenarios. However, in complex geological environments, they often struggle to accurately distinguish rock blocks from background, leading to a significant drop in detection accuracy (Chen et al., 2023; Goodman, 1995; Yang et al., 2022).
In recent years, rapid advances in deep learning have created new opportunities for rock block detection (Zhang et al., 2023). Convolutional Neural Networks (CNNs), with their powerful feature learning capabilities, have demonstrated excellent performance in object detection tasks. The YOLO (You Only Look Once) family, as a representative single-stage detection approach, has attracted widespread attention for its strong balance between speed and accuracy (Redmon et al., 2016; Terven et al., 2023). As the latest version, YOLOv11 enhances detection performance by introducing innovations such as the C3K2 module and the C2PSA attention mechanism. Rock block detection differs from general object detection due to its irregular targets, severe scale variation, and low discriminability. To address these challenges, we customize the model by incorporating a Slim-Neck to preserve fine-grained details and enhance multi-scale reuse, which is critical for distinguishing rock blocks from terrain, while adopting the SIoU loss to better adapt to irregular shapes and improve localization.
The main technical challenges in rock block detection include: first, the highly diverse morphological characteristics of rock blocks—significant differences in size, shape, and texture under varying geological conditions—which raise the bar for algorithm generalization (Cheng et al., 2024). Second, factors such as illumination changes, shadow occlusion, and weathering/erosion in natural environments severely affect the visual characteristics of rock blocks, increasing the difficulty of accurate detection. Third, practical applications demand strong real-time performance; particularly in urgent scenarios like landslide early warning, algorithms must respond quickly while providing accurate results (Jiang et al., 2022). Finally, resource constraints in deployment environments cannot be ignored: many geological monitoring devices have limited computational capacity, requiring detection algorithms to maintain low computational complexity while ensuring accuracy.
To address these challenges, this paper proposes an enhanced rock block detection method based on an improved YOLOv11. By thoroughly analyzing the architectural characteristics of YOLOv11 and the specific requirements of rock block detection, we optimize the original model along two key dimensions. First, we integrate a lightweight Slim-Neck module into the network’s neck, which optimizes multi-scale feature fusion pathways and reduces redundant computation, thereby lowering model parameters and computational complexity while maintaining detection accuracy. Second, we replace the original CIoU loss with SIoU loss (Gevorgyan, 2022), which jointly considers angle and shape losses to improve bounding box regression accuracy, particularly for irregularly shaped rock block targets.
2 Related work
2.1 Traditional rock block detection methods
The development of rock block detection technology has evolved from traditional image processing to deep learning approaches (Medley, 1994). Early rock block detection primarily relied on digital image processing, achieving target recognition by extracting low-level visual features such as color, texture, and shape. Threshold-based segmentation distinguishes rock blocks from the background by setting grayscale or color thresholds. Although these methods can be effective under uniform lighting and simple backgrounds, they are extremely sensitive to illumination changes and complex scenes. Edge detection algorithms such as Canny and Sobel locate rock block contours by identifying boundary information in images; however, the complex textures and irregular shapes of rock surfaces often lead to incomplete edges or a large number of spurious edges.
Texture analysis methods achieve recognition by extracting texture features from rock surfaces. Common descriptors include the gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP). While these methods can capture surface characteristics to some extent, their discriminative power degrades significantly for severely weathered rocks or surfaces with heavy coverage (Hoek, 1995). Morphological operations—such as erosion, dilation, opening, and closing—are used to refine shape features and are often employed in post-processing to improve detection results. Nonetheless, traditional methods generally suffer from limited feature expressiveness, reliance on hand-crafted features, and sensitivity to environmental changes, making them inadequate for rock block detection in complex geological settings.
The introduction of machine learning has opened new avenues for rock block detection. Classifiers such as support vector machines (SVM), random forests, and AdaBoost have been widely applied to rock block recognition tasks. These methods learn statistical patterns from training samples to build classification models, showing better generalization than classical image processing. Feature engineering plays a crucial role in such methods; researchers have proposed a variety of hand-crafted descriptors, including Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Speeded-Up Robust Features (SURF). Although machine learning improves detection performance to some extent, it remains constrained by the representational limits of hand-crafted features and struggles to fully capture the complex visual characteristics of rock blocks.
2.2 Deep learning object detection algorithms
The emergence of deep learning has fundamentally changed the research paradigm in object detection (Zhao et al., 2019). CNNs automatically learn hierarchical image representations through multi-layer nonlinear transformations, eliminating the need for laborious hand-crafted feature design. Two-stage detectors represented by the R-CNN family first generate region proposals and then classify and localize each proposal. R-CNN inaugurated the deep learning era of object detection by using selective search to generate candidate boxes and CNNs for feature extraction (Girshick et al., 2013). Fast R-CNN significantly improved detection speed by sharing convolutional features, while Faster R-CNN further introduced the region proposal network (RPN), enabling end-to-end training and inference (Ren et al., 2017).
One-stage detectors pursue higher speed by directly regressing object categories and locations from images. YOLO formulates detection as a regression problem, predicting multiple bounding boxes and class probabilities in a single forward pass, thereby greatly improving efficiency (Redmon et al., 2016). SSD (Single Shot MultiBox Detector) performs predictions on multi-scale feature maps, detecting objects of different sizes at different layers and effectively improving small-object performance (Liu et al., 2016). RetinaNet addresses the class imbalance in one-stage detection by introducing the Focal Loss, achieving accuracy comparable to two-stage methods while maintaining high speed (Lin et al., 2017a). CenterNet models detection as a keypoint detection problem, predicting object centers and size information for localization (Duan et al., 2019).
In the domain of rock block detection, deep learning methods have demonstrated significant advantages. Researchers have developed specialized systems based on various architectures. Faster R-CNN–based methods excel in localization accuracy but are relatively slow at inference, making real-time detection challenging. Improved YOLO-based methods strike a good balance between speed and accuracy and have become the mainstream choice for rock block detection (Jiao et al., 2019). Some studies further enhance performance by introducing attention mechanisms and multi-scale feature fusion. Nevertheless, there remains room for improvement when facing challenging scenarios such as complex geological environments, extreme illumination, and severe occlusions.
2.3 Evolution of the YOLO series
Since its initial proposal in 2015, the YOLO family has undergone multiple iterative optimizations, with each version introducing innovations targeting specific challenges (Redmon et al., 2016). YOLOv1 pioneered viewing object detection as a single regression problem, dividing the image into grids and predicting bounding boxes and class probabilities to achieve real-time detection. Despite clear speed advantages, YOLOv1 had limitations in localization accuracy and small-object detection. YOLOv2 introduced improvements including batch normalization, a high-resolution classifier, and anchor boxes, substantially boosting accuracy. It also used dimension clustering to automatically determine anchor sizes, and multi-scale training to enhance scale adaptability.
YOLOv3 adopted Darknet-53 as the backbone and introduced a FPN architecture to make predictions at three different scales, greatly improving multi-scale detection performance (Redmon and Farhadi, 2018). The use of residual connections enabled training deeper networks and further strengthened feature representation. YOLOv4 integrated numerous optimizations, including the CSPDarknet53 backbone, spatial pyramid pooling (SPP), and PANet, as well as training strategies such as Mosaic data augmentation and DropBlock regularization, achieving state-of-the-art performance on the COCO dataset at the time (Bochkovskiy et al., 2020).
Although not an official release, YOLOv5 gained wide popularity due to engineering quality and ease of deployment. It introduced practical features such as auto-anchor computation and adaptive image scaling, and offered model variants of different sizes to suit diverse applications. YOLOX proposed a decoupled head design that separates classification and regression tasks, and adopted the SimOTA label assignment strategy, further improving accuracy (Ge et al., 2021). YOLOv7 achieved notable gains in both speed and accuracy through the Extended-ELAN efficient layer aggregation network and compound model scaling (Wang et al., 2023).
YOLOv8 adopted an anchor-free design, introduced the C2f module and a decoupled head, and achieved strong performance across multiple benchmarks. Building on YOLOv8, YOLOv11 introduced further optimizations. Core innovations include replacing C2f with the C3K2 module to enhance feature extraction via deeper residual connections and improved fusion; introducing the C2PSA module to strengthen sensitivity to spatial positional information; and employing an improved detection head with deformable convolutions to adapt the receptive field shape. These advances enable YOLOv11 to further improve detection accuracy while maintaining high-speed inference.
2.4 Feature fusion techniques
Feature fusion is a key technique in object detection, improving performance by integrating feature information across different layers and scales (Lin et al., 2017b). The FPN constructs multi-scale feature representations through a top-down pathway with lateral connections, effectively combining low-level spatial details with high-level semantic information. This design enables the network to detect objects of corresponding sizes on feature maps at different scales, significantly improving multi-scale detection performance. Building on FPN, the PANet adds a bottom-up path to further strengthen the propagation of low-level features, allowing high-level feature maps to acquire richer detail information (Liu et al., 2018).
The Bidirectional FPN proposes a weighted bidirectional pyramid, introducing learnable weights to balance the contributions of features at different levels, while removing nodes with only a single input to improve efficiency (Tan et al., 2020). The weighted fusion mechanism allows the network to adaptively adjust the importance of different features, and the simplified network structure reduces computational cost. NAS-FPN leverages neural architecture search to automatically design pyramid structures; through reinforcement learning over a vast search space, it discovers optimal fusion architectures and achieves excellent performance on the COCO dataset (Ghiasi et al., 2019).
The ASFF method learns spatial weights across scales to achieve more effective fusion. ASFF learns a set of fusion weights for each spatial location, enabling the network to selectively fuse features from different levels according to the actual target conditions. This adaptive mechanism is particularly suitable for scenarios with drastic scale variations (Liu et al., 2019). The Recursive Feature Pyramid (RFP) introduces recursive connections and feedback so that features can flow and be refined multiple times within the pyramid, gradually improving feature quality (Qiao et al., 2021).
In rock block detection scenarios, feature fusion faces unique challenges. Rock blocks exhibit a wide range of scales—from large rocks at close range to small debris at a distance—requiring the network to effectively integrate multi-scale information. The complex textures of rocks demand low-level features to supply fine-grained texture details, while high-level features provide semantic understanding. Although existing fusion methods address these issues to some extent, there remains room for improvement in computational efficiency and in optimizing fusion strategies.
2.5 Loss function optimization
The design of loss functions has a decisive impact on object detection performance (Rezatofighi et al., 2019). Traditional detection losses typically comprise two parts: classification loss and localization loss. Classification losses optimize category predictions and commonly include cross-entropy loss and focal loss. Focal loss dynamically adjusts loss weights to effectively address class imbalance and the imbalance between easy and hard samples, significantly improving the performance of one-stage detectors (Lin et al., 2017a). Localization losses optimize bounding box regression; early approaches mainly used L1 or L2 losses, but these are inconsistent with the IoU evaluation metric.
IoU loss directly optimizes the intersection-over-union between the predicted and ground-truth boxes, aligning the loss with the evaluation metric. However, when the two boxes do not overlap, the IoU loss has zero gradient and cannot provide a valid optimization direction. Generalized IoU (GIoU) addresses this by introducing the concept of the smallest enclosing box, handling the zero-gradient issue for non-overlapping boxes while considering the relative positions of the prediction and ground truth (Rezatofighi et al., 2019). Distance IoU (DIoU) further considers the distance between box centers and accelerates convergence by minimizing the normalized center distance between the predicted and ground-truth boxes (Zheng et al., 2019).
Complete IoU (CIoU) adds an aspect ratio penalty term on top of DIoU, taking a more comprehensive view of bounding-box geometry (Zheng et al., 2019). Its formulation integrates three factors—overlap, center distance, and aspect ratio—to provide more precise guidance for bounding box regression. However, CIoU is relatively complex to compute and may lead to over-constraint in certain scenarios. Scylla IoU (SIoU) redefines the optimization direction of box regression by introducing an angle loss (Gevorgyan, 2022). In addition to distance loss, it incorporates shape loss and IoU loss, guiding the predicted box toward the ground-truth box in an angle-aware manner, which is particularly suitable for targets with large aspect-ratio disparities.
In rock block detection, the choice of loss function is especially important. Rock blocks often have irregular shapes, making it difficult for rectangular bounding boxes to accurately describe their contours; thus, the loss function must handle shape variability more flexibly. Their spatial distribution in images is uneven, with both dense and sparse regions, so the loss should adaptively adjust weights across regions. Moreover, high localization accuracy is required—particularly in applications such as geological disaster early warning—where even small positional deviations may lead to serious consequences.
3 Materials and methods
3.1 YOLOv11 architecture and limitations analysis
YOLOv11, as the latest iteration of the YOLO family, demonstrates outstanding performance in object detection tasks. Its overall architecture comprises three main components—the Backbone, Neck, and Head—and adopts an end-to-end design philosophy for efficient detection (Figure 1). The backbone leverages the CSPDarknet architecture and builds a deep feature extraction network with C3K2 modules, outputting three feature maps at different stages of feature extraction that correspond to large-, medium-, and small-scale object detection. The neck performs multi-scale feature fusion using C3K2 blocks and the C2PSA module, while the detection head adopts a decoupled design that separates classification and regression for independent processing.
However, despite YOLOv11’s strong performance on general object detection tasks, it exhibits notable limitations in the specific application scenario of rock block detection. First, while the original neck-level feature fusion structure effectively integrates multi-scale information, there is still room to further optimize computational complexity. The neck makes extensive use of C3K2 and C2PSA operations; although these components provide effective feature processing capabilities, they incur considerable computational overhead. On resource-constrained geological monitoring devices, this complexity affects deployment efficiency and real-time performance.
Second, when dealing with rock blocks that exhibit diverse morphologies and complex textures, YOLOv11’s feature fusion strategy leaves room for improvement. Standard convolutions in the neck cause partial semantic information loss during each step of spatial downsampling and channel expansion. While the original model’s fusion is effective, it lacks the efficient information-mixing mechanisms that more advanced convolutional techniques could offer. This is especially pertinent for small-sized rock blocks, where there is further potential to improve feature preservation and representation.
In addition, the CIoU loss adopted by YOLOv11 also shows shortcomings for rock block detection. The CIoU formulation is given in Equation 1:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{1}$$

Here, $b$ and $b^{gt}$ denote the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance between them, $c$ is the diagonal length of the smallest box enclosing both, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency, and $\alpha = \frac{v}{(1 - IoU) + v}$ is its trade-off weight. Because $v$ couples width and height into a single aspect-ratio term, CIoU cannot adjust the two dimensions independently and provides no guidance on the direction of regression, which limits its precision on irregularly shaped rock blocks and motivates the SIoU loss adopted in Section 3.3.
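For reference in the later comparison with SIoU, the following is a minimal PyTorch sketch of the CIoU computation in Equation 1. The function name and the (cx, cy, w, h) box layout are our own illustrative choices, not the implementation used by YOLOv11.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Minimal CIoU loss (Equation 1); boxes are (..., 4) tensors in (cx, cy, w, h)."""
    # Corner coordinates of predicted and ground-truth boxes.
    p_x1, p_y1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    p_x2, p_y2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    t_x1, t_y1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    t_x2, t_y2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2

    # IoU term.
    inter = ((torch.min(p_x2, t_x2) - torch.max(p_x1, t_x1)).clamp(min=0)
             * (torch.min(p_y2, t_y2) - torch.max(p_y1, t_y1)).clamp(min=0))
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter + eps
    iou = inter / union

    # Normalized center distance: rho^2 / c^2.
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    c2 = ((torch.max(p_x2, t_x2) - torch.min(p_x1, t_x1)) ** 2
          + (torch.max(p_y2, t_y2) - torch.min(p_y1, t_y1)) ** 2 + eps)

    # Aspect-ratio consistency v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan(target[..., 2] / (target[..., 3] + eps))
                              - torch.atan(pred[..., 2] / (pred[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```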
3.2 Slim-Neck lightweight feature fusion architecture
To alleviate the computational overhead of the original YOLOv11 neck and enhance feature fusion efficiency, we integrate a Slim-Neck lightweight feature fusion architecture based on GSConv and VoV-GSCSP (Li et al., 2024). Although Slim-Neck has proven effective in general object detection tasks, its application in rock block detection has not yet been explored. In this work, we investigate integrating Slim-Neck components into a YOLOv11 neck architecture tailored for rock block detection, aiming to significantly reduce computational complexity while maintaining or improving detection performance through more efficient convolutional operations and optimized feature aggregation strategies.
3.2.1 GSConv module design
GSConv, the core component of Slim-Neck, combines the strengths of standard convolution (SC) and depthwise separable convolution (DSC). Standard convolution maximizes inter-channel connectivity but incurs high computational cost; depthwise separable convolution is computationally efficient but completely severs inter-channel connections, weakening feature expressiveness. GSConv strikes a balance between the two via a hybrid strategy, as shown in Figure 2.
Figure 2. GSConv module architecture. This module combines standard convolution and depthwise separable convolution branches, achieving efficient feature mixing through concatenation and shuffle operations.
The implementation of GSConv consists of three steps: First, the input features are processed in parallel by the standard convolution and depthwise separable convolution branches. The SC branch preserves dense inter-channel connectivity and captures rich semantic information; the DSC branch extracts spatial features at lower computational cost. Second, the outputs of the two branches are concatenated along the channel dimension. Finally, a shuffle operation evenly diffuses the SC features into every part of the features produced by the DSC branch, enabling comprehensive information mixing.
The shuffle operation is the key innovation of GSConv. Through a uniform mixing strategy, it enables thorough interaction between features from different branches. It is implemented as a linear channel-rearrangement operation, which is computationally inexpensive and widely supported on devices capable of convolutional computation.
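To make the three-step construction and the shuffle concrete, below is a minimal PyTorch sketch of a GSConv block, modeled on the public Slim-Neck reference design (Li et al., 2024). The helper names (ConvBnAct, GSConv) and the depthwise kernel size are our assumptions; note that in this design the DSC branch consumes the SC branch's output, so after the shuffle every output channel carries a mixture of both.

```python
import torch
import torch.nn as nn

class ConvBnAct(nn.Module):
    """Conv2d + BatchNorm + SiLU: the basic convolution unit used in YOLO necks."""
    def __init__(self, c1, c2, k=1, s=1, g=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, padding=k // 2, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class GSConv(nn.Module):
    """GSConv: the SC branch produces half the output channels, a depthwise (DSC)
    branch cheaply generates the other half, and a channel shuffle interleaves the
    two halves so SC information diffuses into every DSC channel."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2
        self.dense = ConvBnAct(c1, c_, k, s)        # step 1a: SC branch
        self.cheap = ConvBnAct(c_, c_, 5, 1, g=c_)  # step 1b: depthwise branch

    def forward(self, x):
        x1 = self.dense(x)
        x2 = torch.cat((x1, self.cheap(x1)), dim=1)  # step 2: concatenate halves
        # Step 3: shuffle, i.e., interleave the SC and DSC channel halves.
        b, c, h, w = x2.shape
        return x2.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```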
3.2.2 VoV-GSCSP module
Building on GSConv, we further incorporate the VoV-GSCSP module, an efficient cross-stage partial network architecture. VoV-GSCSP adopts a one-shot aggregation strategy that simultaneously aggregates the outputs of multiple GS bottlenecks to improve feature reuse while reducing computational complexity, as shown in Figure 3.
Figure 3. VoV-GSCSP module architecture. This structure employs cross-stage connections and one-shot aggregation of multiple GS bottlenecks to achieve efficient feature processing.
The design of VoV-GSCSP follows these principles: the input features are first split into two parts—one is directly forwarded to the output (cross-stage connection), and the other is processed through multiple cascaded GS bottlenecks. Within each GS bottleneck, GSConv replaces standard convolution, greatly reducing computational load. The outputs of all GS bottlenecks are aggregated via concatenation, then concatenated again with the cross-stage features, and finally fused and channel-adjusted using a GSConv operation.
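A minimal sketch of this structure is given below, reusing the GSConv and ConvBnAct modules from the previous sketch. The bottleneck composition and the number of stacked blocks (n = 2) are illustrative assumptions, not the exact configuration used in our model.

```python
import torch
import torch.nn as nn

class GSBottleneck(nn.Module):
    """GS bottleneck: two stacked GSConv layers plus a lightweight shortcut."""
    def __init__(self, c1, c2):
        super().__init__()
        self.body = nn.Sequential(GSConv(c1, c2 // 2, 1, 1), GSConv(c2 // 2, c2, 3, 1))
        self.shortcut = ConvBnAct(c1, c2, 1, 1)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class VoVGSCSP(nn.Module):
    """Cross-stage partial block with one-shot aggregation of GS bottlenecks."""
    def __init__(self, c1, c2, n=2):
        super().__init__()
        c_ = c2 // 2
        self.main = ConvBnAct(c1, c_, 1, 1)         # path through the bottlenecks
        self.skip = ConvBnAct(c1, c_, 1, 1)         # cross-stage connection
        self.blocks = nn.ModuleList(GSBottleneck(c_, c_) for _ in range(n))
        self.fuse = GSConv(c_ * (n + 1), c2, 1, 1)  # one-shot aggregation + channel adjust

    def forward(self, x):
        y = self.main(x)
        outs = [self.skip(x)]
        for blk in self.blocks:
            y = blk(y)
            outs.append(y)                          # collect every bottleneck output
        return self.fuse(torch.cat(outs, dim=1))    # aggregate once, fuse with GSConv
```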
3.3 SIoU loss function optimization
To address the limitations of the CIoU loss in rock block detection, we introduce the SIoU loss function (Gevorgyan, 2022). SIoU resolves CIoU’s gradient issues through an angle-aware distance loss and an independent shape loss, improving the accuracy and efficiency of bounding-box regression.
3.3.1 Angle-aware distance loss
The core innovation of SIoU is the introduction of angle information to guide bounding-box regression. Let the center of the predicted box be $(b_{cx}, b_{cy})$ and the center of the ground-truth box be $(b^{gt}_{cx}, b^{gt}_{cy})$. The angle $\alpha$ between the line connecting the two centers and the horizontal axis is characterized by Equation 2:

$$\sin\alpha = \frac{c_h}{\sigma}, \qquad \sigma = \sqrt{(b^{gt}_{cx} - b_{cx})^2 + (b^{gt}_{cy} - b_{cy})^2}, \qquad c_h = \left|b^{gt}_{cy} - b_{cy}\right| \tag{2}$$

where $\sigma$ is the distance between the two centers and $c_h$ is their vertical offset.
Based on this angle, SIoU defines an angle cost function as in Equation 3:

$$\Lambda = 1 - 2\sin^2\!\left(\arcsin(\sin\alpha) - \frac{\pi}{4}\right) \tag{3}$$

This function approaches 0 when the angle is close to 0° or 90°, and reaches its maximum value of 1 at 45°. With this design, SIoU adaptively adjusts the regression direction according to the relative positions of the predicted and ground-truth boxes, moving the bounding box along the shortest path.
The distance loss is defined as in Equation 4:

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right), \qquad \rho_x = \left(\frac{b^{gt}_{cx} - b_{cx}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b^{gt}_{cy} - b_{cy}}{c_h}\right)^2, \quad \gamma = 2 - \Lambda \tag{4}$$

where $c_w$ and $c_h$ here denote the width and height of the smallest box enclosing the predicted and ground-truth boxes. The angle-dependent factor $\gamma$ reduces the weight of the distance term when the angle deviation is large, so that optimization first aligns the regression direction and then minimizes the distance.
3.3.2 Shape loss design
Unlike CIoU, which uses the aspect ratio as a shape constraint, SIoU directly penalizes width and height discrepancies, as expressed in Equation 5:

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})} \tag{5}$$

Here, $w, h$ and $w^{gt}, h^{gt}$ are the widths and heights of the predicted and ground-truth boxes, and $\theta$ controls the degree of attention paid to the shape cost (typically set to 4).
3.3.3 Complete SIoU loss
The complete expression of the SIoU loss is given in Equation 6:

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{6}$$
This design takes IoU as the primary optimization objective while providing additional guidance through an angle-aware distance loss and a flexible shape loss. Compared with CIoU, SIoU offers the following advantages: the angle-aware mechanism makes bounding-box regression more efficient, especially when handling tilted or irregularly arranged rock blocks; an independent width–height loss avoids gradient conflicts and allows more flexible shape adjustment; lower computational complexity and improved training efficiency.
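To make Equations 2–6 concrete, below is a minimal PyTorch sketch of the SIoU loss under the definitions above. The function name, the (cx, cy, w, h) box layout, and the default θ = 4 are our own illustrative choices, and the angle-axis switching used in some reference implementations is omitted for brevity; read this as a sketch, not the exact training implementation.

```python
import math
import torch

def siou_loss(pred, target, theta=4.0, eps=1e-7):
    """Minimal SIoU loss (Equations 2-6); boxes are (..., 4) tensors in (cx, cy, w, h)."""
    # IoU term, computed from corner coordinates.
    p_x1, p_y1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    p_x2, p_y2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    t_x1, t_y1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    t_x2, t_y2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2
    inter = ((torch.min(p_x2, t_x2) - torch.max(p_x1, t_x1)).clamp(min=0)
             * (torch.min(p_y2, t_y2) - torch.max(p_y1, t_y1)).clamp(min=0))
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter + eps
    iou = inter / union

    # Angle cost (Equations 2 and 3): 0 near 0 or 90 degrees, 1 at 45 degrees.
    dx = target[..., 0] - pred[..., 0]
    dy = target[..., 1] - pred[..., 1]
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps
    sin_alpha = (dy.abs() / sigma).clamp(0, 1 - eps)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost (Equation 4), normalized by the enclosing-box size.
    c_w = torch.max(p_x2, t_x2) - torch.min(p_x1, t_x1) + eps
    c_h = torch.max(p_y2, t_y2) - torch.min(p_y1, t_y1) + eps
    gamma = 2 - angle  # de-emphasizes distance while the angle deviation is large
    distance = ((1 - torch.exp(-gamma * (dx / c_w) ** 2))
                + (1 - torch.exp(-gamma * (dy / c_h) ** 2)))

    # Shape cost (Equation 5): independent width and height penalties.
    omega_w = (pred[..., 2] - target[..., 2]).abs() / torch.max(pred[..., 2], target[..., 2])
    omega_h = (pred[..., 3] - target[..., 3]).abs() / torch.max(pred[..., 3], target[..., 3])
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    # Complete SIoU loss (Equation 6).
    return 1 - iou + (distance + shape) / 2
```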
3.4 YOLOv11-SNS: integrated architecture design
Building on the above improvements, we propose YOLOv11-SNS, an enhanced rock block detection model that synergistically combines the Slim-Neck lightweight feature fusion architecture with the SIoU loss function. The overall architecture is shown in Figure 4. The integrated design follows several key principles to ensure optimal performance in rock block detection scenarios.
3.4.1 Network architecture integration
The YOLOv11-SNS architecture retains YOLOv11’s robust backbone while implementing strategic modifications to the neck and loss computation components. The backbone continues to use the CSPDarknet architecture with the C3K2 module, preserving the original model’s powerful feature extraction capability. This decision is supported by extensive analyses showing that the backbone’s feature extraction has been well optimized for capturing the complex visual characteristics of rock blocks.
In the neck, we strategically integrate Slim-Neck components into the existing YOLOv11 neck architecture. First, we selectively replace standard convolutions along key feature-fusion paths with GSConv modules, substantially reducing computational overhead while maintaining representation quality. Second, we enhance critical feature aggregation nodes with the VoV-GSCSP module, enabling more efficient cross-stage feature propagation and improved feature reuse.
The detection head retains YOLOv11’s decoupled design. The classification branch continues to use cross-entropy loss for category prediction, and the network structure of the regression branch remains unchanged; however, during training the original CIoU loss is replaced with the SIoU loss. Through its angle-aware and independent shape optimization mechanisms, the SIoU loss provides more precise guidance for bounding-box regression on irregularly shaped rock blocks.
4 Discussion
4.1 Dataset
We use a public rock dataset from Roboflow containing 6,904 high-quality images of rock blocks that span diverse geological environments and rock types. The images are collected from different scenes to ensure diversity and representativeness for the rock block detection task. All images are precisely annotated with bounding boxes marking the locations and categories of rock blocks. The dataset is pre-split into a training set (6,602 images), a validation set (202 images), and a test set (100 images), providing a comprehensive evaluation framework while ensuring a balanced distribution of rock types and scenes across all subsets.
4.2 Experimental setup
The Slim-Neck + SIoU combination outperforms generic frameworks for rock blocks. Regular objects (e.g., vehicles) have stable semantics and geometries, so standard fusion and loss designs suffice; rock blocks, however, demand detail-preserving fusion (Slim-Neck) and a shape-adaptive loss (SIoU). This customization, rather than direct integration, is the study’s key novelty. The experiments are conducted on a modern computing platform to ensure training efficiency and reproducibility. The hardware configuration includes an NVIDIA GeForce RTX 3060 GPU with 12 GB of GDDR6 memory, providing ample compute for deep learning training. The software environment is based on Ubuntu 22.04, using the PyTorch 2.3.1 deep learning framework with Python 3.10, and CUDA 12.1 for GPU acceleration.
Training hyperparameters are optimized for the characteristics of rock block detection and hardware constraints. The configuration is as follows: the task type is set to object detection (‘detect’), the cache is disabled (cache = False) to ensure fresh data loading each epoch, and the input image size is standardized to 640 × 640 pixels (imgsz = 640), balancing computational efficiency and detection accuracy. The total number of training epochs is set to 300, allowing sufficient iterations for model convergence.
Training disables single-class detection (single_cls = False) to support multi-class rock block classification. The batch size is set to 16 (batch = 16), making optimal use of the 12 GB GPU memory while maintaining stable gradient updates. Data loading uses 4 worker processes (workers = 4) for efficient data pipelining. Training runs on GPU device 0 (device = “0”) and uses the SGD optimizer (optimizer = “SGD”) for stable and reliable convergence.
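The configuration above maps directly onto the Ultralytics training interface. The following is a minimal sketch of such a run; the model definition file (yolo11-sns.yaml) and dataset config (rocks.yaml) are hypothetical placeholders rather than artifacts released with this paper.

```python
from ultralytics import YOLO

# Hypothetical file names: "yolo11-sns.yaml" would hold the Slim-Neck-modified
# architecture and "rocks.yaml" the dataset paths and class names.
model = YOLO("yolo11-sns.yaml")

model.train(
    data="rocks.yaml",   # dataset configuration
    imgsz=640,           # input resolution, 640 x 640
    epochs=300,          # total training epochs
    batch=16,            # batch size
    workers=4,           # data-loading worker processes
    device="0",          # GPU device 0
    optimizer="SGD",     # SGD optimizer
    cache=False,         # reload data each epoch instead of caching
    single_cls=False,    # keep multi-class rock labels
)
```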
4.3 Evaluation metrics
To comprehensively evaluate the performance of the proposed YOLOv11-SNS model, we adopt standard measures from the object detection domain. The evaluation framework includes four key metrics that assess different aspects of detection performance.
Precision measures the accuracy of positive predictions, indicating the proportion of correctly identified rock blocks among all detected objects, as defined in Equation 7:

$$Precision = \frac{TP}{TP + FP} \tag{7}$$
Here, TP (true positives) denotes correctly detected rock blocks, and FP (false positives) denotes incorrectly detected objects.
Recall evaluates detection completeness, indicating the proportion of actual rock blocks that are correctly identified, as shown in Equation 8:

$$Recall = \frac{TP}{TP + FN} \tag{8}$$
Here, FN (false negatives) denotes rock blocks that should have been detected but were missed.
The F1 score provides a balanced evaluation by computing the harmonic mean of precision and recall, offering a single metric that considers both accuracy and completeness, as expressed in Equation 9:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{9}$$
Mean Average Precision (mAP@0.5) denotes the average precision over all classes at an IoU threshold of 0.5, providing a comprehensive assessment of detection performance, as defined in Equation 10:

$$mAP@0.5 = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{10}$$

where $N$ is the number of target classes and $AP_i$ is the average precision of class $i$. An IoU threshold of 0.5 means a detection is considered correct when the intersection-over-union between the predicted and ground-truth boxes exceeds 0.5.
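As a quick sanity check of Equations 7–9, the short Python snippet below computes precision, recall, and F1 from raw detection counts; the counts in the example are illustrative only and are not results from this paper.

```python
def detection_metrics(tp: int, fp: int, fn: int, eps: float = 1e-9):
    """Precision, recall, and F1 (Equations 7-9) from raw detection counts."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Illustrative counts only (not values from our experiments):
p, r, f1 = detection_metrics(tp=130, fp=40, fn=70)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")  # P=0.765  R=0.650  F1=0.703
```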
In addition, we record model parameters and computational complexity (GFLOPs) to evaluate deployment efficiency and practical applicability in resource-constrained geological monitoring environments. These metrics ensure a comprehensive assessment of both detection performance and computational efficiency, which are critical for real-world deployment of geological monitoring systems.
4.4 Ablation study
The ablation experiments aim to analyze the contribution of specific modules to overall performance by incrementally adding or removing them, thereby validating the effectiveness of each improved component. We designed four comparative experiments: the baseline model (original YOLOv11), introducing only the Slim-Neck module, replacing only the loss with SIoU, and the fully improved model combining both enhancements. The results are summarized in the table below (Table 1).
The results show that the baseline YOLOv11 model achieved an mAP@0.5 of 0.677 and an F1 score of 0.65 on the rock block detection dataset, with 2,590,035 parameters and a computational complexity of 6.4 GFLOPs. This baseline performance establishes a solid foundation, demonstrating the strong capability of YOLOv11 for subsequent improvements.
When only the Slim-Neck module was introduced, the model exhibited clear improvements. The mAP@0.5 increased to 0.682, a 0.74% gain over the baseline. The F1 score remained stable at 0.65, indicating consistent precision–recall balance. Although the parameter count increased to 2,740,435 due to the additional Slim-Neck components, the computational complexity remained at 6.4 GFLOPs, reflecting the efficiency of the lightweight design. This result validates that integrating Slim-Neck into YOLOv11 enhances rock block detection performance while maintaining computational efficiency.
The experiment with only the SIoU loss revealed interesting characteristics in optimization. The mAP@0.5 slightly decreased to 0.675 (−0.30% compared to the baseline), while the F1 score stayed at 0.65, the same as the baseline. This indicates that the SIoU loss provides different optimization dynamics than CIoU, with its potential advantages becoming more apparent when combined with other improvements. Importantly, the parameter count and computational complexity remained identical to the baseline, confirming that changing the loss function introduces no additional inference overhead.
The fully improved model (YOLOv11-SNS) demonstrated the synergistic benefits of combining both enhancements, achieving the best detection performance across multiple metrics. The mAP@0.5 reached 0.684, a 1.03% improvement over the baseline. Notably, the F1 score increased to 0.66, indicating a better balance between precision and recall. Due to Slim-Neck integration, the parameter count rose to 2,740,435, but the computational complexity remained unchanged at 6.4 GFLOPs, demonstrating the efficiency of the combined approach.
The synergy between the two improvements is particularly noteworthy. While individual components showed modest gains or trade-offs, their combination produced complementary benefits that enhance overall detection capability. The Slim-Neck module’s strengthened feature fusion works effectively with SIoU’s improved bounding-box regression strategy, achieving better detection accuracy without compromising computational efficiency.
Through detailed analysis of detection performance across various rock block characteristics, we observed the most significant gains in challenging scenarios. For small rock blocks occupying less than 2% of the image area, the full model exhibited improved detection consistency. For irregularly shaped rock blocks, the integration of the SIoU loss with enhanced feature representations provided more accurate bounding-box localization. In complex background scenes, the improved model showed enhanced feature discriminability, reducing false positives while maintaining a high true positive rate.
4.5 Comparative experiments
To comprehensively evaluate the performance of our proposed method, we compare the improved model with several mainstream object detection algorithms, including YOLOv5, YOLOv12, YOLOv13, and RT-DETR with a ResNet50 backbone. All comparison models are trained on the same dataset using consistent data augmentation strategies and training settings to ensure a fair comparison.
As a well-established detector, YOLOv5 achieves an mAP@0.5 of 0.658 on the rock block detection task with 2,508,659 parameters. It shows stable performance with an F1 score of 0.63 and a computational complexity of 7.2 GFLOPs.
YOLOv12, representing a major advancement in the YOLO family, achieves an mAP@0.5 of 0.679 with 2,568,243 parameters and a computational complexity of 6.5 GFLOPs. Compared with YOLOv5, YOLOv12 demonstrates improved efficiency while maintaining competitive performance, with an F1 score of 0.65.
YOLOv13 adopts an optimized architecture with 2,460,106 parameters, achieving an mAP@0.5 of 0.678 and an F1 score of 0.65. YOLOv13 maintains a computational complexity of 6.4 GFLOPs—identical to our baseline model—and shows improved performance and enhanced accuracy metrics compared with YOLOv5.
RT-DETR with a ResNet50 backbone attains an mAP@0.5 of 0.672 and an F1 score of 0.63. However, it requires 42,762,787 parameters and 130.5 GFLOPs, representing a significantly higher computational demand than YOLO-based approaches.
Our YOLOv11-SNS method delivers superior performance across all evaluation metrics. It achieves the highest mAP@0.5 of 0.684 and the highest F1 score of 0.66, while maintaining the same computational complexity as the baseline YOLOv11 (6.4 GFLOPs). The parameter count increases moderately to 2,740,435, compared with 2,590,035 for the baseline.
An analysis of computational efficiency shows that our method retains the same GFLOPs (6.4) as the baseline YOLOv11 while delivering improved performance. Despite its competitive accuracy, RT-DETR-ResNet50—due to its much higher computational cost (130.5 GFLOPs) and over 42 million parameters—faces deployment challenges in resource-constrained environments.
Within the YOLO series, YOLOv12 exhibits competitive performance with an mAP@0.5 of 0.679, followed closely by YOLOv13 with an mAP@0.5 of 0.678 and an F1 score of 0.65. Both YOLOv12 and YOLOv13 outperform YOLOv5, with YOLOv13 achieving particularly balanced results on accuracy metrics. Parameter counts across YOLO models are relatively comparable, and our method requires only a modest increase from the baseline’s 2,590,035 to 2,740,435 parameters.
To make the comparison of detection metrics more intuitive, Figure 5 presents a grouped bar chart of mAP@0.5 and F1 scores for all evaluated models. The visualization clearly shows the superiority of our YOLOv11-SNS method on both metrics.
As shown in Figure 5, our YOLOv11-SNS achieves the highest mAP@0.5 (0.684), a 1.03% improvement over the baseline YOLOv11 (0.677), and surpasses all other tested models. The F1 score also reaches 0.66—the highest among all models—indicating a better balance between precision and recall. Notably, while YOLOv12 and YOLOv13 show competitive performance (mAP@0.5 of 0.679 and 0.678, respectively), our method consistently outperforms them on both evaluation metrics. RT-DETR-ResNet50, despite its substantially higher computational complexity (about 42.7 million parameters), delivers lower performance with an mAP@0.5 of 0.672 and an F1 score of 0.63, further confirming that model complexity does not guarantee better performance in specialized tasks such as rock block detection.
The consistent improvements in both mAP@0.5 and F1 score indicate that our proposed enhancements—integrating Slim-Neck feature fusion and the SIoU loss—effectively boost detection capability without compromising the balance between precision and recall. This balanced improvement is crucial for real-world deployment in geological monitoring systems, where both accuracy and reliability are essential.
To provide a comprehensive visual comparison of model performance across multiple dimensions, Figure 6 presents a radar chart illustrating normalized performance metrics. This figure highlights the multidimensional advantages of our proposed YOLOv11-SNS, particularly its balanced accuracy metrics while maintaining computational efficiency.
Parameter count and GFLOPs are inverted for clearer visualization (lower values indicate better performance). F1 score and mAP@0.5 are normalized (higher values indicate better performance). Our YOLOv11-SNS exhibits outstandingly balanced performance across all dimensions.
The results show that our YOLOv11-SNS achieves superior performance compared with other tested models, with the highest mAP@0.5 of 0.684 and F1 score of 0.66, while maintaining computational efficiency comparable to the baseline. The radar chart further confirms the comprehensive advantages of our method in achieving an optimal balance between detection accuracy and computational efficiency.
4.6 Visual detection results
To intuitively illustrate performance differences across models, we present visual comparisons on representative test images containing rock blocks of various sizes, shapes, and arrangements. Figure 7 shows detection results from different models on a challenging rock surface image with densely distributed rock blocks.
The visualizations reveal several key observations about model performance. The ground-truth annotated image contains roughly 20 rock blocks of varying sizes. Our YOLOv11-SNS successfully detects most targets, demonstrating strong detection capability. YOLOv12 and YOLOv13 show similar coverage, detecting a comparable number of rock blocks. The baseline model also achieves reasonable detection completeness. YOLOv5 performs moderately, with some missed detections. In contrast, RT-DETR-ResNet50 suffers from severe over-detection, producing excessive overlapping bounding boxes and resulting in cluttered, impractical outputs that can be problematic in real-world applications.
A particularly notable advantage of our YOLOv11-SNS is its superior performance on small rock blocks. While YOLOv5 and the baseline tend to miss smaller targets, especially those occupying less than 2% of the image area, our method successfully detects these challenging small-scale rocks. This improvement can be attributed to the Slim-Neck module’s enhanced feature fusion, which better preserves fine-grained features across network layers. The ability to reliably detect small rocks is crucial for geological monitoring applications, where early detection of small fragments can signal potential instability or impending rockfall hazards.
Our method produces clean, well-localized bounding boxes that accurately delineate rock boundaries. The baseline, YOLOv12, and YOLOv13 generate similarly accurate boxes with good localization quality. YOLOv5 provides reasonable bounding box accuracy for its detected targets. The outputs of RT-DETR-ResNet50, however, are overwhelmed by redundant and overlapping boxes, making it difficult to distinguish individual rocks and indicating significant challenges in adapting transformer-based detectors to this specific application domain.
An analysis of the actual confidence scores displayed in the images shows that all YOLO-based models exhibit similar confidence distributions, mainly ranging from 0.3 to 0.8. Our YOLOv11-SNS has confidence scores comparable to YOLOv12 and YOLOv13, distributed within 0.3–0.8. The baseline shows a slightly wider range (0.3–0.9), while YOLOv5 tends toward the lower end of the spectrum (0.3–0.7). Notably, our method maintains stable confidence scores even for small rocks, indicating robust feature discriminability across different target scales.
A key advantage of our YOLOv11-SNS is that it yields clean, practical detection outputs. While maintaining detection performance comparable to YOLOv12 and YOLOv13, our method avoids the computational overhead of more complex architectures. The combination of improved small-object detection and sustained computational efficiency (6.4 GFLOPs) makes it particularly suitable for real-time monitoring systems. In stark contrast, RT-DETR-ResNet50 produces unusable outputs with severe over-detection despite its sophisticated architecture and substantially higher computational demands (130.5 GFLOPs).
These visual results corroborate the quantitative metrics in Table 2, indicating that our YOLOv11-SNS achieves competitive detection performance—especially for small targets—while maintaining the computational efficiency essential for deployment in resource-constrained geological monitoring systems.
5 Conclusion
This paper proposes an enhanced rock block detection method, YOLOv11-SNS, based on an improved YOLOv11 that integrates a Slim-Neck lightweight feature fusion module and the SIoU loss into the YOLOv11 framework. Experimental results show that our method achieves superior detection performance while maintaining computational efficiency, reaching an mAP@0.5 of 0.684 and an F1 score of 0.66, representing improvements of 1.0% and 1.5% over the baseline YOLOv11, respectively.
The study successfully integrates the Slim-Neck module, which leverages GSConv’s hybrid of standard and depthwise separable convolutions with channel shuffling to strengthen multi-scale feature fusion, adding only 150,400 parameters with well-controlled computational overhead. In addition, adopting the SIoU loss—with angle awareness and shape constraints—significantly improves bounding-box regression accuracy for irregular rock targets. Finally, systematic comparative experiments validate the overall advantages of YOLOv11-SNS over mainstream models such as YOLOv5, YOLOv12, YOLOv13, and RT-DETR.
Experiments demonstrate that the proposed method effectively improves detection performance while maintaining a computational load of 6.4 GFLOPs (identical to the baseline), indicating strong potential for practical deployment—especially in resource-constrained field environments, such as real-time monitoring based on optical imagery and visual perception tasks in advanced manufacturing processes.
Besides performance gains, this study provides useful guidance for addressing irregular, low-discriminability targets. Specifically, lightweight detail-preserving feature fusion and shape-adaptive regression loss work synergistically to tackle such challenges, offering a reference for specialized detection tasks in geotechnical engineering and related fields. The core value of this work lies in developing a practical, efficient rock block detection model and validating the synergy of the two technical elements mentioned above.
Future research will focus on the following: exploring more efficient feature fusion architectures to further enhance multi-scale perception; developing dynamic loss functions that adapt to different target characteristics; validating generalization on larger and more diverse datasets—including multispectral and high-resolution optical images; and advancing model lightweighting techniques for edge optical devices, such as quantization and neural network pruning, to promote practical applications in geological hazard early warning and mine safety monitoring.
Advances in rock block detection are of great significance for geological disaster prevention and safe mining operations. The proposed method achieves a favorable balance between detection accuracy and computational efficiency, providing a reliable and efficient visual perception solution for related optical inspection systems. As deep learning and optical imaging technologies continue to converge, we expect such methods to further advance intelligent monitoring and advanced manufacturing applications in the geoscience and mining industries.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
YY: Data curation, Resources, Writing – review and editing, Methodology, Writing – original draft. HY: Writing – review and editing. HL: Writing – review and editing. YZ: Writing – review and editing, Resources. YH: Writing – review and editing, Resources. ZW: Writing – review and editing, Resources.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
Authors YY, HY, HL, YZ, YH, and ZW were employed by Power China Chengdu Engineering Corporation Limited.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
Abbreviations: SNS, SlimNeck-SIoU; CNNs, Convolutional Neural Networks; YOLO, You Only Look Once; FPN, Feature Pyramid Network; PANet, Path Aggregation Network; ASFF, Adaptive Spatial Feature Fusion; GSConv, Ghost Shuffle Convolution; SC, standard convolution; DSC, depthwise separable convolution; VoV-GSCSP, Variety of View – GSConv Cross Stage Partial.
References
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). YOLOv4: optimal speed and accuracy of object detection. doi:10.48550/arXiv.2004.10934
Chen, J., Fang, Q., Zhang, D., and Huang, H. (2023). A critical review of automated extraction of rock mass parameters using 3D point cloud data. Intell. Transp. Infrastruct. 2, liad005. doi:10.1093/iti/liad005
Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., and Shan, Y. (2024). “YOLO-World: real-time open-vocabulary object detection,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA, USA: IEEE), 16901–16911. doi:10.1109/CVPR52733.2024.01599
Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., and Tian, Q. (2019). “CenterNet: keypoint triplets for object detection,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (Seoul, Korea (South): IEEE), 6568–6577. doi:10.1109/ICCV.2019.00667
Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: exceeding YOLO series in 2021. doi:10.48550/arXiv.2107.08430
Gevorgyan, Z. (2022). SIoU loss: more powerful learning for bounding box regression. doi:10.48550/arXiv.2205.12740
Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2019). “NAS-FPN: learning scalable feature pyramid architecture for object detection,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA: IEEE), 7029–7038. doi:10.1109/CVPR.2019.00720
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. doi:10.48550/arXiv.1311.2524
Goodman, R. E. (1995). Block theory and its application. Géotechnique 45 (3), 383–423. doi:10.1680/geot.1995.45.3.383
Gu, Z., and Cao, M. (2023). Analysis of unstable block by discrete element method during blasting excavation of fractured rock mass in underground mine. Heliyon 9 (11), e22558. doi:10.1016/j.heliyon.2023.e22558
Jiang, P., Ergu, D., Liu, F., Cai, Y., and Ma, B. (2022). A review of Yolo algorithm developments. Procedia Comput. Sci. 199, 1066–1073. doi:10.1016/j.procs.2022.01.135
Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L., Feng, Z., et al. (2019). A survey of deep learning-based object detection. IEEE Access 7, 128837–128868. doi:10.1109/ACCESS.2019.2939201
Li, H., Li, J., Wei, H., Liu, Z., Zhan, Z., and Ren, Q. (2024). Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J. Real-Time Image Process. 21 (3), 62. doi:10.1007/s11554-024-01436-6
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. (2017a). “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV) (Venice, Italy: IEEE), 2999–3007. doi:10.1109/ICCV.2017.324
Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017b). “Feature Pyramid networks for object detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI: IEEE), 936–944. doi:10.1109/CVPR.2017.106
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., et al. (2016). “SSD: single shot MultiBox detector,” in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, Vol. 9905 (Cham: Springer), 21–37. doi:10.1007/978-3-319-46448-0_2
Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). “Path aggregation network for instance segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Salt Lake City, UT, USA: IEEE), 8759–8768. doi:10.1109/CVPR.2018.00913
Liu, S., Huang, D., and Wang, Y. (2019). Learning spatial fusion for single-shot object detection. doi:10.48550/arXiv.1911.09516
Medley, E. W. (1994). The engineering characterization of melanges and similar block-in-matrix rocks (bimrocks). Ph.D. dissertation. Berkeley, CA: University of California, Berkeley.
Napoli, M. L., Barbero, M., and Scavia, C. (2021). Effects of block shape and inclination on the stability of melange bimrocks. Bull. Eng. Geol. Environ. 80 (10), 7457–7466. doi:10.1007/s10064-021-02419-8
Nikolaidis, G., and Saroglou, C. (2017). Engineering geological characterisation of block-in-matrix rocks. Bull. Geol. Soc. Greece 50 (2), 874. doi:10.12681/bgsg.11793
Qiao, S., Chen, L.-C., and Yuille, A. (2021). “DetectoRS: detecting objects with recursive feature Pyramid and switchable atrous convolution,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Nashville, TN, USA: IEEE), 10208–10219. doi:10.1109/CVPR46437.2021.01008
Redmon, J., and Farhadi, A. (2018). YOLOv3: an incremental improvement. doi:10.48550/arXiv.1804.02767
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). “You only look once: unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA (IEEE), 779–788. doi:10.1109/CVPR.2016.91
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: towards real-time object detection with Region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149. doi:10.1109/TPAMI.2016.2577031
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). “Generalized intersection over union: a metric and a loss for bounding box regression,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA, USA: IEEE), 658–666. doi:10.1109/CVPR.2019.00075
Tan, M., Pang, R., and Le, Q. V. (2020). “EfficientDet: scalable and efficient object detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA, USA: IEEE), 10778–10787. doi:10.1109/CVPR42600.2020.01079
Terven, J., Córdova-Esparza, D.-M., and Romero-González, J.-A. (2023). A comprehensive review of YOLO architectures in computer vision: from YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 5 (4), 1680–1716. doi:10.3390/make5040083
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2023). “YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Vancouver, BC, Canada: IEEE), 7464–7475. doi:10.1109/CVPR52729.2023.00721
Xu, Z., Liu, F., Lin, P., Shao, R., and Shi, X. (2021). Non-destructive, in-situ, fast identification of adverse geology in tunnels based on anomalies analysis of element content. Tunn. Undergr. Space Technol. 118, 104146. doi:10.1016/j.tust.2021.104146
Yang, Y., Wang, S., Zhang, M., and Wu, B. (2022). Identification of key blocks considering finiteness of discontinuities in tunnel engineering. Front. Earth Sci. 10, 794936. doi:10.3389/feart.2022.794936
Zhang, W., Zhang, W., Zhang, G., Huang, J., Li, M., Wang, X., et al. (2023). Hard-rock tunnel lithology identification using multi-scale dilated convolutional attention network based on tunnel face images. Front. Struct. Civ. Eng. 17 (12), 1796–1812. doi:10.1007/s11709-023-0002-1
Zhao, Z.-Q., Zheng, P., Xu, S., and Wu, X. (2019). Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30, 3212–3232. doi:10.1109/TNNLS.2018.2876865
Zhao, X., Zhang, W., Chen, J., Xu, Z., Zhang, Y., Yin, H., et al. (2025). Automatic identification of concealed dangerous rock blocks on high-steep slopes considering finite-sized discontinuity intersections. J. Rock Mech. Geotech. Eng. 17, 7093–7106. doi:10.1016/j.jrmge.2025.01.027
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2019). Distance-IoU loss: faster and better learning for bounding box regression. doi:10.48550/arXiv.1911.08287
Keywords: object detection, rock block detection, SIoU loss function, slim-neck, YOLOv11
Citation: Yang Y, Yang H, Li H, Zhou Y, Huang Y and Wang Z (2026) Enhanced rock block detection method based on an improved YOLOv11. Front. Earth Sci. 14:1741519. doi: 10.3389/feart.2026.1741519
Received: 07 November 2025; Accepted: 08 January 2026;
Published: 26 January 2026.
Edited by: Chong Xu, Ministry of Emergency Management, China

Reviewed by: Zarghaam Rizvi, GeoAnalysis Engineering GmbH, Germany; Changshuo Wang, Ningbo University, China
Copyright © 2026 Yang, Yang, Li, Zhou, Huang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yiming Yang, 2021110@chidi.com.cn; Hong Yang