METHODS article

Front. Comput. Sci., 11 February 2026

Sec. Computer Vision

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1763780

PrecisionMicro-DETR: enhancing small pulmonary nodule detection in CT scans with multi-scale feature fusion and lightweight design

  • 1. Shunde Hospital of Guangzhou University of Chinese Medicine, Foshan, China

  • 2. School of Business, Macau University of Science and Technology, Taipa, Macao SAR, China

  • 3. School of Electronic and Information Engineering, The Wuyi University, Jiangmen, China


Abstract

To address the common issue of insufficient accuracy in existing detection models when dealing with morphologically complex and minute pulmonary nodules, this study proposes an enhanced detection model called PrecisionMicro-DETR based on the RT-DETR architecture. The model introduces the Small-target-oriented Strengthened Feature Fusion (SSTF) module in the detection head to strengthen the extraction of subtle structural features. It also incorporates a Modulation Fusion Module (MFM) to effectively improve discriminative performance in areas with blurred boundaries between lesions and normal tissues. Additionally, a lightweight neck network based on SNI-GSConvE is introduced to reduce computational load while maintaining high accuracy. Experimental evaluation shows that PrecisionMicro-DETR achieves a mean average precision (mAP) of 94.9% on the publicly available Tianchi dataset. Its robustness and generalization ability in real diagnostic environments are further validated on clinical CT images from hospital PACS systems. This study provides a high-precision and efficient solution for CT pulmonary nodule detection, contributing to the clinical adoption of intelligent assisted diagnostic systems.

1 Introduction

Lung cancer is the leading cause of cancer-related deaths worldwide, with the highest incidence and mortality rates among all cancer types (Bray et al., 2018). According to data from the World Health Organization, approximately 1.8 million people die from lung cancer annually, accounting for a significant 18% of global cancer-related deaths each year (Zhang et al., 2024). Studies have confirmed that early screening can increase the five-year survival rate of lung cancer patients by approximately 60% (Henschke, 2001). Pulmonary nodules are a common early manifestation of lung cancer. Medically, they are defined as round or irregularly shaped lung lesions with a diameter not exceeding 3 centimeters (Karki, 2017). Clinical studies have demonstrated that the emergence of symptoms such as cough, chest pain, and hemoptysis often indicates malignant progression or cancerous transformation of pulmonary nodules (Mazzone and Lam, 2022). Therefore, achieving early detection and precise management at the pulmonary nodule stage is regarded as a primary strategy for halting the progression of lung cancer and reducing its incidence rate (Jin et al., 2023). Clinical data confirm that the earlier standardized treatment is initiated, the more beneficial it is for extending patients’ overall survival and improving their quality of life (Li et al., 2022). At present, the clinical detection of pulmonary nodules primarily relies on radiologists’ visual assessment of lung CT scan images. However, this method exhibits significant limitations. A single lung CT scan can generate 100–200 images with varying slice thicknesses. With the rapid increase in screening demands, physicians are required to process massive volumes of imaging data within limited time frames. Sustained high-intensity workloads not only exacerbate physical and mental stress among medical professionals but also elevate the risk of nodule misdiagnosis or oversight due to visual fatigue and cognitive overload. 
Furthermore, given the complexity and multi-layered nature of CT imaging, complete reliance on manual slice-by-slice interpretation is inefficient, prolongs diagnostic cycles, and may delay optimal treatment opportunities for patients. Consequently, developing a novel approach capable of assisting physicians in achieving accurate and efficient pulmonary nodule detection holds substantial clinical value and promising application prospects. This initiative also aligns with the national policy advocated by the National Health Commission to promote the integration of “artificial intelligence with healthcare.”

The advancement of deep learning technology has provided robust technical support for medical image analysis (Haq, 2022). Deep learning technology is fundamentally built upon a multi-layered neural network architecture. The core strength of this architecture lies in its ability to automatically extract high-level features and abstract representations from large-scale datasets, thereby eliminating the dependence on manual feature engineering required by conventional methodologies (Milletari et al., 2016). Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in the field of image recognition and have been widely adopted as fundamental network architectures, thereby laying a solid foundation for subsequent research. Marques et al. (2021) proposed a convolutional neural network-based method for classifying the malignancy degree of pulmonary nodules. Zuo et al. (2019) proposed a multi-resolution convolutional neural network (CNN) for classifying candidate pulmonary nodules. The task of pulmonary nodule detection faces challenges due to the small target volume, morphological variability, and blurred boundaries. Traditional convolutional neural networks (CNNs) exhibit certain performance limitations in handling such multi-scale features and small object detection. In contrast, the Transformer architecture, leveraging its powerful capability to model global dependencies, has demonstrated significant advantages in the field of computer vision and has gradually achieved prominent results in object detection tasks, thereby offering a new technological pathway for pulmonary nodule detection (Carion et al., 2020). Transformer architecture has demonstrated outstanding performance across various visual tasks. However, in the specific application scenario of pulmonary nodule detection, its practical effectiveness remains constrained by several inherent challenges. 
Due to the typically small size, diverse morphology, and blurred boundaries of pulmonary nodules, existing Transformer models still exhibit deficiencies in effectively capturing multi-scale contextual information and processing small targets. Furthermore, inherent limitations in computational complexity and inference speed make it difficult for such models to meet the stringent real-time requirements of clinical practice while maintaining high detection accuracy.

A multi-module collaboratively optimized architecture is proposed to address key challenges in detecting minute pulmonary nodules. To improve the accuracy of existing detection models in handling pulmonary nodules with complex morphology, small size, and ambiguous boundaries, this study proposed the PrecisionMicro-DETR model based on real-time detection Transformer model RT-DETR (Zhao et al., 2024). Its core contribution lies in the integration of three tailored modules, each designed to tackle a specific difficulty: (1) The Small-target-oriented Strengthened Feature Fusion module (SSTF) (Sunkara and Luo, 2022) helps alleviate the loss of fine-grained features in deep networks through lossless downsampling and high-resolution feature fusion; (2) The Modulation Fusion Module (MFM) (Deng et al., 2025) employs a dynamic weight allocation mechanism to adaptively fuse multi-scale features, thereby improving localization in regions with blurred boundaries between lesions and normal tissues; (3) The lightweight SNI-GSConvE Neck (Li et al., 2024; Li, 2024) reduces computational cost while maintaining feature alignment quality via soft nearest-neighbor interpolation and efficient convolutional units, addressing the practical need for real-time performance and lower resource overhead in clinical settings. Together, these modules form an end-to-end collaborative optimization framework, offering a balanced and effective approach for achieving both high detection accuracy and computational efficiency in medical imaging applications.

2 Related work

2.1 Detection methods based on convolutional neural networks

With the rapid development of deep learning technology, convolutional neural networks (CNNs) have achieved remarkable success in the field of medical image processing, promoting the application of various CNN-based object detection models in pulmonary nodule detection tasks. Among them, the YOLO series models have attracted widespread attention in pulmonary nodule detection research due to their ability to balance high accuracy and real-time performance. For example, the YOLOv5-CASP model proposed by Ji et al. (2023) enhances feature extraction and multi-scale fusion capabilities for small pulmonary nodules by introducing the CBAM attention mechanism, improving the ASPP module, and replacing standard convolutions with CoT modules, thereby significantly improving detection accuracy. However, the added modules also substantially increase computational complexity without fully considering model efficiency. The plug-and-play pulmonary nodule detection solution proposed by Tang et al. (2025) enhances the model’s perception of nodules of different sizes by constructing a multi-scale dual-branch attention mechanism, and designs a cross-layer aggregation module to mitigate detail loss during feature transmission, significantly improving the localization capability of small nodules while maintaining detection accuracy. The YOLO-MSRF model proposed by Wu et al. (2024) effectively improves the detection accuracy of small pulmonary nodules by introducing three key enhancements: a small-target detection layer, a multi-scale receptive field module, and efficient omnidirectional convolution.

Beyond the aforementioned end-to-end detection models, another widely adopted technical paradigm involves using pre-trained deep convolutional neural networks as feature extractors, combined with traditional machine learning classifiers for final diagnosis. For instance, Lanjewar et al. (2023) proposed a modified DenseNet201 model, which incorporates pooling and Dropout layers for a lightweight design to extract high-level features from CT images. These features are then refined using feature selection methods such as ETC and MRMR before being fed into various machine learning classifiers, reportedly achieving high classification accuracy on a specific dataset. Such hybrid approaches leverage the strengths of deep learning in feature representation and the high efficiency of classical machine learning algorithms. However, the feature extraction and classification stages are disconnected, preventing true end-to-end optimization. More importantly, their performance heavily depends on the effectiveness of feature selection, and they still suffer from issues such as feature loss and inadequate representation when dealing with morphologically complex and minute pulmonary nodules. Furthermore, most of these studies are validated on single, small-scale datasets, leaving their generalization capability and clinical interpretability insufficiently substantiated.

Although current research has made significant progress in improving the accuracy of pulmonary nodule detection, there remains a notable shortcoming in the synergistic optimization of accuracy and computational efficiency, making it difficult to meet the stringent computational resource requirements of clinical practice. Particularly given the characteristics of pulmonary nodule detection tasks—where small targets and multi-scale distribution coexist—existing methods still lack a comprehensive solution capable of simultaneously addressing both detection accuracy and operational efficiency.

Therefore, a major focus and ongoing challenge in current research is how to develop a detection model capable of achieving end-to-end collaborative optimization while possessing high detection accuracy, strong generalization capability, good clinical interpretability, and high computational efficiency. Bridging this gap is of significant importance for advancing the clinical adoption of intelligent pulmonary nodule detection technologies.

2.2 Transformer-based detection methods

The Transformer architecture leverages self-attention mechanisms to capture global dependencies in parallel, overcoming limitations in long-range modeling. Its exceptional scalability has established it as a foundational technology in both natural language processing and computer vision. With ongoing technological evolution, Transformer-based approaches are gradually being applied in the medical field, demonstrating potential to surpass traditional methods, particularly in medical image analysis (Carion et al., 2020). To address the morphological and positional complexity of pulmonary nodules in medical images, researchers have begun introducing adaptable modules such as deformable convolutions into the DETR framework to enhance its perception and localization capabilities for irregular targets (Han et al., 2023). Tang et al. (2025) proposed an enhanced pulmonary nodule detection algorithm based on RT-DETR, named LN-DETR. The proposed algorithm improves cross-scale feature fusion via newly-designed deep and shallow detail fusion layers, optimizes the computational load of the backbone network to reduce model size, and enhances contextual information by using efficient downsampling methods. Zhou et al. (2025) proposed a Transformer-based pulmonary nodule detection model named LN-DETR, which innovatively integrates three core modules: the PC-EMA module enhances feature extraction capability while optimizing computational efficiency through multi-scale attention and partial convolution mechanisms; the GS-CCFM module facilitates effective cross-scale feature fusion using grouped shuffle convolution; and the CTrans module further improves overall feature fusion through cross-channel attention. This model significantly enhances computational efficiency while maintaining detection accuracy.

Building upon existing Transformer models, these studies systematically address key issues in pulmonary nodule detection, such as inadequate multi-scale feature fusion and inefficient channel interaction, by designing multi-scale attention and cross-layer fusion modules, thereby achieving simultaneous improvements in detection accuracy and computational efficiency. However, such methods still require enhancement in capturing features of very small nodules, and their adaptability to nodules with complex morphologies needs further validation. More experiments are needed to support their practicality and robustness in real clinical environments.

3 Materials and methods

In the task of CT pulmonary nodule detection, the features of tiny targets are highly susceptible to being lost in deep networks, which is the primary cause of missed detection of small nodules. The goal is to significantly reduce computational complexity while maintaining feature alignment quality, ensuring that the model can still output accurate nodule boundary information under limited medical computing resources. To this end, PrecisionMicro-DETR was constructed, with its structure illustrated in Figure 1. In the head, the SPDConv module is introduced, which utilizes a lossless downsampling mechanism from space to depth. This maintains the high resolution of feature maps while expanding the receptive field, significantly enhancing the model’s ability to retain features of small pulmonary nodules and effectively addressing the feature loss problem of small targets in convolutional networks. In response to the diverse morphology of pulmonary nodules and their complex associations with surrounding tissues, traditional single-scale convolutions struggle to effectively capture multi-form features. The CSP-OmniKernel module is employed, leveraging a parallel architecture of local branches, large receptive field branches, and global attention branches. This enables multi-scale collaborative perception of the nodule’s own texture, associations with surrounding blood vessels, and overall regional context, improving recognition accuracy for irregular nodules and complex cases. During multi-scale feature fusion, irrelevant tissue information from features at different levels can easily lead to semantic conflicts. The MFM module utilizes a dynamic weight allocation mechanism to adaptively evaluate the importance of each feature branch. This suppresses background interference and enhances lesion features during the fusion process, significantly improving nodule localization accuracy in low-contrast lung tissue backgrounds. 
To address the conflict between feature alignment distortion and computational efficiency, the SNI-GSConvE module employs soft nearest-neighbor interpolation to calibrate feature responses and incorporates a lightweight dual-path convolutional unit.

Figure 1

3.1 Strengthen the integration of small target features module

To address the issue of insufficient feature representation capability in the baseline RT-DETR model for small target detection tasks, this study systematically improves the model from two dimensions: feature pyramid construction and cross-scale feature fusion. The existing model’s strategy of only fusing features from the P3-P5 levels results in the ineffective utilization of high-resolution features containing critical details. To overcome this limitation, this paper introduces a Space-to-Depth Convolution (SPDConv) module to perform deep enhancement on the P2-level features, which are rich in spatial details. By reorganizing spatial dimensional information into channel dimensions, this module expands the receptive field while maintaining the high spatial resolution of the feature maps, thereby enhancing semantic representation capability while preserving fine spatial structures and significantly improving the feature discriminability of small targets. The enhanced P2 features are fused into the P3 level through the lateral connection mechanism of the feature pyramid, enabling the model to obtain high-quality small target representations in the early stages of feature extraction.

3.1.1 SPDConv module

Since pulmonary nodules generally appear as small-scale targets in CT images, coupled with their morphological diversity and boundary ambiguity, traditional detection methods often suffer from insufficient feature extraction and missed detections. To address this limitation, this study adopts SPDConv (Space-to-Depth Convolution) (Sunkara and Luo, 2022), a network architecture specifically designed to optimize low-resolution inputs and small object detection tasks. The core structure of the SPDConv module consists of a space-to-depth layer followed by a non-strided convolutional layer. The space-to-depth layer restructures the spatial information of the feature map into the channel dimension, achieving lossless downsampling; the non-strided convolutional layer then compresses the number of channels while using its learnable parameters to efficiently integrate and transform these features. Figure 2 illustrates the process for an intermediate feature map X of size S × S × C1 (S is the spatial dimension, i.e., height and width, and C1 is the number of channels): interval sampling with a scale factor of 2 produces four sub-feature maps f(0,0), f(0,1), f(1,0), and f(1,1), each of shape (S/2, S/2, C1). The sub-feature maps are concatenated along the channel axis to create a new feature map X′ of shape (S/2, S/2, 4C1). Subsequently, X″ is obtained by applying a non-strided convolution with C2 filters to X′. The transformation from X to X″, involving changes in both size and channel count, is conceptually equivalent to performing a strided convolution with altered channel dimensions on the original feature map; the key distinction is that no pixel information is lost in the process.
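The space-to-depth step can be sketched independently of any framework; the NumPy helper below is an illustrative reimplementation (not the authors' code) and omits the subsequent non-strided convolution:

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Losslessly rearrange scale x scale spatial blocks into channels.

    x: feature map of shape (C, H, W); returns (C * scale**2, H//scale, W//scale).
    Every pixel survives -- only the layout changes, unlike a strided conv.
    """
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0
    x = x.reshape(c, h // scale, scale, w // scale, scale)
    x = x.transpose(2, 4, 0, 1, 3)  # (scale, scale, C, H/s, W/s)
    return x.reshape(c * scale * scale, h // scale, w // scale)
```

In PyTorch the same rearrangement is available as `nn.PixelUnshuffle`, after which a stride-1 convolution with C2 filters completes the SPDConv block.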

Figure 2

3.1.2 CSP-OmniKernel

The CSP-OmniKernel module (Cui et al., 2024) introduced in this study adopts a dual-path architecture that integrates feature preservation and feature enhancement. One branch performs deep feature transformation through the OmniKernel module, while the other branch preserves the original features. Both branches are adjusted via 1 × 1 convolutions before fusion, achieving a balance between maintaining original information and enhancing features. After preprocessing with a 1 × 1 convolution, the OmniKernel module processes features in parallel through three distinct branches. The local branch employs depthwise separable convolution for local modulation. The large branch utilizes multi-scale depthwise separable convolutions (1 × 31, 31 × 1, 31 × 31) to capture strip-shaped contextual information. The global branch achieves cross-domain global modeling through dual-domain channel attention and frequency-domain spatial attention modules. After summing and fusing the outputs from each branch, feature modulation is finalized via a 1 × 1 convolution. The structure is illustrated in Figure 3. Specifically, for the input feature map X, a 1 × 1 convolutional layer is first applied to perform channel adjustment, resulting in X′. As shown in Equation 1:

X′ = W1 ∗ X    (1)

Figure 3

Here, X is the input feature map, and W1 and W2 are the weights of the two 1 × 1 convolutions.

Next, the feature map X′ is divided into two parts: one part undergoes processing through the OmniKernel module (OKM), while the other retains the original information. The two are concatenated with weights α and 1 − α, then fused through a second 1 × 1 convolution to output the final feature map Y, as shown in Equation 2:

Y = W2 ∗ Cat(α · OKM(X′), (1 − α) · X′)    (2)

Here, α is the feature split ratio; OKM is the OmniKernel module, used to enhance local and global information; Cat denotes the concatenation operation along the channel dimension; and Y is the final output feature map of the module.
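Because a 1 × 1 convolution is just a per-pixel linear map over channels, Equations 1 and 2 can be sketched with plain matrix algebra. The snippet below is a minimal illustration; `okm` and the weight matrices are hypothetical stand-ins, not the actual OmniKernel:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 conv applies the same
    # channel-mixing matrix at every spatial position.
    return np.einsum('oc,chw->ohw', w, x)

def csp_omnikernel(x, w1, w2, okm, alpha=0.5):
    """Dual-path fusion sketch: Eq. 1 adjusts channels, then one branch is
    transformed by `okm` while the other keeps the adjusted features; the
    weighted branches are concatenated and fused by a second 1x1 conv (Eq. 2)."""
    xp = conv1x1(x, w1)                                           # Eq. 1
    fused = np.concatenate([alpha * okm(xp), (1 - alpha) * xp], axis=0)
    return conv1x1(fused, w2)                                     # Eq. 2
```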

The Dual-Domain Channel Attention Module (DCAM) enhances feature representation through dual-path processing in both the spatial and frequency domains. The spatial path extracts channel statistics using global pooling, while the frequency path performs spectral analysis via FFT/IFFT transformations, combined with 1 × 1 convolutions for feature restructuring. The Frequency Domain Spatial Attention Module (FSAM) then applies dual convolutional transformations to the DCAM output, strengthening key frequency components through spectral weighting, and finally restores the features to the spatial domain to improve detail reconstruction quality.
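The frequency path can be illustrated with a heavily simplified sketch: transform to the spectral domain, reweight components, and transform back. The magnitude-based gate here is a hypothetical stand-in for the module's learned 1 × 1 convolutions:

```python
import numpy as np

def frequency_attention(x):
    """Toy spectral reweighting: emphasize dominant frequency components.

    x: 2-D feature slice. FFT -> gate each component by its normalized
    magnitude -> inverse FFT back to the spatial domain.
    """
    spec = np.fft.fft2(x)
    gate = np.abs(spec) / (np.abs(spec).max() + 1e-8)
    return np.real(np.fft.ifft2(spec * gate))
```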

3.2 Modulation fusion module (MFM)

In the field of medical image analysis, challenges such as blurred boundaries and significant scale differences between lesion areas and normal tissues often arise, manifested as the dilution of semantic information during fusion and the difficulty of fixed fusion coefficients in adapting to dynamically changing imaging scenarios. To address these challenges, this paper introduces a lightweight Modulation Fusion Module (MFM) (Deng et al., 2025), whose structure is illustrated in Figure 4. This module dynamically allocates weights to different input branches during feature fusion, enhancing the interaction among multi-scale features and thereby enabling efficient integration of cross-level semantic information. The method demonstrates strong adaptability and robustness in detecting low-resolution, small-scale targets such as micro-lesions in CT images. First, the module uses a 1 × 1 convolution to align the features of each branch with the target dimension C. Specifically, for an input feature map Xi ∈ ℝ^(B × Ci × H × W) with Ci input channels: if Ci ≠ C, a single 1 × 1 convolution projects it into the C-dimensional space; otherwise, an identity mapping is applied to preserve the original features and avoid redundant computation. All N aligned feature maps are concatenated along the channel dimension to form the tensor F ∈ ℝ^(B × NC × H × W), which is then reshaped into the tensor F′ ∈ ℝ^(B × N × C × H × W) to preserve the independent semantics of each branch. Here, ℝ indicates that each tensor element is a real number (in practice, a floating-point value); B is the batch size, i.e., the number of samples processed simultaneously in one forward pass, which is crucial for parallel computation and efficient training; C denotes the number of channels, i.e., the depth of the feature map; and H and W are the height and width of the feature map, i.e., the number of feature points in the vertical and horizontal directions.

Figure 4

After alignment and reshaping, the global contextual representation g is computed, as shown in Equation 3:

g = GAP(F′)    (3)

Here, GAP stands for Global Average Pooling, and g represents the aggregated global context representation.

The weight vector w for each branch is generated by a two-layer MLP (performing dimensionality reduction followed by expansion). This facilitates adaptive learning of the relative importance of each branch's features according to the current context. The detailed computation is provided in Equation 4:

w = Softmax(W_up · δ(W_down · g))    (4)

Here, w represents the adaptive weight matrix; the subscripts down and up of W denote the dimensionality-reduction and dimensionality-expansion projections, respectively; δ is a nonlinear activation; and Softmax normalizes the weights across branches.

Finally, the MFM module achieves effective fusion of multi-scale semantics by performing a weighted summation of the weights w with the aligned features F′. The fused feature Y is obtained as shown in Equation 5:

Y = Σᵢ wᵢ · F′ᵢ  (i = 1, …, N)    (5)
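Equations 3 to 5 can be sketched end to end in NumPy. The helper below is illustrative only; the MLP weights are passed in directly, ReLU is assumed for the activation δ, and Softmax normalizes the branch weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mfm_fuse(branches, w_down, w_up):
    """Modulation-fusion sketch for N aligned branches of shape (C, H, W).

    Eq. 3: global average pooling gives the context vector g.
    Eq. 4: a two-layer MLP (reduce, then expand) plus softmax yields one
           weight per branch.
    Eq. 5: the branches are combined by weighted summation.
    """
    f = np.stack(branches)                            # (N, C, H, W)
    g = f.mean(axis=(2, 3)).ravel()                   # Eq. 3: (N*C,)
    w = softmax(w_up @ np.maximum(w_down @ g, 0.0))   # Eq. 4: (N,)
    return np.einsum('n,nchw->chw', w, f)             # Eq. 5: (C, H, W)
```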

3.3 SNI-GSConvE neck

To enhance the accuracy and robustness of CT pulmonary nodule detection under constrained real-time computational budgets, this study redesigns the neck structure of the model, with a focus on improving two core components: upsampling and feature fusion (Li et al., 2024; Li, 2024). These improvements are integrated into a novel module termed the Rethinking Features-Fused-Pyramid-Neck (RFPN). Traditional methods exhibit notable limitations: nearest-neighbor upsampling, by simply replicating low-frequency semantic information and directly overlaying it with high-frequency shallow textures, tends to induce feature misalignment and noise amplification. Meanwhile, commonly used CBS modules suffer from weak channel interaction capabilities and inefficient receptive field expansion under limited computational budgets, thereby constraining the quality of multi-scale feature fusion.

To address the upsampling stage, this paper introduces a soft nearest-neighbor interpolation method, whose mathematical expression is given by Equation 6:

Up(X) = γ · NN_r(X)    (6)

Here, NN_r is the nearest-neighbor upsampling operation with upsampling factor r, X is the input feature map, and γ is a soft scaling factor tied to r, used to calibrate feature responses and avoid cross-layer feature imbalance.

This method applies a soft scaling factor related to the upsampling factor after upsampling, performing regional normalization on high-level semantic features to calibrate their feature responses in a smooth manner. This strategy effectively mitigates cross-layer feature imbalance without introducing any trainable parameters, significantly improving the recall rate and boundary segmentation quality of small-sized pulmonary nodules while preserving detailed textures. The structural design is illustrated in Figure 5.
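A minimal sketch of Equation 6, assuming the soft factor γ = 1/r² (this particular choice preserves the summed response of each upsampled region; the factor actually used may differ):

```python
import numpy as np

def soft_nn_upsample(x, r=2):
    """Nearest-neighbour upsampling followed by a soft 1/r**2 rescaling,
    so the replicated pixels do not inflate the feature response.
    x: (C, H, W) -> (C, r*H, r*W); no trainable parameters involved.
    """
    up = x.repeat(r, axis=-2).repeat(r, axis=-1)  # plain nearest-neighbour
    return up / (r * r)                           # soft calibration factor
```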

Figure 5

During the feature fusion stage, this paper adopts the lightweight aggregation unit GSConvE-I, whose structure is illustrated in Figure 6. This module takes the output X from the previous stage as input and first performs channel compression and alignment through a 1 × 1 convolution, as shown in Equation 7:

X1 = Conv1×1(X)    (7)

Figure 6

Subsequently, the feature flow is split into two parallel processing paths: one path preserves the main information flow, while the other sequentially extracts deep features through a 3 × 3 standard convolution, a depthwise convolution, and the GELU activation function, as shown in Equation 8:

X2 = GELU(DWConv(Conv3×3(X1)))    (8)

Finally, cross-path information interaction is achieved through channel concatenation and shuffling, as illustrated in Equation 9:

Y = Shuffle(Cat(X1, X2))    (9)

Here, DWConv represents the depthwise convolution and Shuffle represents the channel shuffle operation.

This design significantly enhances channel interaction capability and effective receptive field while maintaining low computational complexity. It effectively suppresses feature aliasing and imaging artifacts, providing more discriminative multi-scale feature representations for the detection head.
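The cross-path interaction of Equation 9 boils down to channel concatenation plus a shuffle; the sketch below illustrates only that step (the convolutional sub-path of Equation 8 is abstracted away):

```python
import numpy as np

def channel_shuffle(x, groups=2):
    # Interleave channel groups so information crosses the two paths.
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def gsconve_fuse(x_main, x_deep):
    """Eq. 9 sketch: concatenate the identity path and the deep path along
    channels, then shuffle so their channels alternate."""
    return channel_shuffle(np.concatenate([x_main, x_deep], axis=0), groups=2)
```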

4 Experiments

Based on the Tianchi dataset, this study conducts a comprehensive comparison between the proposed detection method and existing algorithms, including mAP curve analysis and visual comparison of detection results. Subsequently, through systematic ablation experiments, the performance contributions of each core module of the model are validated. Furthermore, k-fold cross-validation experiments are conducted to reduce the bias and variance introduced by inappropriate data partitioning. Finally, real-world medical images from hospital databases are retrieved for testing.

4.1 Dataset and preprocessing

In this study, to evaluate the effectiveness of the model, we utilized a widely recognized public dataset for pulmonary nodule detection: the Tianchi Pulmonary Nodule Dataset (Cloud, 2017). The data were divided into training set (70%), validation set (15%), and test set (15%). Stratified sampling was employed to ensure that key dimensions such as nodule size, pathological type, and imaging characteristics maintained distributions consistent with the original dataset. The partitioning was conducted at the patient level to ensure that slices from the same patient did not span across subsets, thereby preventing data leakage and enhancing reproducibility. To validate the model’s practical application value in real clinical environments, this study specifically selected CT imaging data from 50 confirmed pulmonary nodule patients randomly retrieved from the hospital PACS system, ensuring data accuracy and reliability.
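A patient-level split of the kind described can be sketched as follows; `patient_of` and the helper name are hypothetical, and the stratification over nodule size, pathology, and imaging characteristics is omitted for brevity:

```python
import random

def patient_level_split(slice_ids, patient_of, seed=0, ratios=(0.70, 0.15, 0.15)):
    """Assign whole patients to train/val/test so that slices from one
    patient never span subsets (prevents data leakage).

    patient_of: mapping slice id -> patient id.
    Returns a mapping slice id -> subset name.
    """
    patients = sorted({patient_of[s] for s in slice_ids})
    random.Random(seed).shuffle(patients)
    cut1 = int(ratios[0] * len(patients))
    cut2 = int((ratios[0] + ratios[1]) * len(patients))
    subset = {p: ('train' if i < cut1 else 'val' if i < cut2 else 'test')
              for i, p in enumerate(patients)}
    return {s: subset[patient_of[s]] for s in slice_ids}
```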

4.2 Experimental setting and evaluation metrics

4.2.1 Experimental setting

During the model training phase, the experiments were conducted on a Windows 11 system using Python 3.11 and the PyTorch 1.30 framework. The hardware configuration consisted of an Intel(R) Core(TM) i5-14600KF CPU, an NVIDIA GeForce RTX 5060Ti GPU, and 16 GB of RAM. The network model parameters in this experiment were set as follows: the training epoch number was set to 300, and the batch size was set to 16.

Several commonly used evaluation metrics in the fields of medical imaging and object detection were employed to assess the performance of the proposed method: Precision (P), Recall (R), mean Average Precision (mAP), and the number of model parameters. The formulas for Precision and Recall are provided in Equations 10 and 11:

Precision = TP / (TP + FP)    (10)

Recall = TP / (TP + FN)    (11)

Here, TP (True Positive) denotes positive samples correctly detected by the model, FP (False Positive) denotes negative samples incorrectly detected as positive, and FN (False Negative) denotes positive samples missed by the model.

AP (Average Precision) is an indicator used to measure the detection accuracy of the model. It reflects the average performance across different categories by calculating the area under the Precision–Recall (P–R) curve. The calculation formula for AP is shown in Equation 12:

AP = ∫₀¹ Pre(Rec) dRec    (12)

Here, Pre is Precision, which refers to the proportion of samples predicted as positive by the model that are truly positive. Rec is Recall, which refers to the proportion of all true positive samples correctly predicted by the model.

The mAP (mean Average Precision) is the average of the AP values over all categories, obtained by summing the per-category AP values and dividing by the number of categories N, as shown in Equation 13:

mAP = (1/N) · Σᵢ₌₁ᴺ APᵢ    (13)
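Equations 12 and 13 can be sketched as follows. This is a simplified illustration that approximates the P–R area by trapezoidal integration over sampled (recall, precision) points; production evaluators (e.g., COCO-style tools) use interpolated precision instead.

```python
def average_precision(pr_points):
    """AP (Eq. 12): area under the precision-recall curve, approximated by
    trapezoidal integration over (recall, precision) sample points."""
    pts = sorted(pr_points)  # sort by increasing recall
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap

def mean_average_precision(ap_values):
    """mAP (Eq. 13): arithmetic mean of the per-category AP values."""
    return sum(ap_values) / len(ap_values)
```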

To further analyze the characteristics of the model, this study evaluates it from two dimensions: computational complexity and structural efficiency. The parameter count (Parameters) reflects the model capacity, while the number of Giga Floating-Point Operations (GFLOPs) represents the computational cost. For a single convolutional layer, the parameter count and FLOPs are given by Equations 14 and 15, respectively:

Params = k² · C_in · C_out    (14)

FLOPs = k² · C_in · C_out · H · W    (15)

where k is the convolutional kernel size, C_in is the number of input channels, C_out is the number of output channels, and H × W represents the spatial dimensions of the output feature map.
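A minimal sketch of Equations 14 and 15 for one convolutional layer follows. Note the counting convention is an assumption: bias terms are omitted, and FLOPs here counts one operation per multiply-accumulate, whereas some toolkits report twice this value.

```python
def conv_params(k, c_in, c_out):
    """Eq. 14: parameters of a k x k convolution = k^2 * C_in * C_out (bias omitted)."""
    return k * k * c_in * c_out

def conv_gflops(k, c_in, c_out, h_out, w_out):
    """Eq. 15: FLOPs = k^2 * C_in * C_out * H * W, reported in units of 1e9 (GFLOPs).
    Counts one FLOP per multiply-accumulate; some tools multiply by 2."""
    return k * k * c_in * c_out * h_out * w_out / 1e9
```

For example, a 3x3 convolution mapping 64 to 128 channels contributes 73,728 parameters, and on an 80x80 output map costs roughly 0.47 GFLOPs under this convention.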

The F1 Score is the harmonic mean of Precision and Recall, calculated as Equation 16:

F1 = 2 · P · R / (P + R)    (16)
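Equations 10, 11, and 16 amount to the following one-liners (a sketch; in detection, the TP/FP/FN counts come from matching predicted boxes to ground-truth boxes at a chosen IoU threshold):

```python
def precision(tp, fp):
    """Eq. 10: fraction of detections that are true nodules."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. 11: fraction of true nodules that are detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Eq. 16: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```

The harmonic mean penalizes imbalance: a detector with high precision but poor recall (or vice versa) cannot achieve a high F1, which is why the metric is emphasized for screening tasks where missed nodules are costly.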

4.3 Performance evaluation and loss analysis of the trained model

Figure 7 illustrates the performance evolution of the PrecisionMicro-DETR model over 300 training epochs. The model achieved its best performance on the mAP50 metric, rapidly converging to 0.9 and maintaining stability, demonstrating its outstanding capability in identifying positive samples.

Figure 7

The loss convergence characteristics of PrecisionMicro-DETR during the training process are shown in Figure 8. The curve indicates that the loss value declined rapidly in the initial training phase and then stabilized. The overall convergence trajectory is smooth and free of significant fluctuations, indicating a stable and well-optimized model training process.

Figure 8

4.4 Ablation experiments on PrecisionMicro-DETR

Based on the analysis of ablation experiments, as shown in Table 1, the proposed PrecisionMicro-DETR architecture achieves synergistic optimization of detection accuracy and computational efficiency in the task of CT pulmonary nodule detection. Regarding the core metric mAP50 for pulmonary nodule detection, the MFM module demonstrates a significant improvement in the localization accuracy of small nodules, increasing the baseline model’s performance from 0.932 to 0.947, thereby validating its effectiveness in enhancing feature representation capability within complex lung tissue backgrounds. The fully integrated PrecisionMicro-DETR further elevates mAP50 to 0.949, highlighting the synergistic enhancement effect of multiple modules in pulmonary nodule detection.

Table 1

| Model | mAP50 | Recall | F1-score | Parameters |
| --- | --- | --- | --- | --- |
| Baseline | 0.932 | 0.884 | 0.920 | 31,106,233 |
| +MFM | 0.947 | 0.904 | 0.940 | 19,709,204 |
| +RFPN | 0.944 | 0.886 | 0.923 | 19,580,180 |
| +SSTF | 0.94 | 0.895 | 0.932 | 20,488,980 |
| +SSTF+RFPN | 0.936 | 0.935 | 0.932 | 20,196,116 |
| +SSTF+MFM | 0.934 | 0.927 | 0.926 | 20,302,356 |
| +SSTF+MFM+RFPN (this paper: PrecisionMicro-DETR) | 0.949 | 0.942 | 0.946 | 20,009,492 |

Results of the ablation study.

Bolded text indicates the most advantageous value for that metric.

From the perspective of comprehensive detection performance, the F1-Score, as the harmonic mean of precision and recall, clearly reflects the contribution of each module to the completeness of pulmonary nodule detection. The MFM module yields the largest individual gain, raising the F1 value from the baseline of 0.920 to 0.940, primarily by improving the recall of small nodules from 0.884 to 0.904. When the three modules are fully integrated, the F1-Score peaks at 0.946 while recall rises significantly to 0.942. These results indicate that the model effectively reduces missed detections of pulmonary nodules while maintaining high-precision recognition, which is of great importance for reducing missed diagnosis rates in clinical practice.

The experimental data further reveal the synergistic mechanisms among the modules. The combination of MFM and RFPN yields superior results compared to other dual-module configurations, indicating functional complementarity between the two in feature fusion and feature pyramid optimization. In contrast, the relatively lower F1 score observed with the SSTF and MFM combination suggests that careful design is required for module compatibility. Regarding model efficiency, all improved schemes maintain steady growth in F1-Score while reducing the parameter count by over 35%, demonstrating that this optimization approach not only ensures model lightweighting but also enhances overall performance through more refined architectural design.

The experimental results of this study demonstrate that PrecisionMicro-DETR, through the organic integration of multiple modules, successfully overcomes the trade-off between accuracy and computational complexity that traditional models face in CT pulmonary nodule detection tasks. By significantly reducing model parameters while comprehensively improving detection performance, this characteristic provides a new technical pathway for developing clinically applicable pulmonary nodule-assisted diagnostic systems, holding substantial clinical application value.

Figure 9 visualizes the attention regions of the models in the ablation study through heatmaps, where warm tones (red) indicate high attention and cool tones (blue) indicate low attention. The analysis reveals that: (a) The attention of the baseline model RT-DETR-R34 is the most dispersed, with numerous highlighted yet non-lesion responses, indicating that its attention is easily distracted by surrounding tissues; (b) After introducing partial improvement modules, the model’s attention to lesion regions becomes more concentrated, but significant distracting responses still persist in the background, suggesting that the feature fusion and selection mechanisms are not yet fully developed; (c) The complete PrecisionMicro-DETR model (integrating modules such as SSTF and MFM) demonstrates highly focused attention on the lesion core and its edges, with background noise effectively suppressed. This demonstrates that the proposed multi-module collaborative mechanism can guide the model to allocate limited computational resources more precisely to discriminative lesion features, thereby achieving systematic improvement in detection performance while controlling computational complexity.

Figure 9

4.5 Comparison with state-of-the-art detectors

To comprehensively evaluate the performance of PrecisionMicro-DETR, this study conducted a horizontal comparison with multiple mainstream detection models. The baseline model in this study is RT-DETR-R34. In the comparative experiments, the same dataset split and training parameters were adopted to ensure fairness. PrecisionMicro-DETR demonstrated comprehensive competitive advantages, with the experimental results summarized in Table 2.

Table 2

| Model | Precision | Recall | mAP50 | mAP50-95 | Parameters | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv6 | 0.768 | 0.711 | 0.766 | 0.494 | 4,233,843 | 11.8 |
| YOLOv8 | 0.905 | 0.877 | 0.906 | 0.66 | 3,005,843 | 8.1 |
| YOLOv9 | 0.9061 | 0.85 | 0.89 | 0.55 | 60,797,222 | 266.1 |
| YOLOv10 | 0.843 | 0.852 | 0.878 | 0.68 | 2,694,806 | 8.2 |
| YOLO11 | 0.902 | 0.843 | 0.908 | 0.613 | 2,582,347 | 6.3 |
| YOLO12 | 0.876 | 0.794 | 0.84 | 0.607 | 2,508,539 | 5.8 |
| YOLOv13 | 0.794 | 0.668 | 0.741 | 0.469 | 2,448,090 | 6.2 |
| Faster R-CNN | 0.837 | 0.765 | 0.809 | 0.42 | — | — |
| SSD | 0.634 | 0.530 | 0.582 | 0.19 | — | — |
| RT-DETR-EfficientViT (Liu et al., 2023) | 0.8909 | 0.835 | 0.859 | 0.504 | 10,702,612 | 27.2 |
| RT-DETR-MobileNetV4 (Qin et al., 2024) | 0.944 | 0.896 | 0.934 | 0.678 | 11,310,292 | 39.5 |
| RT-DETR-R50 | 0.927 | 0.911 | 0.93 | 0.678 | 41,956,163 | 129.5 |
| RT-DETR-R34 | 0.96 | 0.884 | 0.932 | 0.687 | 31,106,233 | 88.8 |
| RT-DETR-R18 | 0.917 | 0.916 | 0.9351 | 0.7631 | 20,083,028 | 58.3 |
| PrecisionMicro-DETR (this paper) | 0.952 | 0.942 | 0.949 | 0.698 | 20,009,492 | 62.4 |

Comparative results of different models.

Compared to the YOLO series, the proposed method significantly outperformed YOLOv8 (0.906), YOLOv9 (0.89), and YOLOv10 (0.878) in terms of the mAP50 metric, while also surpassing YOLOv9, which has 60.79 million parameters, in computational efficiency. In comparison with traditional detection algorithms, PrecisionMicro-DETR achieved a 17.3% improvement over Faster R-CNN (0.809) and a 63.1% improvement over SSD (0.582) in mAP50, highlighting the advantages of the Transformer-based detection framework in the field of medical imaging.
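The quoted relative improvements can be reproduced directly from the mAP50 values in Table 2 (a trivial check; the helper name `relative_gain_pct` is illustrative):

```python
def relative_gain_pct(new, old):
    """Relative improvement of one mAP50 score over another, in percent."""
    return 100.0 * (new - old) / old

# 0.949 vs Faster R-CNN's 0.809 -> ~17.3%; vs SSD's 0.582 -> ~63.1%
```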

When compared to models within the same series, PrecisionMicro-DETR exhibited a unique balance of performance. Although RT-DETR R18 achieved a relatively high mAP50-95 score of 0.763, its mAP50 was 0.935, which is lower than the 0.949 achieved by the proposed method. More importantly, while maintaining high precision and a parameter count comparable to RT-DETR-R18, PrecisionMicro-DETR improved the recall rate from 0.916 to 0.942, which holds significant clinical importance for reducing the missed diagnosis rate of pulmonary nodules.

A comprehensive analysis indicates that PrecisionMicro-DETR successfully achieves a synergistic optimization of precision and efficiency. The overall improvement in key metrics such as mAP50 and recall rate, combined with a substantial reduction in parameter count and computational load, demonstrates that the proposed architectural enhancements effectively address the challenge of balancing small target recognition and computational efficiency in medical image detection. Compared to existing mainstream algorithms, the proposed method significantly reduces computational resource requirements while maintaining competitive detection accuracy, offering a more viable solution for the clinical deployment of CT pulmonary nodule detection.

The value of this study lies in proposing a detection framework that balances precision and efficiency, providing new technical insights for the field of medical image analysis. Future work will focus on further optimizing the model structure and validating its generalization capability on more medical image datasets.

4.6 K-fold cross-validation experiments

K-fold cross-validation systematically evaluates model performance across different data distributions by partitioning the dataset into multiple complementary subsets, providing more robust evaluation results than a single train-test split. The method maximizes the utilization of limited data while keeping each test set independent: every sample participates once in testing and K-1 times in training, which makes it particularly suitable for data-limited settings such as medical imaging. The results of this K-fold validation are presented in Table 3. Fold-1 achieved the best performance, with precision (95.2%), recall (94.2%), and mAP50 (94.9%) all reaching high levels, whereas Fold-4 exhibited relatively weaker performance, with noticeable declines across all metrics. Although the PrecisionMicro-DETR model performed strongly under favorable conditions (e.g., Fold-1), its stability across folds requires further improvement. The relatively poor performance of Fold-4 may be attributed to data quality issues or class imbalance, revealing potential weaknesses in the model and providing direction for subsequent optimization.

Table 3

| K-fold | Precision | Recall | mAP50 | F1-score |
| --- | --- | --- | --- | --- |
| Fold-1 | 0.952 | 0.942 | 0.949 | 0.946 |
| Fold-2 | 0.865 | 0.855 | 0.846 | 0.860 |
| Fold-3 | 0.878 | 0.858 | 0.867 | 0.868 |
| Fold-4 | 0.838 | 0.802 | 0.781 | 0.819 |
| Fold-5 | 0.916 | 0.911 | 0.925 | 0.913 |

K-fold cross-validation results on the TianChi dataset.

To investigate the reasons for the relatively lower performance of Fold-4 (mAP 78.1%), this paper additionally trained several mainstream detection models as baselines on the same data split (i.e., Fold-4 of the Tianchi dataset), including YOLOv5, RT-DETR-R18, RT-DETR-R34, and RT-DETR-R50. All models used identical train-validation-test splits and hyperparameter settings to ensure a fair comparison. As shown in Table 4, the mAP of every compared model on Fold-4 is significantly lower than its average performance on the other folds, while the proposed model (78.1%) still outperforms all compared models under this split. These results suggest that the performance fluctuation on Fold-4 stems primarily from the inherent difficulty of this particular data subset rather than from shortcomings unique to our model. Further analysis of the dataset indicates that this fold may contain more small and indistinct nodules, or nodules in more challenging locations (such as adjacent to the pleura or blood vessels), which inherently pose greater difficulty for any detection model.

Table 4

| Model | Precision | Recall | mAP50 | F1-score |
| --- | --- | --- | --- | --- |
| RT-DETR-R18 | 0.807 | 0.797 | 0.750 | 0.802 |
| RT-DETR-R34 | 0.812 | 0.811 | 0.760 | 0.811 |
| RT-DETR-R50 | 0.835 | 0.771 | 0.724 | 0.802 |
| YOLOv5 | 0.725 | 0.612 | 0.679 | — |
| This paper (Fold-4) | 0.838 | 0.802 | 0.781 | 0.819 |

Fold-4 cross-validation results on the TianChi dataset.

Bolded text indicates the most advantageous value for that metric.

5 Discussion

To systematically evaluate the clinical application value of PrecisionMicro-DETR detection results, this study further explores the advantages of this method through quantitative morphological analysis of detected nodules. Clinical imaging data randomly selected from the PACS system of Shunde Hospital, Guangzhou University of Chinese Medicine were used in the experiment to ensure the reliability of the detection results.

As shown in Figure 10, the proposed method accurately identifies nodules distributed across various regions of the lungs, including challenging locations such as those near the parietal pleura, with annotation regions highly consistent with the actual lesion morphology. In contrast, other comparative models exhibited varying degrees of missed or false detections in these complex cases.

Figure 10

According to the local enlarged view in Figure 11, PrecisionMicro-DETR achieved a detection confidence of 0.87 for small nodules, demonstrating the model’s excellent performance in pulmonary nodule detection tasks. Compared to other advanced detection models, the proposed method outputs significantly higher confidence scores for suspicious lesion regions. Such performance not only reflects the accuracy of the model’s judgments but also indicates its higher reliability in assisting clinical decision-making. However, the generalization ability of the proposed method remains to be validated. Subsequent studies will further evaluate the model’s practical generalization performance on multi-center datasets.

Figure 11

The performance of the model largely depends on the distribution of the training data. For example, in our K-fold cross-validation (Section 4.6, Table 3), we observed that when a particular fold (e.g., Fold-4) contains more small and indistinct nodules or nodules in special locations (such as a large number of subpleural nodules), the performance of all compared models (including baseline models) declines simultaneously. This clearly demonstrates that the inherent imbalance and challenging nature of the data are the primary factors contributing to performance fluctuations. Although our model still maintains a relative advantage in such cases, reducing sensitivity to specific data distributions remains key to improving robustness.

While PrecisionMicro-DETR achieves superior detection performance, particularly for small nodules, its parameter count warrants a balanced discussion regarding the trade-off between capability and efficiency. The design ethos intentionally prioritizes diagnostic sensitivity (recall) over minimalist design, given the clinical imperative to minimize missed nodules. Compared to the baseline RT-DETR-R34, our model reduces parameters by 35% while significantly improving all key metrics (Table 1), demonstrating a targeted and effective rather than bloated design. The introduced modules (SSTF, MFM, SNI-GSConvE) are themselves lightweight innovations that address specific weaknesses in small object detection. We argue that for a life-critical diagnostic aid operating on standard clinical hardware, this level of complexity is justified by the substantial gain in reliability. Nevertheless, we acknowledge several limitations that point to future improvements: (1) Model Compression Potential: While efficient, the architecture may still accommodate advanced compression techniques such as pruning or quantization for further streamlining. (2) Inference Speed in Real-Time Workflows: Although designed for clinical hardware, extensive latency benchmarks in end-to-end diagnostic pipelines are needed to ensure seamless integration. Future work will actively explore these directions to enhance practicality without compromising the hard-won clinical efficacy.

6 Conclusion

This study proposes a PrecisionMicro-DETR model designed to assist in detecting small lesions and alleviating the workload of clinicians. Through comparative experiments with different models, ablation studies, K-fold cross-validation, and validation with real CT images from the PACS system of Shunde Hospital, Guangzhou University of Chinese Medicine, the following conclusions are drawn:

  • Ablation experiments on the Tianchi dataset show that after integrating the small-target feature fusion module SSTF, the MFM module, and the SNI-GSConv Neck structure, the PrecisionMicro-DETR model achieved mAP50, Recall, and F1-Score values of 0.949, 0.942, and 0.946, respectively, representing improvements of 1.8, 6.6, and 2.9% over the baseline model. The gradual introduction of each module significantly enhanced the model’s ability to detect small pulmonary nodules, demonstrating that multi-module collaborative optimization is a key mechanism for performance improvement.

  • The proposed PrecisionMicro-DETR model for CT pulmonary nodule detection demonstrates comprehensive advantages in both accuracy and efficiency through comparative experiments. The experimental results indicate that the model achieved an mAP50 of 0.949 and a recall of 0.942, outperforming existing mainstream detection models, which holds significant value for reducing clinical missed diagnosis rates. In terms of model efficiency, the method requires only 20 million parameters, achieving comprehensive superiority in key metrics while maintaining a model complexity comparable to RT-DETR-R18. Its parameter count is only 32.9% of that of YOLOv9, demonstrating excellent computational efficiency. Comprehensive analysis confirms that the proposed architectural improvements effectively address the challenge of balancing small-target detection and computational resources in medical imaging.

  • K-fold cross-validation revealed certain performance fluctuations, with precision ranging from 83.8 to 95.2% and recall ranging from 80.2 to 94.2%. These fluctuations reflect the inherent complexity of medical imaging data and suggest room for improvement in the model’s adaptability to different data distributions. Despite these fluctuations, the model maintained stable high performance in most folds; in Fold-5, for example, all metrics remained above 91%, indicating robust baseline performance. By employing K-fold cross-validation, this study effectively addressed the limited scale of medical imaging data, maximizing data utilization while providing a more robust evaluation of model performance.

Future work will strive to collect and construct a multi-center, multi-device clinical CT image dataset, and employ domain adaptation or test-time augmentation techniques to enable the model to robustly adapt to varying imaging conditions across different hospitals.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

Ethics statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements.

Author contributions

JC: Project administration, Supervision, Methodology, Conceptualization, Visualization, Validation, Software, Writing – original draft, Formal analysis, Writing – review & editing, Resources, Funding acquisition, Investigation, Data curation. JZ: Writing – review & editing, Software. YL: Validation, Supervision, Writing – review & editing. FD: Writing – review & editing, Validation, Project administration. LF: Writing – review & editing, Data curation. HL: Writing – review & editing, Project administration, Data curation.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., and Jemal, A. (2018). Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424. doi: 10.3322/caac.21492

2. Carion, N., Massa, F., and Synnaeve, G. (2020). “End-to-end object detection with transformers” in European Conference on Computer Vision (Cham: Springer International Publishing), 213–229.

3. Cloud, A. (2017). Tianchi medical AI competition [season 1]: intelligent diagnosis of pulmonary nodules. Available online at: https://tianchi.aliyun.com/competition/entrance/231601/information (Accessed April, 2017).

4. Cui, Y., Ren, W., and Knoll, A. (2024). “Omni-kernel network for image restoration” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38 (Washington, D.C.: AAAI), 1426–1434.

5. Deng, M., Sun, S., Li, Z., Hu, X., and Wu, X. (2025). FMNet: frequency-assisted Mamba-like linear attention network for camouflaged object detection. arXiv preprint arXiv:2503.11030. Available online at: https://arxiv.org/pdf/2503.11030

6. Han, L., Li, F., Yu, H., Xia, K., Xin, Q., and Zou, X. (2023). BiRPN-YOLOvX: a weighted bidirectional recursive feature pyramid algorithm for lung nodule detection. J. Xray Sci. Technol. 31, 301–317. doi: 10.3233/XST-221310

7. Haq, I. U. (2022). An overview of deep learning in medical imaging. arXiv preprint arXiv:2202.08546. Available online at: https://arxiv.org/pdf/2202.08546

8. Henschke, C. I. (2001). Early Lung Cancer Action Project: overall design and findings from baseline screening. Cancer 89, 2474–2482. doi: 10.1002/1097-0142(20001201)89:11+3.0.CO;2-2

9. Ji, Z., Wu, Y., Zeng, X., An, Y., Zhao, L., Wang, Z., et al. (2023). Lung nodule detection in medical images based on improved YOLOv5s. IEEE Access 11, 76371–76387. doi: 10.1109/access.2023.3296530

10. Jin, H., Yu, C., Gong, Z., Zheng, R., Zhao, Y., and Fu, Q. (2023). Machine learning techniques for pulmonary nodule computer-aided diagnosis using CT images: a systematic review. Biomed. Signal Process. Control 79:104104. doi: 10.1016/j.bspc.2022.104104

11. Karki, R. F. A. (2017). Multiple pulmonary nodules in malignancy. Curr. Opin. Pulm. Med. 23, 285–289. doi: 10.1097/MCP.0000000000000393

12. Lanjewar, M. G., Panchbhai, K. G., and Charanarur, P. (2023). Lung cancer detection from CT scans using modified DenseNet with feature selection methods and ML classifiers. Expert Syst. Appl. 224:119961. doi: 10.1016/j.eswa.2023.119961

13. Li, H. (2024). “Rethinking features-fused-pyramid-neck for object detection” in European Conference on Computer Vision (Cham: Springer Nature Switzerland), 74–90.

14. Li, H., Li, J., Wei, H., Liu, Z., Zhan, Z., and Ren, Q. (2024). Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J. Real-Time Image Proc. 21:62. doi: 10.1007/s11554-024-01436-6

15. Li, R., Xiao, C., Huang, Y., Hassan, H., and Huang, B. (2022). Deep learning applications in computed tomography images for pulmonary nodule detection and diagnosis: a review. Diagnostics 12:298. doi: 10.3390/diagnostics12020298

16. Liu, X., Peng, H., and Zheng, N. (2023). “EfficientViT: memory efficient vision transformer with cascaded group attention” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (New York: IEEE), 14420–14430.

17. Marques, S., Schiavo, F., Ferreira, C. A., Pedrosa, J., Cunha, A., and Campilho, A. (2021). A multi-task CNN approach for lung nodule malignancy classification and characterization. Expert Syst. Appl. 184:115469. doi: 10.1016/j.eswa.2021.115469

18. Mazzone, P. J., and Lam, L. (2022). Evaluating the patient with a pulmonary nodule: a review. JAMA 327, 264–273. doi: 10.1001/jama.2021.24287

19. Milletari, F., Navab, N., and Ahmadi, S. A. (2016). “V-Net: fully convolutional neural networks for volumetric medical image segmentation” in 2016 Fourth International Conference on 3D Vision (3DV) (IEEE), 565–571. Available online at: https://arxiv.org/pdf/1606.04797

20. Qin, D., Leichner, C., and Delakis, M. (2024). “MobileNetV4: universal models for the mobile ecosystem” in European Conference on Computer Vision (Cham: Springer Nature Switzerland), 78–96.

21. Sunkara, R., and Luo, T. (2022). “No more strided convolutions or pooling: a new CNN building block for low-resolution images and small objects” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (Cham: Springer Nature Switzerland), 443–459.

22. Tang, S., Bao, Q., Ji, Q., Wang, T., Wang, N., Yang, M., et al. (2025). Improvement of RT-DETR model for ground glass pulmonary nodule detection. PLoS One 20:e0317114. doi: 10.1371/journal.pone.0317114

23. Tang, C., Zhou, F., Sun, J., and Zhang, Y. (2025). Lung-YOLO: multiscale feature fusion attention and cross-layer aggregation for lung nodule detection. Biomed. Signal Process. Control 99:106815. doi: 10.1016/j.bspc.2024.106815

24. Wu, X., Zhang, H., Sun, J., Wang, S., and Zhang, Y. (2024). YOLO-MSRF for lung nodule detection. Biomed. Signal Process. Control 94:106318. doi: 10.1016/j.bspc.2024.106318

25. Zhang, X., Yang, L., Liu, S., Cao, L., Wang, N., Li, H., et al. (2024). Interpretation of the 2022 global cancer statistics report. Chin. J. Cancer 46, 710–721. doi: 10.3760/cma.j.cn112152-20240416-00152

26. Zhao, Y., Lv, W., and Xu, S. (2024). “DETRs beat YOLOs on real-time object detection” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16965–16974. Available online at: https://openaccess.thecvf.com/content/CVPR2024/html/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024_paper.html

27. Zhou, D., Xu, H., Liu, W., and Liu, F. (2025). LN-DETR: cross-scale feature fusion and re-weighting for lung nodule detection. Sci. Rep. 15:15543. doi: 10.1038/s41598-025-00309-7

28. Zuo, W., Zhou, F., Li, Z., and Wang, L. (2019). Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access 7, 32510–32521. doi: 10.1109/access.2019.2903587

Keywords

CT images, multi-scale features, object detection, pulmonary nodule detection, RT-DETR

Citation

Chen J, Zhu J, Lin Y, Deng F, Fu L and Liao H (2026) PrecisionMicro-DETR: enhancing small pulmonary nodule detection in CT scans with multi-scale feature fusion and lightweight design. Front. Comput. Sci. 8:1763780. doi: 10.3389/fcomp.2026.1763780

Received

09 December 2025

Revised

26 January 2026

Accepted

27 January 2026

Published

11 February 2026

Volume

8 - 2026

Edited by

Xiaohao Cai, University of Southampton, United Kingdom

Reviewed by

Madhusudan Lanjewar, Goa University, India

Liying Han, Hebei University of Technology, China

Copyright

*Correspondence: Huilian Liao,
