- 1 Armed Police General Hospital Clinical College, Anhui Medical University, Hefei, Anhui, China
- 2 Department of Engineering, The University of Hong Kong, Hong Kong, China
Introduction: The advancement of medical robotic systems highlights the critical need for precise and high-quality visual data, particularly in low-quality imaging scenarios. This study explores the interdisciplinary physics underlying image fusion and analysis, addressing challenges such as integrating complementary features, handling dynamic range variations, and suppressing noise in real-world medical contexts.
Methods: We introduce the Multi-Scale Feature Adaptive Fusion Network (MFAFN) and the Dynamic Feature Refinement Strategy (DFRS), which leverage principles from computational and experimental physics to enhance imaging techniques. MFAFN applies multi-scale feature extraction, attention-based alignment, and adaptive fusion to improve spatial and spectral integration while preserving crucial details. Complementing this, DFRS employs saliency-based weighting, context-aware mechanisms, and dynamic normalization to refine feature importance and mitigate inconsistencies.
Results: This interdisciplinary approach bridges computational physics, non-linear systems, and technological development, delivering significant improvements in fusion quality metrics such as spatial consistency, edge retention, and noise suppression.
Discussion: Our findings contribute to advancing medical robotics by integrating novel physical principles into imaging methodologies, supporting sustainable innovations in healthcare technology.
1 Introduction
The development of low-quality image fusion and analysis technologies is pivotal for enhancing medical robot vision, especially in environments where imaging data is noisy, incomplete, or low in resolution (1). Medical robots rely heavily on visual information for navigation, diagnosis, and surgical precision, yet real-world conditions often lead to compromised image quality (2). This research field is essential not only for improving robot-assisted medical outcomes but also for enabling the deployment of robotic systems in under-resourced healthcare settings (3). By integrating advanced fusion techniques with robust analytical algorithms, researchers can optimize medical robots to process low-quality visual data effectively (4). This approach ensures reliability, safety, and accuracy, addressing the pressing need for intelligent systems capable of functioning under suboptimal conditions (5). The evolution of this technology has followed a trajectory from traditional image processing techniques to data-driven machine learning approaches and, more recently, to deep learning and pre-trained models, each with its own advantages and limitations (6).
Traditional methods based on symbolic AI and knowledge representation were the first attempts to address the challenges of low-quality image fusion and analysis (7). These methods employed rule-based algorithms to enhance images by applying predefined transformations, such as noise reduction, contrast adjustment, and edge detection (8). Techniques like histogram equalization and wavelet transforms were commonly used to improve image quality for analysis (9). These methods were limited by their deterministic nature and inability to adapt to varying image conditions (10). While they provided a foundation for understanding image enhancement, symbolic approaches often failed to handle complex or noisy datasets effectively (11). They lacked the capability to integrate multiple sources of low-quality images into a cohesive representation, making them insufficient for the demanding requirements of medical robotics (12).
To overcome the limitations of symbolic methods, data-driven approaches using traditional machine learning were introduced (13). These methods relied on supervised learning algorithms to fuse and analyze low-quality images, leveraging labeled datasets to train models for tasks like segmentation, feature extraction, and object recognition (14). Techniques such as Support Vector Machines (SVM) and Random Forests were employed to identify patterns in low-quality visual data, enabling medical robots to process and interpret images more effectively (15). These methods allowed for multi-modal image fusion, combining information from different imaging modalities, such as X-rays and MRIs (16). Despite their adaptability and improved performance over symbolic methods, data-driven approaches were constrained by their dependence on large, annotated datasets. The variability in medical imaging data further complicated model training, often resulting in limited generalization and suboptimal performance in unseen scenarios (17).
The advent of deep learning and pre-trained models has revolutionized low-quality image fusion and analysis for medical robot vision (18). Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) have demonstrated remarkable capabilities in enhancing and analyzing low-quality images (19). Pre-trained models like U-Net and ResNet have been fine-tuned for medical imaging tasks, achieving significant improvements in segmentation, anomaly detection, and image synthesis. Furthermore, these models enable end-to-end learning, integrating fusion and analysis in a unified framework. Deep learning also excels in processing multi-modal data, providing robust solutions for integrating information from disparate imaging sources (20). However, these methods come with challenges, such as high computational requirements and potential biases introduced during training. The dependency on large-scale, high-quality datasets for model training further restricts their applicability in resource-constrained environments. Despite these limitations, deep learning has set a new benchmark for low-quality image processing, paving the way for innovative applications in medical robotics.
Recognizing the constraints of existing methods, this study proposes a novel approach to low-quality image fusion and analysis tailored for medical robot vision. By combining lightweight deep learning architectures with advanced domain adaptation techniques, the proposed framework addresses the computational and generalization challenges of current methods. Incorporating unsupervised learning enables the model to adapt to unlabeled data, enhancing its applicability in diverse medical scenarios. The framework integrates multi-scale feature extraction with attention mechanisms to optimize image fusion and ensure accurate analysis under suboptimal conditions.
We summarize our contributions as follows:
• The framework introduces a novel lightweight deep learning architecture with multi-scale feature extraction for efficient low-quality image processing in medical robotics.
• It leverages unsupervised domain adaptation, ensuring adaptability and generalization across diverse imaging conditions without requiring extensive labeled datasets.
• Experimental results demonstrate significant improvements in image quality enhancement and diagnostic accuracy, achieving superior performance compared to traditional and existing deep learning methods.
2 Related work
2.1 Image fusion for low-quality inputs
Image fusion techniques have become essential for medical robot vision, particularly when dealing with low-quality images acquired under challenging conditions (21). These techniques combine information from multiple images or sensors to generate a single enhanced image that retains critical features for analysis (22). In medical robotics, the quality of visual input is paramount, as it directly influences decision-making, precision, and safety during surgical or diagnostic procedures (23). Traditional image fusion methods, such as multi-scale decomposition and intensity-hue-saturation (IHS) transformations, have been extensively utilized in medical imaging. These methods are computationally efficient and capable of preserving critical spatial and spectral information (24). However, their effectiveness diminishes when dealing with highly degraded or noisy inputs. Recent advances in deep learning have introduced neural network-based fusion methods, such as CNNs and GANs, which demonstrate superior performance in handling low-resolution or noisy images. For instance, neural fusion techniques can integrate complementary data from modalities like MRI and CT scans to enhance the interpretability of fused images (25). In medical robots, such fused images enable more accurate object recognition, obstacle avoidance, and navigation in complex environments. Despite these advancements, challenges persist, including computational complexity, real-time processing requirements, and the difficulty of fusing heterogeneous image sources. Future developments are likely to focus on lightweight and adaptive algorithms that cater to the specific demands of medical robotic systems (26).
2.2 Noise reduction in medical imaging
Noise reduction is a critical component in the analysis of low-quality images for medical robot vision (27). The presence of noise can obscure important details, leading to errors in diagnosis or surgical operations. Noise in medical imaging can arise from various sources, including sensor limitations, environmental interference, and motion artifacts (28). Addressing these challenges requires robust denoising algorithms tailored to medical applications. Traditional approaches, such as Gaussian filtering, median filtering, and wavelet thresholding, have been widely employed to suppress noise while preserving important image features (29). While these methods are effective for general applications, they often struggle with the trade-off between noise removal and the retention of fine details (30). Recent advancements in deep learning have introduced novel denoising architectures, such as autoencoders and transformer-based models, which achieve state-of-the-art results in preserving fine-grained details in noisy medical images (31). In the context of medical robots, these noise reduction techniques are particularly valuable for enhancing the accuracy of real-time image analysis. For example, robotic-assisted surgeries often rely on endoscopic or laparoscopic imaging, where low light and narrow fields of view exacerbate noise issues (26). Advanced denoising methods allow for clearer visualization of anatomical structures, improving the precision of robotic manipulations. However, achieving real-time denoising while maintaining high accuracy remains a significant challenge, prompting ongoing research into lightweight and hardware-accelerated solutions (32).
2.3 Deep learning for vision analysis
Deep learning has revolutionized the field of computer vision, offering unparalleled performance in tasks such as object detection, segmentation, and classification (33). Its application to medical robot vision has similarly transformed the capabilities of robotic systems in clinical settings. Convolutional neural networks (CNNs) and their variants have demonstrated remarkable success in analyzing low-quality medical images, providing robust solutions to challenges posed by noise, blur, and low resolution (34). In medical robotics, deep learning models are employed to identify and track anatomical landmarks, detect abnormalities, and guide robotic instruments with high precision (35). For instance, deep learning-based segmentation algorithms enable accurate delineation of organs and tissues in endoscopic images, even under poor lighting or occlusion (36). These models also play a critical role in ensuring the safety and efficacy of robotic procedures by detecting and compensating for errors in real time (37). Another emerging trend is the use of multi-task learning frameworks, which allow a single deep learning model to perform multiple vision-related tasks simultaneously, such as denoising, segmentation, and anomaly detection (38). This approach is particularly advantageous for medical robots, as it reduces computational overhead while ensuring comprehensive visual analysis (39). However, the deployment of deep learning models in medical robotics faces challenges, including the need for extensive labeled datasets, domain adaptation to diverse imaging conditions, and compliance with regulatory standards for clinical use (40).
3 Method
3.1 Overview
Image fusion is a critical process in the field of computer vision, aimed at combining relevant information from multiple images into a single, enhanced image. This technique finds extensive applications in areas such as medical imaging, remote sensing, and surveillance. The primary objective of image fusion is to integrate complementary features from different sources while preserving essential details and minimizing distortions.
This section outlines the methodology and contributions of our work in image fusion. In Subsection 3.2, we formalize the problem of image fusion and introduce the foundational principles underpinning our approach. This includes defining the mathematical framework for multi-source image analysis and fusion, emphasizing clarity and rigor, and mathematically analyzing the challenges associated with feature extraction, alignment, and noise suppression, as well as the inability of existing techniques to effectively handle high-dimensional data or maintain consistency across varying scales. Subsection 3.3 presents our novel image fusion model, the Multi-Scale Feature Adaptive Fusion Network (MFAFN), a new architecture designed to enhance feature integration while preserving critical spatial and spectral details. Subsection 3.4 introduces the complementary Dynamic Feature Refinement Strategy (DFRS). Our approach leverages state-of-the-art techniques in deep learning and optimization to achieve superior performance. By addressing gaps in existing methodologies, our work aims to set a new benchmark in the domain of image fusion.
3.2 Preliminaries
To enhance the clarity and reproducibility of our methodology, we provide a consolidated notation system and explicit definitions for all mathematical variables and operators used in this section. Let {I1, I2, …, IN} be a set of N input images, where each Ii ∈ ℝH×W represents an image with height H and width W. The goal of image fusion is to generate a single image F ∈ ℝH×W that preserves the most salient and complementary information from the inputs. The fusion operation is described by a function ℱ such that:

F = ℱ(I1, I2, …, IN).

Each image Ii is passed through a multi-scale encoder, producing a set of feature maps {Fi(1), …, Fi(L)}, where l denotes the scale level, and Fi(l) is the feature representation at that level. Feature extraction is denoted by Φ(·), alignment by A(·, ·), and attention modulation by a weighting operator Ai(l). The attention scores are computed per scale, and the aligned features are denoted F̂i(l). To ensure consistency in intensity across modalities, we apply a normalization function g(·). Feature importance is guided by saliency Sal(·), computed via spatial gradients ∇F, and evaluated with metrics such as the structural similarity index (SSIM), entropy H(·), and edge-preservation criteria. Learnable weights wi(l) control the contribution of each image at scale l, with the constraint:

∑i wi(l) = 1.
These notations are consistently used throughout the remainder of this section to define and analyze each computational stage in our proposed fusion framework.
Image fusion is a process of combining multiple images from different sources into a single image that retains the most relevant and complementary information. Let {I1, I2, …, IN} represent a set of input images from different modalities or sensors, where each has height H and width W. The goal is to produce a fused image F ∈ ℝH×W that incorporates the salient features of all input images while reducing redundancy and preserving spatial and spectral information.
The process of image fusion can be formulated as an optimization problem. Let ℱ be the fusion function such that:

F = ℱ(I1, I2, …, IN),

where ℱ combines relevant features from I1, I2, …, IN. The objective of ℱ is to maximize information content and minimize artifacts:

ℱ* = arg minℱ ℒ(F),

where ℒ is a loss function that measures the quality of the fused image F based on metrics such as structural similarity, entropy, or gradient consistency.
To effectively integrate information from multiple images, multi-resolution analysis is often employed. Given an input image Ii, we decompose it into a set of resolution levels using a transform such as wavelets or Laplacian pyramids. Each level corresponds to a particular spatial or frequency scale:

Ii(l) = Tl(Ii), l = 1, …, L,

where Tl denotes the transformation operator for level l. The fusion process then integrates features across all levels:

F(l) = ℱ(I1(l), I2(l), …, IN(l)),

and the fused image F is reconstructed as:

F = T−1(F(1), F(2), …, F(L)),

where T−1 denotes the inverse transformation.
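For illustration, this classical multi-resolution pipeline can be sketched in a few lines of PyTorch using a Laplacian pyramid as the transform Tl. The three-level decomposition, the max-absolute-coefficient rule for the detail bands, and the averaging of the base band are illustrative choices for this sketch, not the fusion rule proposed in this paper.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img: torch.Tensor, levels: int = 3):
    """Decompose a (B, C, H, W) image into a Laplacian pyramid, i.e. the levels T_l(I)."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(current, kernel_size=2)                    # coarser scale
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)                                   # band-pass detail
        current = down
    pyramid.append(current)                                            # low-frequency residual
    return pyramid

def fuse_pyramids(pyramids):
    """Fuse level by level: detail bands by max-|coefficient|, base band by averaging."""
    fused, levels = [], len(pyramids[0])
    for l in range(levels):
        coeffs = torch.stack([p[l] for p in pyramids], dim=0)          # (N, B, C, H_l, W_l)
        if l < levels - 1:
            idx = coeffs.abs().argmax(dim=0, keepdim=True)
            fused.append(coeffs.gather(0, idx).squeeze(0))
        else:
            fused.append(coeffs.mean(dim=0))
    return fused

def reconstruct(pyramid):
    """Invert the decomposition: F = T^{-1}(F(1), ..., F(L))."""
    img = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        img = F.interpolate(img, size=detail.shape[-2:], mode="bilinear",
                            align_corners=False) + detail
    return img

# Example: fuse two single-channel 256 x 256 inputs I1 and I2.
I1, I2 = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
fused_img = reconstruct(fuse_pyramids([laplacian_pyramid(I1), laplacian_pyramid(I2)]))
```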
Key to the success of image fusion is accurate feature extraction and alignment. For each image Ii, we extract features Φ(Ii) using a suitable method:

Φ(Ii) = {f1(Ii), f2(Ii), …, fK(Ii)},

where fk(·) represents a feature extractor such as edge detection, texture analysis, or deep neural networks.
The alignment ensures that features across input images correspond spatially:

Φ̂(Ii) = A(Φ(Ii), Φ(Ij)),

where A is an alignment function that minimizes disparities between features in Ii and a reference image Ij.
Evaluating the quality of fusion involves several metrics. Spatial consistency requires that the fused image preserve the spatial details present in the input images, for example by maximizing the average structural similarity

(1/N) ∑i SSIM(F, Ii),

where SSIM is the structural similarity index. Information content requires that the fused image maximize the entropy H(F), where H(·) denotes entropy. Gradient consistency requires that edges in the fused image align with those of the input images, i.e., ∇F should agree with the dominant gradients among ∇I1, …, ∇IN.
Input images often contain noise, which can propagate during fusion. Let Ni represent the noise component of Ii. The fusion function must suppress the contribution of Ni to F rather than amplify it.
Input images may also have varying intensity ranges. A normalization step g(Ii) can be applied to ensure uniformity:

g(Ii) = (Ii − μ(Ii)) / σ(Ii),

where μ(Ii) and σ(Ii) denote the mean and standard deviation of Ii.
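The quantities above can be made concrete with a short sketch. The following PyTorch snippet implements the intensity normalization g(Ii), the histogram entropy H(·), and a simple gradient-consistency measure based on finite differences; the bin count and the max-gradient reference are illustrative assumptions, and SSIM would typically be taken from an external library rather than re-implemented here.

```python
import torch

def normalize(img: torch.Tensor) -> torch.Tensor:
    """g(I_i): zero-mean, unit-variance intensity normalization."""
    return (img - img.mean()) / (img.std() + 1e-8)

def entropy(img: torch.Tensor, bins: int = 256) -> torch.Tensor:
    """H(F): Shannon entropy of the intensity histogram (higher = more information)."""
    hist = torch.histc(img.float(), bins=bins, min=float(img.min()), max=float(img.max()))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log2()).sum()

def gradient_consistency(fused: torch.Tensor, inputs: list) -> torch.Tensor:
    """Mean absolute difference between the fused gradient magnitude and the strongest
    input gradient at each pixel (lower = better edge alignment)."""
    def grad_mag(x):
        gx = (x[..., :, 1:] - x[..., :, :-1]).abs()
        gy = (x[..., 1:, :] - x[..., :-1, :]).abs()
        return gx[..., :-1, :] + gy[..., :, :-1]
    target = torch.stack([grad_mag(i) for i in inputs]).amax(dim=0)
    return (grad_mag(fused) - target).abs().mean()
```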
3.3 Multi-scale feature adaptive fusion network (MFAFN)
To address the challenges of effectively combining complementary information from multiple input images while preserving their unique details, we propose the MFAFN. This model integrates advanced feature extraction, attention mechanisms, and multi-scale representations to achieve high-quality image fusion (as shown in Figure 1).

Figure 1. Overview of the multi-scale feature adaptive fusion network (MFAFN). The MFAFN architecture utilizes a multi-scale feature extraction mechanism, attention-driven feature alignment, and adaptive weighted fusion to effectively combine complementary information from multiple input images at different spatial resolutions. The model captures fine-grained details and global context at each scale, using a learnable attention mechanism to highlight the most relevant features. Adaptive fusion aggregates the aligned features from each input image to produce a high-quality fused output.
3.3.1 Multi-scale feature extraction
In this work, we introduce a novel multi-scale feature extraction mechanism that decomposes each input image into multiple feature levels, facilitating the simultaneous capture of both fine-grained details and global context (as shown in Figure 2). The primary objective of this approach is to create a set of feature maps at different spatial resolutions, which enables the model to process image information at varying scales and adapt to diverse spatial structures. To achieve this, each input image Ii is passed through a convolutional encoder network Φ, which progressively extracts multi-scale features at different levels of abstraction. The input image Ii is decomposed into a set of feature maps at L different scales as follows:

Φ(Ii) = {Fi(1), Fi(2), …, Fi(L)},

where Fi(l) ∈ ℝHl×Wl×C represents the feature map at scale l, with Hl and Wl denoting the spatial dimensions and C representing the channel depth at that particular scale. The spatial resolution decreases as the scale index l increases, allowing the model to capture both low-level local features and high-level global context. The multi-scale decomposition is achieved by applying a series of convolutional layers with progressively larger receptive fields. The encoder Φ consists of multiple stages El, each of which captures spatial features at a different scale by employing filters of varying kernel sizes. At each scale l, the feature map is computed using the following equation:

Fi(l) = El(Fi(l−1)),  with Fi(0) = Ii,

where El represents the encoder stage for the l-th scale, and the feature map Fi(l) is the output of applying a convolutional operation with a specific kernel size at that level. The encoder stages are designed to capture progressively more abstract and global features as the scale increases, ensuring that both fine-grained textures and high-level semantic information are adequately represented. The multi-scale features are not only generated through different spatial resolutions but also incorporate varying degrees of abstraction at each level. The lower scales focus on fine-grained details, such as edges and textures, while the higher scales capture more global patterns, such as shapes and overall scene structures. This multi-scale representation allows the network to adapt to different types of input images by leveraging features from both local and global contexts. In practice, each input image Ii is processed through these multi-scale feature extraction steps to build a comprehensive feature hierarchy, ensuring that the fusion network can effectively combine complementary information across scales. We apply downsampling operations such as pooling or strided convolutions in the intermediate layers of the encoder, which helps to reduce the spatial resolution and increase the receptive field of the network. This enables the model to capture a broader range of spatial information at higher scales, providing a more holistic view of the input image. The resulting multi-scale feature maps form the foundation for the subsequent stages of the MFAFN, where they will be aligned, fused, and used for reconstruction of the final fused image.

Figure 2. Multi-scale feature extraction in the MFAFN. The input image Ii is processed through a convolutional encoder network Φ that extracts multi-scale feature maps at different spatial resolutions. This decomposition allows the model to capture both fine-grained details and global context, with lower scales focusing on local features such as textures and edges, and higher scales capturing more abstract, global patterns like shapes and structures. The resulting multi-scale feature maps form the foundation for further feature alignment and fusion in the MFAFN.
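A minimal sketch of such a multi-scale encoder is given below. It follows the configuration reported in Section 4.2 (four 3 × 3 convolutional stages with batch normalization and GELU, channels doubling from 64 to 512), but the exact block composition and the stride-based downsampling are illustrative assumptions rather than the full MFAFN implementation.

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Sketch of the encoder Phi: each stage E_l halves the spatial resolution
    (strided convolution) and doubles the channel depth, producing the hierarchy
    Phi(I_i) = {F_i(1), ..., F_i(L)} of fine-to-coarse feature maps."""

    def __init__(self, in_channels: int = 1, base_channels: int = 64, levels: int = 4):
        super().__init__()
        self.stages = nn.ModuleList()
        c_in = in_channels
        for l in range(levels):
            c_out = base_channels * (2 ** l)
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3,
                          stride=1 if l == 0 else 2, padding=1),   # stride 2 = downsample
                nn.BatchNorm2d(c_out),
                nn.GELU(),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.GELU(),
            ))
            c_in = c_out

    def forward(self, x: torch.Tensor):
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)          # F_i(l); resolution shrinks as l grows
        return features

# Usage: four feature maps at 1x, 1/2, 1/4, and 1/8 of the input resolution.
feats = MultiScaleEncoder()(torch.rand(1, 1, 256, 256))
print([f.shape for f in feats])
```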
3.3.2 Attention-driven feature alignment
A central innovation in our method is the use of an attention mechanism applied to each scale of the extracted features, which helps align the multi-scale representations in a way that highlights important information while suppressing irrelevant details. The core idea is to compute a dynamic importance map at each scale l, reflecting the relevance of each feature map for the fusion task. This attention map is computed by applying a learnable weight matrix W to the feature map Fi(l) at scale l, followed by a softmax operation, as shown in the following equation:

Ai(l) = σ(W Fi(l)),

where Ai(l) is the importance map for the i-th input image at the l-th scale, and σ is the softmax activation function, which normalizes the output of the linear transformation. The softmax ensures that each importance map is a distribution over the spatial locations, with values ranging from 0 to 1, where higher values correspond to more important features. The weight matrix W ∈ ℝC×C is learned during training and enables the model to adaptively select the most relevant features based on the task at hand. Once the attention map is computed, the aligned feature map F̂i(l) for each input image is obtained by element-wise multiplication between the importance map Ai(l) and the corresponding feature map Fi(l):

F̂i(l) = Ai(l) ⊙ Fi(l),

where ⊙ denotes element-wise multiplication. This operation effectively filters the feature map by modulating its values based on the attention weights, thereby highlighting the most relevant features and suppressing less important ones. The attention mechanism ensures that each feature map contributes differently depending on its significance, allowing the fusion network to prioritize critical information across various scales. In practice, the attention maps are computed not only for the feature maps of individual input images but also for each scale, which allows the model to adjust its focus at different levels of abstraction. At higher scales, where the network captures more global structures, the attention maps may emphasize larger regions of the image, whereas at lower scales, which capture finer details, the attention maps may focus more locally, enhancing edge details and textures. This multi-scale attention ensures that both fine-grained features and global context are effectively integrated during the fusion process. To further improve the performance of the attention mechanism, we introduce a residual attention mechanism, where the aligned feature map is combined with the original feature map to retain both the enhanced and unaltered features:

F̂i(l) = Fi(l) + Ai(l) ⊙ Fi(l).
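The attention-driven alignment at a single scale can be sketched as follows. The 1 × 1 convolution plays the role of the C × C weight matrix W, the softmax is taken over spatial positions, and the residual form corresponds to the last equation above; the module granularity is an illustrative choice rather than the exact MFAFN implementation.

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    """Sketch of attention-driven alignment at one scale: a learnable 1x1 projection
    (the C x C weight matrix W) followed by a softmax over spatial positions yields
    the importance map A_i(l); the residual form keeps the unmodulated features."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # W in R^{CxC}

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        scores = self.proj(feat).flatten(2)                       # (B, C, H*W)
        attn = torch.softmax(scores, dim=-1).view(b, c, h, w)     # distribution over locations
        aligned = attn * feat                                     # A_i(l) ⊙ F_i(l)
        return feat + aligned                                     # residual attention

# Usage on a 128-channel feature map at scale l.
aligned = ScaleAttention(128)(torch.rand(2, 128, 64, 64))
```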
3.3.3 Adaptive weighted fusion at multiple scales
In the MFAFN model, a key innovation is the introduction of an adaptive weighted fusion strategy that intelligently aggregates the aligned features from multiple input images across different scales. This approach ensures that the most relevant information from each input is preserved and combined in a way that maximizes the quality of the fused output. The fusion mechanism operates at each scale l, where the fused feature map F(l) is computed by taking a weighted sum of the aligned feature maps from all N input images. The fusion equation is expressed as:

F(l) = ∑i wi(l) F̂i(l),

where F̂i(l) is the aligned feature map for the i-th input image at scale l, and wi(l) are learnable weights that determine the contribution of each image to the fused feature map at scale l. The weights are subject to the constraint that they sum to 1 for each scale l, ensuring that the fusion process remains normalized:

∑i wi(l) = 1.

These learnable weights allow the model to adaptively allocate more importance to certain input images at each scale based on their relevance for the current fusion task. By learning these weights during training, the MFAFN model can automatically emphasize the most informative features from each image, while down-weighting less relevant or noisy features. This adaptive weighting mechanism is crucial for tasks where some input images are more reliable or contain more salient information than others. The learnable weights are updated during the training process through backpropagation, enabling the model to optimize the fusion strategy for different types of input images. This approach is particularly effective when dealing with heterogeneous images, where different input sources may contain varying levels of detail, noise, or distortions. By allowing the model to adjust the fusion weights at each scale, we ensure that the fused feature maps are contextually optimized for the specific characteristics of the images. Once the feature maps have been fused at each scale, the resulting multi-scale fused feature maps are passed through a decoder to reconstruct the final fused image F. The decoder employs transposed convolutions to upsample and integrate the multi-scale features back to the original image resolution:

F = D(F(1), F(2), …, F(L)),

where D is a learnable decoder that combines the fused multi-scale features into a single output image. This reconstruction step ensures that the fused image captures both the fine-grained details and the global structures present in the input images.
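A compact sketch of the adaptive weighted fusion and the transposed-convolution decoder is shown below. The softmax parameterization of the weights wi(l) is one simple way to enforce the sum-to-one constraint during training, and the skip-style decoder is an illustrative stand-in for the full MFAFN decoder rather than its exact design.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Sketch of the per-scale fusion F(l) = sum_i w_i(l) * aligned_i(l). The raw
    weights are free parameters passed through a softmax so that they stay positive
    and sum to one, as required by the constraint in the text."""

    def __init__(self, num_inputs: int, num_scales: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_scales, num_inputs))  # learned via backprop

    def forward(self, aligned):                 # aligned[l][i]: (B, C_l, H_l, W_l)
        fused = []
        for l, feats in enumerate(aligned):
            w = torch.softmax(self.logits[l], dim=0)                     # sum_i w_i(l) = 1
            fused.append(sum(w[i] * f for i, f in enumerate(feats)))
        return fused                            # one fused map per scale

class SimpleDecoder(nn.Module):
    """Illustrative decoder D: upsample the coarsest fused map with transposed
    convolutions, adding the finer fused maps on the way back to full resolution."""

    def __init__(self, channels=(64, 128, 256, 512), out_channels: int = 1):
        super().__init__()
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(channels[l], channels[l - 1], kernel_size=2, stride=2)
            for l in range(len(channels) - 1, 0, -1)
        ])
        self.head = nn.Conv2d(channels[0], out_channels, kernel_size=1)

    def forward(self, fused):                   # fused[0] finest ... fused[-1] coarsest
        x = fused[-1]
        for up, skip in zip(self.ups, reversed(fused[:-1])):
            x = up(x) + skip                    # merge scales during reconstruction
        return self.head(x)
```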
3.4 Dynamic feature refinement strategy (DFRS)
The DFRS enhances the robustness and adaptability of the image fusion process in the MFAFN. DFRS addresses key challenges in image fusion, such as feature misalignment, noise propagation, and dynamic range inconsistencies, by incorporating innovative mechanisms that refine and adaptively enhance the fusion process (as shown in Figure 3). The strategy leverages feature normalization, dynamic saliency-based weighting, and context-aware refinement to optimize the quality of the fused image while maintaining essential details.

Figure 3. Dynamic feature refinement strategy (DFRS) in MFAFN. The DFRS enhances image fusion by addressing challenges like feature misalignment, noise, and dynamic range inconsistencies. It incorporates three key mechanisms: feature normalization and calibration, adaptive saliency-based weighting, and context-aware refinement. The feature normalization step ensures all images contribute equally by standardizing intensity variations, while the saliency-based weighting assigns dynamic importance to feature maps based on structural relevance. Context-aware refinement integrates global context to selectively enhance local features, ensuring high-quality fused images with preserved structural details.
3.4.1 Feature normalization and calibration
One of the critical challenges in multi-image fusion tasks is handling the variations in intensity and dynamic range across the input images, which may be captured under different lighting conditions or with different sensors (as shown in Figure 4). These variations can lead to inconsistencies and biases when combining features, resulting in poor fusion quality. To address this, DFRS introduces a feature normalization and calibration step that ensures all input images contribute equally to the fusion process, minimizing the impact of intensity discrepancies. This step first normalizes each input image Ii to a common scale by centering and scaling the pixel values based on the image's mean and standard deviation. The normalization process is given by:

Îi = (Ii − μ(Ii)) / σ(Ii),

where μ(Ii) and σ(Ii) represent the mean and standard deviation of the pixel values in the image Ii, respectively. This operation shifts and scales the pixel values to have zero mean and unit variance, thus eliminating any intensity bias across different input images. After normalization, all input images are on a comparable intensity scale, ensuring that no single image dominates the fusion process due to extreme intensity variations. To enhance the robustness of the fusion process in the presence of significant dynamic range differences between images, DFRS also introduces a calibration step that adjusts the images' intensity distributions. Calibration is performed using an adaptive method that adjusts the pixel values of each image based on its local context and global characteristics. A dynamic intensity mapping function ℳ(·) is applied to the normalized image Îi to compensate for discrepancies in dynamic range across different images:

Ĩi = ℳ(Îi),

where ℳ(·) is a learned or predefined intensity mapping function that takes into account the statistical properties of the input image set. This step ensures that images with higher dynamic ranges do not disproportionately influence the final fused result, allowing for a more balanced and representative fusion.

Figure 4. Feature normalization and calibration in DFRS. This step addresses intensity variations and dynamic range discrepancies across input images in multi-image fusion tasks. Each image is normalized to a common scale using its mean and standard deviation to eliminate intensity bias. Then, a dynamic intensity mapping function is applied to the normalized images to further adjust intensity distributions and compensate for dynamic range differences, ensuring a balanced contribution of each image to the final fusion result.
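The normalization and calibration steps can be sketched as follows. The per-image standardization matches the equation above, while the learned intensity mapping ℳ(·) is approximated here by a small pixel-wise network in residual form; the hidden width and the residual formulation are illustrative assumptions rather than the exact DFRS design.

```python
import torch
import torch.nn as nn

def normalize_image(img: torch.Tensor) -> torch.Tensor:
    """Per-image normalization (I_i - mu(I_i)) / sigma(I_i), computed per sample."""
    mu = img.mean(dim=(-2, -1), keepdim=True)
    sigma = img.std(dim=(-2, -1), keepdim=True)
    return (img - mu) / (sigma + 1e-8)

class IntensityCalibration(nn.Module):
    """Sketch of the dynamic intensity mapping M(.): a tiny pixel-wise MLP (1x1 convs)
    learns a smooth remapping of normalized intensities."""

    def __init__(self, channels: int = 1, hidden: int = 16):
        super().__init__()
        self.mapping = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, normalized: torch.Tensor) -> torch.Tensor:
        # Residual form keeps the calibration close to identity early in training.
        return normalized + self.mapping(normalized)

# Usage: two inputs with very different dynamic ranges end up on a comparable scale.
calib = IntensityCalibration()
I1, I2 = torch.rand(1, 1, 128, 128) * 255.0, torch.rand(1, 1, 128, 128) * 0.1
x1, x2 = calib(normalize_image(I1)), calib(normalize_image(I2))
```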
3.4.2 Adaptive feature weighting based on saliency
A significant innovation in the DFRS is the dynamic and context-aware determination of feature importance based on saliency scores. This mechanism allows the model to adaptively weight each input feature map based on its relevance to the image structure, enhancing the overall fusion quality. Unlike conventional methods that assign static weights to the feature maps, DFRS computes adaptive weights for each feature map at each scale, which are directly influenced by the saliency of the features. The saliency score is a crucial metric for evaluating the importance of the feature map in terms of its contribution to the structural integrity of the image.
Saliency is defined as the degree to which a feature map highlights critical structural or textural information in the image. In DFRS, the saliency score is calculated using a gradient-based activation function that quantifies the strength of the feature map's response to changes in the image. This response is captured through the L1-norm of the gradient of the feature map:

Sal(Fi(l)) = ‖∇Fi(l)‖1,

where ∇Fi(l) represents the gradient of the feature map with respect to the spatial coordinates of the image. The L1-norm is used to capture the total variation or edge strength in the feature map, which serves as an indicator of the feature's significance for the fusion task. Features with higher gradients indicate sharper edges or more pronounced structures, and are therefore considered more informative for image fusion.
The saliency score is then normalized to compute the adaptive weight for each feature map. The weight wi(l) is derived using the softmax function, which ensures that the weights are positive, normalized, and sum to one across all input images. The adaptive weights are calculated as:

wi(l) = exp(Sal(Fi(l))) / ∑j exp(Sal(Fj(l))),

where N is the number of input images, the sum in the denominator runs over j = 1, …, N, and the exponential function is applied to the saliency scores to accentuate more salient (informative) feature maps. The use of the softmax function ensures that the weights are comparative, allowing the model to assign higher weights to more significant feature maps while down-weighting less informative ones.
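A minimal sketch of the saliency-based weighting is given below. The gradients are approximated by finite differences, and the scores are rescaled before the softmax purely for numerical stability; this rescaling is an implementation convenience, not part of the formulation above.

```python
import torch

def saliency(feat: torch.Tensor) -> torch.Tensor:
    """Sal(F_i(l)) = || grad F_i(l) ||_1, computed via finite differences per sample."""
    gx = (feat[..., :, 1:] - feat[..., :, :-1]).abs()
    gy = (feat[..., 1:, :] - feat[..., :-1, :]).abs()
    return gx.flatten(1).sum(dim=1) + gy.flatten(1).sum(dim=1)    # shape (B,)

def saliency_weights(features) -> torch.Tensor:
    """Softmax over the N input images so the weights are positive and sum to one.
    Scores are divided by their mean before the softmax only to keep the
    exponentials numerically stable (an assumption of this sketch)."""
    scores = torch.stack([saliency(f) for f in features], dim=0)   # (N, B)
    scores = scores / (scores.mean(dim=0, keepdim=True) + 1e-8)
    return torch.softmax(scores, dim=0)                            # sums to 1 over N

# Usage: fuse one scale of three aligned feature maps with saliency-derived weights.
feats = [torch.rand(2, 64, 32, 32) for _ in range(3)]
w = saliency_weights(feats)                                        # (3, 2)
fused = sum(w[i].view(-1, 1, 1, 1) * f for i, f in enumerate(feats))
```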
3.4.3 Context-aware feature refinement
To further enhance the fusion quality and ensure that the fused image retains fine-grained details while maintaining global consistency, the DFRS incorporates a context-aware refinement mechanism. This strategy leverages global contextual information to guide the refinement of the feature maps at each scale, providing a higher level of coherence across different image regions. The key innovation lies in the use of a global attention mechanism, which integrates the global context into the refinement process, enabling the model to produce high-quality fused images with preserved structural details. The first step in this context-aware refinement process involves the computation of a global context representation, Fglobal, which is derived by aggregating the feature maps of all input images at each scale. The global context is computed as the average of all the feature maps from the input images at a given scale l:

Fglobal = (1/N) ∑i Fi(l),

where N represents the number of input images, and Fi(l) is the feature map of the i-th input image at scale l. This operation ensures that Fglobal captures the overall structure of the input images, aggregating information that is common across all images, while disregarding individual local discrepancies. Once the global context is computed, a global attention mechanism is applied to refine the fused feature map F(l). The attention mechanism operates by using both the local feature map F(l) and the global context Fglobal to produce a context-aware refinement signal, denoted as G(l):

G(l) = Attention(F(l), Fglobal),

where Attention(·) represents the attention mechanism that learns to selectively focus on the most relevant parts of the global context for refining the local feature map. This mechanism allows the model to incorporate global information selectively, ensuring that the refinement process enhances the feature map where it is most needed, without over-smoothing or distorting important details. The final step in the refinement process is to update the fused feature map by adding the context-aware refinement signal G(l) to the original fused feature map F(l). The refined feature map is computed as:

F̃(l) = F(l) + α · G(l),

where α is a learnable scalar parameter that controls the contribution of the global context refinement. This parameter allows the model to balance the influence of the local feature map F(l) and the global context Fglobal, enabling the model to dynamically adjust the amount of global context to be integrated into the final output. The refined feature map F̃(l) is then used for further processing in subsequent stages of the fusion network.
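The refinement step can be sketched as follows. The global context is the per-scale mean of the input feature maps and the learnable scalar α blends the refinement signal back in, as in the equations above; the sigmoid gate used to produce G(l) is an illustrative stand-in for the attention operator, not the exact DFRS module.

```python
import torch
import torch.nn as nn

class ContextAwareRefinement(nn.Module):
    """Sketch of context-aware refinement: F_global is the mean of the per-image
    feature maps, a small gate over the concatenated local and global features
    produces the refinement signal G(l), and alpha controls its contribution."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.alpha = nn.Parameter(torch.tensor(0.1))   # learnable contribution of G(l)

    def forward(self, fused: torch.Tensor, per_image_feats) -> torch.Tensor:
        f_global = torch.stack(per_image_feats, dim=0).mean(dim=0)       # F_global
        g = self.gate(torch.cat([fused, f_global], dim=1)) * f_global    # G(l)
        return fused + self.alpha * g                                    # refined F(l)

# Usage at one scale with three input images.
refine = ContextAwareRefinement(64)
per_img = [torch.rand(2, 64, 32, 32) for _ in range(3)]
refined = refine(torch.rand(2, 64, 32, 32), per_img)
```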
4 Experimental setup
4.1 Dataset
The DRIVE dataset (41) is a benchmark dataset for retinal vessel segmentation, containing high-resolution fundus images. It includes 40 images, split into training and test sets, with manually annotated vessel masks provided by experts. The dataset is widely used for evaluating segmentation algorithms in ophthalmology, offering standardized data for algorithm benchmarking. The Kvasir-SEG dataset (42) is a comprehensive dataset designed for the segmentation of gastrointestinal polyp images. It consists of 1,000 annotated images of varying sizes and resolutions, captured during colonoscopy procedures. The dataset provides pixel-level annotations and is valuable for developing and validating models for medical image segmentation tasks in gastrointestinal disease detection. The AMOS dataset (43) is a multimodal abdominal organ segmentation dataset, including both CT and MRI scans. It features over 500 scans with annotations for multiple abdominal organs, making it a rich resource for evaluating algorithms in 3D medical image segmentation. The dataset is particularly suited for cross-modality research and robust segmentation model development. The CHASE_DB1 dataset (44) is a retinal image dataset for vessel segmentation, consisting of 28 color fundus images with expert annotations. The images cover a wide range of vascular patterns and patient demographics, making it a valuable resource for advancing vessel segmentation techniques in ophthalmology research.
The selection of datasets in this study was guided by the goal of evaluating the model across a diverse range of imaging modalities and anatomical structures. DRIVE and CHASE_DB1 represent fundus imaging, Kvasir-SEG provides endoscopic gastrointestinal imagery, and AMOS includes both CT and MRI scans for abdominal organs. This combination allows for assessing the generalization capability of the model across low-contrast retinal vessels, highly variable polyp shapes, and multi-modal volumetric segmentation tasks. However, we acknowledge several limitations in these datasets. Although AMOS offers modality diversity, most datasets focus on 2D static images rather than the dynamic sequences commonly encountered in robotic applications. The annotations, though expert-reviewed, may still contain inter-observer variability and do not cover uncertain or ambiguous regions that can occur in clinical practice. The datasets are relatively well-curated and may not reflect the noise, compression artifacts, or motion blur often seen in real-time imaging. These constraints motivate our efforts to conduct further validation under simulated clinical conditions and, eventually, in collaboration with clinical partners using real-world data streams.
4.2 Experimental details
Our experiments were conducted on four publicly available datasets: DRIVE, Kvasir-SEG, AMOS, and CHASE_DB1, focusing on segmentation tasks in the medical imaging domain. Each dataset was preprocessed to ensure consistency and robustness in the training pipeline. For DRIVE and CHASE_DB1, retinal images were resized to a standard resolution of 512 × 512, normalized, and augmented with random rotations, flips, and intensity variations to address data imbalance. Similarly, for Kvasir-SEG and AMOS, image normalization and data augmentation techniques such as random cropping, scaling, and elastic deformations were applied to enhance generalization. Segmentation models were implemented using U-Net and its variants as the baseline architecture. For multimodal datasets like AMOS, encoder-decoder networks with feature fusion strategies were employed to effectively combine information from CT and MRI scans. The models were trained with a combination of Dice loss and cross-entropy loss to handle class imbalance. The optimizer used was Adam with an initial learning rate of 10−4, and a learning rate scheduler was applied to reduce the rate upon plateauing of validation loss. Training was performed using an 80-20 split for training and validation, with five-fold cross-validation to ensure robustness. The batch size was set to 16 for DRIVE and CHASE_DB1 and 8 for Kvasir-SEG and AMOS due to memory constraints. For evaluation, metrics such as Dice coefficient, Intersection over Union (IoU), Precision, and Recall were employed to comprehensively assess the segmentation performance. The models were trained on NVIDIA A100 GPUs for 100 epochs with early stopping criteria based on validation Dice coefficient. For post-processing, morphological operations and connected component analysis were applied to refine segmentation masks, particularly in cases of fragmented outputs. The implementation was conducted in PyTorch, leveraging state-of-the-art libraries for augmentation and model deployment. All experiments were repeated three times with different random seeds to ensure consistency. Comparisons with state-of-the-art methods were performed to validate the superiority of the proposed approach, and statistical significance tests were conducted to verify the robustness of the results.
To further support reproducibility, we provide more detailed information about our model architecture and training configurations. The encoder in MFAFN consists of four convolutional blocks with kernel size 3 × 3, each followed by batch normalization and GELU activation. The number of feature channels doubles at each downsampling stage, starting from 64 and increasing to 512. The decoder mirrors this architecture with transposed convolutions for upsampling. In DFRS, we implement four multi-frequency multi-scale attention (MFMSA) blocks, each containing a combination of depth-wise separable convolutions, 1 × 1 bottleneck projections, and channel attention mechanisms. For training, the Adam optimizer is used with an initial learning rate of 1e-4. A cosine annealing learning rate scheduler with a warm-up phase of 5 epochs is applied to stabilize convergence. Early stopping is employed based on validation Dice score, with a patience of 10 epochs. The total number of trainable parameters in the MFAFN+DFRS framework is approximately 37 million. All models were implemented using PyTorch 2.0 with CUDA 12.1 on NVIDIA A100 GPUs (40GB), running on Ubuntu 20.04. Random seeds are fixed across experiments for consistency. Detailed training logs, including loss curves and validation metrics, will be made available upon request.
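For reference, the optimization setup described above can be reproduced with standard PyTorch components. The sketch below combines a soft Dice term with binary cross-entropy and pairs Adam (learning rate 1e-4) with a 5-epoch warm-up followed by cosine annealing; the linear warm-up shape and the binary formulation of the cross-entropy term are assumptions, since only the high-level configuration is stated here.

```python
import torch
import torch.nn as nn

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for binary masks, combined with cross-entropy as in the text."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(-2, -1))
    union = probs.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def build_training(model: nn.Module, epochs: int = 100, warmup: int = 5):
    """Adam at 1e-4 with a linear warm-up followed by cosine annealing, mirroring the
    reported configuration; the exact warm-up schedule is an assumption."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    warm = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=warmup)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warm, cosine], milestones=[warmup])
    bce = nn.BCEWithLogitsLoss()

    def loss_fn(logits, target):
        return bce(logits, target) + dice_loss(logits, target)

    return optimizer, scheduler, loss_fn
```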
To ensure a fair and rigorous comparison, we harmonized the experimental setup across all baseline methods and our proposed model. All models, including ours and the state-of-the-art methods, were trained and evaluated under the same conditions whenever possible. This includes using identical data preprocessing pipelines such as image resizing, normalization, and augmentation, evaluation metrics Dice, IoU, Precision, Recall, and train/validation splits. For methods with publicly available implementations, we used official codebases and retrained them on the same datasets with matched batch sizes and learning schedules. In cases where training from scratch was not feasible, we used reported results directly from the original papers but ensured the datasets and metrics were aligned. These precautions help to eliminate confounding factors and provide a robust basis for performance comparison.
4.3 Comparison with SOTA methods
We compared the performance of our proposed method with several state-of-the-art (SOTA) approaches on the DRIVE, Kvasir-SEG, AMOS, and CHASE_DB1 datasets. The results, presented in Tables 1, 2, demonstrate that our method achieves superior segmentation performance across all datasets, outperforming existing techniques in terms of Dice coefficient, IoU, Recall, and Precision.

Table 1. Comparison of our method with SOTA methods on DRIVE and Kvasir-SEG datasets for medical image segmentation.

Table 2. Comparison of our method with SOTA methods on AMOS and CHASE_DB1 datasets for medical image segmentation.
On the DRIVE dataset, our method achieved a Dice coefficient of 88.34%, surpassing the highest SOTA performance of 85.78% by TransUNet. Similarly, the IoU improved from 82.67% (TransUNet) to 85.12%. These improvements can be attributed to the advanced feature extraction and contextual attention mechanisms in our model, which effectively capture fine-grained details in retinal images. On the Kvasir-SEG dataset, our method achieved a Dice score of 90.45% and an IoU of 86.78%, significantly outperforming TransUNet, which recorded 88.34% and 84.12%, respectively. This improvement is largely due to our robust augmentation strategies and efficient feature fusion. For the AMOS dataset, our method recorded a Dice score of 89.34%, compared to 87.56% achieved by TransUNet. The IoU improved from 84.78% (TransUNet) to 86.45%, demonstrating the strength of our method in handling 3D medical image data. On the CHASE_DB1 dataset, our method achieved a Dice score of 88.23%, outperforming the previous best score of 86.45% by TransUNet, and the IoU increased from 83.89% to 85.34%. These results validate the robustness of our model in segmenting challenging vascular structures.
The superior performance across all datasets can be attributed to the following factors: the model's ability to capture global context and local details using a hybrid architecture that integrates attention and transformer modules; advanced data augmentation techniques, which enhanced the generalization capability of the model; and the use of optimized loss functions, such as Dice loss combined with cross-entropy loss, which addressed class imbalance effectively.
The performance gains observed in Tables 1, 2 can be attributed to the synergistic design of our MFAFN-DFRS framework. The multi-scale feature adaptive fusion mechanism allows the model to preserve both local texture details and global contextual semantics, which is particularly beneficial in datasets like Kvasir-SEG and DRIVE, where the anatomical structures vary in scale and complexity. The dynamic refinement strategy plays a crucial role in improving robustness to noise and modality inconsistencies, especially in multi-source datasets such as AMOS and CHASE_DB1. Compared to methods like TransUNet and AttentionUNet, our approach avoids overfitting to local patterns by integrating context-aware refinement with saliency-weighted features. This explains the consistent improvements in Dice and IoU scores across diverse datasets and segmentation tasks.
4.4 Ablation study
To evaluate the contribution of each component in our proposed method, we conducted an ablation study on the DRIVE, Kvasir-SEG, AMOS, and CHASE_DB1 datasets. The results are summarized in Tables 3, 4, showcasing the impact of removing specific components (Multi-Scale Feature Extraction, Attention-Driven Feature Alignment, and Context-Aware Feature Refinement) on segmentation performance metrics such as Dice coefficient, IoU, Recall, and Precision.
For the DRIVE dataset, removing Multi-Scale Feature Extraction reduced the Dice coefficient from 88.34% to 85.23%, highlighting its crucial role in capturing fine-grained features in retinal images. Similarly, omitting Attention-Driven Feature Alignment resulted in a Dice score of 86.12%, indicating the significance of this module in enhancing contextual understanding. Removing Context-Aware Feature Refinement showed a marginal drop to 87.23%, underlining its role in refining segmentation outputs. For the Kvasir-SEG dataset, the Dice score dropped from 90.45% to 87.45% without Multi-Scale Feature Extraction and to 88.34% without Attention-Driven Feature Alignment, emphasizing the importance of these components in handling variations in gastrointestinal polyp images. On the AMOS dataset, removing Multi-Scale Feature Extraction led to a decrease in the Dice coefficient from 89.34% to 86.12%, demonstrating its importance in processing multimodal data such as CT and MRI scans. Similarly, on the CHASE_DB1 dataset, removing Attention-Driven Feature Alignment reduced the Dice score from 88.23% to 87.45%, highlighting its significance in vascular segmentation. Context-Aware Feature Refinement, while contributing less significantly than Multi-Scale Feature Extraction and Attention-Driven Feature Alignment, still played a role in performance refinement, with Dice scores dropping to 88.12% on AMOS and 88.67% on CHASE_DB1 when it was removed.
The ablation study results validate the importance of each component, as evidenced by the consistent degradation in performance when any component is removed. These findings affirm the robustness and effectiveness of our integrated design for medical image segmentation.
Tables 3, 4 provide clear insights into the contribution of each component within our proposed architecture. The significant drop in performance when removing the Multi-Scale Feature Extraction module confirms its necessity for capturing hierarchical features. Without the Attention-Driven Feature Alignment, the model fails to emphasize structurally important regions, resulting in lower precision and recall. The marginal yet consistent improvement from the Context-Aware Feature Refinement module shows that integrating global contextual cues improves the spatial coherence of segmentation, especially in complex backgrounds. Collectively, these ablation results validate our design choices and highlight the importance of combining scale-awareness, attention mechanisms, and semantic-level refinement to achieve state-of-the-art performance.
To address concerns regarding real-time applicability, we evaluated the inference performance of our model and several baselines on a standardized hardware setup. As shown in Table 5, our proposed MFAFN achieves 39.8 FPS on 512 × 512 resolution input with a single A100 GPU, which is comparable to TransUNet while providing significantly better segmentation performance. Although our model requires slightly higher memory and parameter count due to the multi-scale attention mechanism, the inference time remains within clinically acceptable ranges (< 30 ms per image), supporting its potential deployment in real-time medical imaging systems.

Table 5. Inference performance and computational requirements of different models on NVIDIA A100 GPU.
To approximate real-world clinical deployment scenarios, we conducted a series of controlled simulations where DRIVE images were subjected to conditions mimicking common clinical constraints, such as low-light environments, image compression, motion blur, and sensor noise. As shown in Table 6, our model maintains robust performance across all variants, with only moderate reductions in Dice and IoU scores. Notably, recall remains consistently high under degraded conditions, indicating the model's ability to preserve critical anatomical structures even under imperfect inputs. This suggests strong potential for practical application in real-time medical robotic systems operating in challenging environments.
To provide clarity on deployment constraints, we benchmarked the computational profile of MFAFN + DFRS under varying image resolutions. As shown in Table 7, the model maintains low inference latency (11.3 ms) and modest memory usage at 256 × 256 resolution, making it well-suited for embedded applications such as endoscopic robots or mobile diagnostic units. At 512 × 512, which matches most dataset configurations, the model remains deployable on high-performance GPUs with medium-level constraints. However, at 1024 × 1024, resource demands grow significantly, which may limit deployment on embedded devices without optimization. These results suggest that the model is computationally efficient enough for many real-time clinical applications, particularly when paired with lightweight deployment strategies such as quantization or TensorRT optimization.
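The latency and FPS figures of the kind reported in Tables 5 and 7 can be reproduced with a simple measurement loop of the following form; the batch size of one, the single-channel input, and the warm-up and repeat counts are illustrative assumptions of this sketch.

```python
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, resolution: int = 512, runs: int = 100, warmup: int = 10):
    """Rough per-image latency (ms) and FPS: synchronize around the timed region so
    that asynchronous GPU kernels are fully accounted for."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.rand(1, 1, resolution, resolution, device=device)
    for _ in range(warmup):                   # warm-up iterations are not timed
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000.0
    return latency_ms, 1000.0 / latency_ms
```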
5 Discussion
Although this study does not involve any human or animal data directly, we recognize that eventual clinical deployment of the proposed framework in medical robotic systems will require careful attention to ethical, regulatory, and compliance-related considerations. These include patient privacy and data protection, algorithmic fairness and transparency, clinical safety validation, and alignment with medical device regulatory standards such as FDA (U.S.), CE Marking (EU), or NMPA (China). As part of our future research roadmap, we intend to consult with regulatory professionals and institutional ethics committees to ensure that our developments meet the necessary compliance standards and can be responsibly translated into real-world healthcare environments.
Given the high-stakes nature of medical applications, model interpretability remains a critical concern. While our framework primarily focuses on fusion performance and architectural efficiency, we acknowledge the importance of explainability for clinical acceptance. The attention mechanisms and saliency-based weighting components in our design provide a partial pathway for interpretation by highlighting spatial regions of focus during fusion. However, deeper interpretability, such as quantifying uncertainty, visualizing decision pathways, or integrating explainable AI (XAI) modules, remains an open area for future exploration. We plan to extend our work by incorporating post-hoc interpretation techniques and model-inherent transparency to facilitate better understanding and trust in clinical decision support systems.
6 Conclusions and future work
This research addresses the critical need for enhanced visual data quality in medical robotic systems, particularly under challenging low-quality imaging conditions, through an interdisciplinary physics framework. By integrating principles from computational physics, non-linear systems, and technological development, the study advances image fusion methodologies to tackle issues such as effective feature integration, dynamic range management, and noise suppression. The proposed MFAFN and DFRS embody this interdisciplinary approach. MFAFN leverages multi-scale feature extraction, attention-based alignment, and adaptive fusion, enhancing the integration of spatial and spectral data while preserving critical details. DFRS complements MFAFN with dynamic normalization, saliency-based feature refinement, and context-aware noise reduction, collectively improving fusion quality metrics, including spatial consistency, edge retention, and noise suppression. These advancements not only position the MFAFN-DFRS framework as a robust solution for improving medical robot vision but also contribute to the broader field of interdisciplinary physics, with applications spanning computational imaging, non-linear systems, and cyber-physical systems.
Although this work primarily focuses on the algorithmic design and evaluation of a medical image fusion framework, it has been conceived with practical clinical scenarios in mind. The proposed MFAFN-DFRS model addresses common visual challenges encountered in robotic-assisted surgeries and diagnostic imaging, such as low-light conditions, motion artifacts, and multi-modal inconsistencies. These are particularly relevant in minimally invasive procedures, endoscopic operations, and intraoperative navigation, where real-time, high-fidelity visual data is critical for decision-making. While the current study does not involve direct collaboration with clinical professionals, the design of the framework is informed by established needs in robotic workflows. To bridge the gap between technical innovation and medical applicability, we plan to pursue interdisciplinary partnerships with medical experts in future work, aiming to refine system specifications and validate integration into clinical environments.
Despite its promising contributions, the study has certain limitations. The computational complexity of MFAFN and DFRS may challenge real-time implementation in medical robotic systems, especially in resource-constrained environments. Future work could explore optimization techniques rooted in computational physics or hardware acceleration to enhance processing efficiency. Further validation across diverse imaging modalities and clinical scenarios is necessary to ensure broad applicability. Expanding this work to incorporate additional data sources and test it in real-world operational conditions will be critical for scaling the technology. These efforts could enhance the interdisciplinary integration of physics principles in medical robotics, paving the way for more versatile and practical systems that enable safer and more accurate healthcare procedures while advancing the frontiers of sustainable technological innovation.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.
Author contributions
LC: Methodology, Supervision, Project administration, Validation, Resources, Visualization, Writing – original draft, Writing – review & editing. SW: Data curation, Conceptualization, Formal analysis, Investigation, Funding acquisition, Software, Writing – original draft, Writing – review & editing. SL: Visualization, Supervision, Funding acquisition, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that no financial support was received for the research and/or publication of this article.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Wang G. RL-CWtrans Net: multimodal swimming coaching driven via robot vision. Front Neurorobot. (2024) 18:1439188. doi: 10.3389/fnbot.2024.1439188
2. Pan R. Multimodal fusion-powered English speaking robot. Front Neurorobot. (2024) 18:1478181. doi: 10.3389/fnbot.2024.1478181
3. Wang J, Cao D, Li Y, Wang J, Wu Y. Multi-user motion recognition using sEMG via discriminative canonical correlation analysis and adaptive dimensionality reduction. Front Neurorobot. (2022) 16:997134. doi: 10.3389/fnbot.2022.997134
4. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: gated axial-attention for medical image segmentation. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science, vol. 12901. Cham: Springer (2021). p. 36–46. doi: 10.1007/978-3-030-87193-2_4
5. Bai Y, Chen D, Li Q, Shen W, Wang Y. Bidirectional copy-paste for semi-supervised medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2023). p. 11514–24. Available online at: http://openaccess.thecvf.com/content/CVPR2023/html/Bai_Bidirectional_Copy-Paste_for_Semi-Supervised_Medical_Image_Segmentation_CVPR_2023_paper.html
6. Rahman MM, Marculescu R. Medical image segmentation via cascaded attention decoding. In: IEEE Workshop/Winter Conference on Applications of Computer Vision. Waikoloa, HI: IEEE. (2023). doi: 10.1109/WACV56688.2023.00616
7. Wu J, Fu R, Fang H, Zhang Y, Xu Y. MedSegDiff-V2: diffusion based medical image segmentation with transformer. In: AAAI Conference on Artificial Intelligence. (2023). p. 6030–8. doi: 10.1609/aaai.v38i6.28418
8. Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, et al. MedNeXt: transformer-driven scaling of ConvNets for medical image segmentation. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2023. Lecture Notes in Computer Science, vol. 14223. Cham: Springer (2023). p. 405–15. doi: 10.1007/978-3-031-43901-8_39
9. Rahman A, Valanarasu JMJ, Hacihaliloglu I, Patel VM. Ambiguous medical image segmentation using diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2023). p. 11536–46. Available online at: http://openaccess.thecvf.com/content/CVPR2023/html/Rahman_Ambiguous_Medical_Image_Segmentation_Using_Diffusion_Models_CVPR_2023_paper.html
10. Valanarasu JMJ, Patel VM. UNeXt: MLP-based rapid medical image segmentation network. In: Wang L, Dou Q, Fletcher PT, Speidel S, Li S, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol. 13435. Cham: Springer (2022). p. 23–33. doi: 10.1007/978-3-031-16443-9_3
11. Butoi VI, Ortiz JJG, Ma T, Sabuncu MR, Guttag J, Dalca AV. UniverSeg: universal medical image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE (2023). p. 21438–51. Available online at: http://openaccess.thecvf.com/content/ICCV2023/html/Butoi_UniverSeg_Universal_Medical_Image_Segmentation_ICCV_2023_paper.html
12. Zhang Y, Liu H, Hu Q. TransFuse: fusing transformers and CNNs for medical image segmentation. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science, vol. 12901. Cham: Springer (2021). p. 14–24. doi: 10.1007/978-3-030-87193-2_2
13. He A, Wang K, Li T, Du C, Xia S, Fu H. H2Former: an efficient hierarchical hybrid transformer for medical image segmentation. IEEE Trans Med Imaging. (2023) 42:2763–75. doi: 10.1109/TMI.2023.3264513
14. Zhang Y, Zhou T, Wang S, Liang P, Zhang Y, Chen D. Input augmentation with SAM: boosting medical image segmentation with segmentation foundation model. In: Celebi ME, Salekin MD, Kim H, Albarqouni S, Barata C, Halpern A, Tschandl P, Combalia M, Liu Y, Zamzmi G, Levy J, Rangwala H, Reinke A, Wynn D, Landman B, Jeong W-K, Shen Y, Deng Z, Bakas S, Li X, Qin C, Rieke N, Roth H, Xu D, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2023 Workshops. MICCAI 2023. Lecture Notes in Computer Science, vol. 14393. Cham: Springer (2023). p. 129–39. doi: 10.1007/978-3-031-47401-9_13
15. Huang H, Chen Z, Zou Y, Lu M, Chen C. Channel prior convolutional attention for medical image segmentation. Comput Biol Med. (2024) 178:108784. doi: 10.1016/j.compbiomed.2024.108784
16. You C, Dai W, Min Y, Liu F, Clifton D, Zhou SK, et al. Rethinking semi-supervised medical image segmentation: a variance-reduction perspective. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in Neural Information Processing Systems, Vol. 36. Curran Associates, Inc. (2023). p. 9984–10021. Available online at: https://proceedings.neurips.cc/paper_files/paper/2023/hash/1f7e6d5c84b0ed286d0e69b7d2c79b47-Abstract-Conference.html
17. Xiao H, Li L, Liu Q, Zhu X, Zhang Q. Transformers in medical image segmentation: a review. Biomed Signal Process Control. (2023) 84:104791. doi: 10.1016/j.bspc.2023.104791
18. Shajin FH, Devi B, Prakash N, Sreekanth G, Rajesh P. Sailfish optimizer with Levy flight, chaotic and opposition-based multi-level thresholding for medical image segmentation. Soft Comput. (2023) 27:12457–82. doi: 10.1007/s00500-023-07891-w
19. Chaitanya K, Erdil E, Karani N, Konukoglu E. Contrastive learning of global and local features for medical image segmentation with limited annotations. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc. (2020). p. 12546–12558. Available online at: https://proceedings.neurips.cc/paper/2020/hash/949686ecef4ee20a62d16b4a2d7ccca3-Abstract.html
20. Xie Y, Zhang J, Shen C, Xia Y. CoTr: efficiently bridging CNN and transformer for 3D medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. (2021). doi: 10.1007/978-3-030-87199-4_16
21. Ulrich C, Isensee F, Wald T, Zenk M, Baumgartner M, Maier-Hein K. MultiTalent: a multi-dataset approach to medical image segmentation. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R, editors. Medical Image Computing and Computer Assisted Intervention - MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol. 14222. Cham: Springer (2023). p. 648–58. doi: 10.1007/978-3-031-43898-1_62
22. Luo X, Chen J, Song T, Chen Y, Wang G, Zhang S. Semi-supervised medical image segmentation through dual-task consistency. In: AAAI Conference on Artificial Intelligence, vol. 35. (2020). p. 8801–9. doi: 10.1609/aaai.v35i10.17066
23. Wu J, Fu R, Fang H, Zhang Y, Yang Y, Xiong H, et al. MedSegDiff: medical image segmentation with diffusion probabilistic model. In: Medical Imaging with Deep Learning, Vol. 227. PMLR (2024). p. 1623–39. Available online at: https://proceedings.mlr.press/v227/wu24a.html
24. Jha D, Riegler M, Johansen D, Halvorsen P, Johansen HD. DoubleU-Net: a deep convolutional neural network for medical image segmentation. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS). Rochester, MN: IEEE. (2020).
25. Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. (2022) 15:210. doi: 10.1186/s13104-022-06096-y
26. Hu Z, Lin H, Wang C. A novel grid multi-structure chaotic attractor and its application in medical image encryption. Front Phys. (2023) 11:1273872. doi: 10.3389/fphy.2023.1273872
27. Malhotra P, Gupta S, Koundal D, Zaguia A, Enbeyle W. Deep neural networks for medical image segmentation. J Healthc Eng. (2022) 2022:9580991. doi: 10.1155/2022/9580991
28. Yin X, Sun L, Fu Y, Lu R, Zhang Y. U-Net-based medical image segmentation. J Healthc Eng. (2022) 2022:4189781. doi: 10.1155/2022/4189781
29. Liu Q, Chen C, Qin J, Dou Q, Heng P. FedDG: federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2021). p. 1013–23. Available online at: http://openaccess.thecvf.com/content/CVPR2021/html/Liu_FedDG_Federated_Domain_Generalization_on_Medical_Image_Segmentation_via_Episodic_CVPR_2021_paper.html
30. Yao W, Gao K, Zhang Z, Cui L, Zhang J. An image encryption algorithm based on a 3D chaotic Hopfield neural network and random row-column permutation. Front Phys. (2023) 11:1162887. doi: 10.3389/fphy.2023.1162887
31. Azad R, Aghdam EK, Rauland A, Jia Y, Avval AH, Bozorgpour A, et al. Medical image segmentation review: the success of U-Net. IEEE Trans Pattern Anal Mach Intell. (2024) 46:10076–95. doi: 10.1109/TPAMI.2024.3435571
32. Huang X, Deng Z, Li D, Yuan X, Fu Y. MISSFormer: an effective transformer for 2D medical image segmentation. IEEE Trans Med Imaging. (2022) 42:1484–94. doi: 10.1109/TMI.2022.3230943
33. Jin X, Wu N, Jiang Q, Kou Y, Duan H, Wang P, et al. A dual descriptor combined with frequency domain reconstruction learning for face forgery detection in deepfake videos. Forensic Sci Int: Digit Investigat. (2024) 49:301747. doi: 10.1016/j.fsidi.2024.301747
34. Jin X, Liu L, Ren X, Jiang Q, Lee SJ, Zhang J, et al. A restoration scheme for spatial and spectral resolution of panchromatic image using convolutional neural network. IEEE J Select Top Appl Earth Observat Remote Sens. (2024) 17:3379–93. doi: 10.1109/JSTARS.2024.3351854
35. Tu X, Yuan Z, Liu B, Liu J, Hu Y, Hua H, et al. An improved YOLOv5 for object detection in visible and thermal infrared images based on contrastive learning. Front Phys. (2023) 11:1193245. doi: 10.3389/fphy.2023.1193245
36. Jin X, Zhang P, He Y, Jiang Q, Wang P, Hou J, et al. A theoretical analysis of continuous firing condition for pulse-coupled neural networks with its applications. Eng Appl Artif Intell. (2023) 126:107101. doi: 10.1016/j.engappai.2023.107101
37. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like Pure transformer for medical image segmentation. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision - ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol. 13803. Cham: Springer (2021). p. 205–18. doi: 10.1007/978-3-031-25066-8_9
38. Hatamizadeh A, Yang D, Roth H, Xu D. UNETR: transformers for 3D medical image segmentation. In: IEEE Workshop/Winter Conference on Applications of Computer Vision. Waikoloa, HI: IEEE. (2021). doi: 10.1109/WACV51458.2022.00181
39. Zong Y, Zuo Q, Ng MKP, Lei B, Wang S. A new brain network construction paradigm for brain disorder via diffusion-based graph contrastive learning. IEEE Trans Pattern Anal Mach Intell. (2024) 46:10389–403. doi: 10.1109/TPAMI.2024.3442811
40. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, et al. UNet 3+: a full-scale connected UNet for medical image segmentation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. Barcelona: IEEE (2020). doi: 10.1109/ICASSP40776.2020.9053405
41. Gehrig M, Aarents W, Gehrig D, Scaramuzza D. Dsec: a stereo event camera dataset for driving scenarios. IEEE Robot Automat Letters. (2021) 6:4947–54. doi: 10.1109/LRA.2021.3068942
42. Jha D, Smedsrud PH, Riegler MA, Halvorsen P, De Lange T, Johansen D, et al. Kvasir-seg: a segmented polyp dataset. In: MultiMedia Modeling: 26th International Conference, MMM 2020. Daejeon: Springer (2020). p. 451–462.
43. Thakkar JJ. Applications of structural equation modelling with AMOS 21, IBM SPSS. In: Structural Equation Modelling: Application for Research and Practice (with AMOS and R). Cham: Springer (2020). p. 35–89. doi: 10.1007/978-981-15-3793-6
44. Shilpa S, Karthik B. Diabetic retinopathy automatic detection and classification in fundus images using modified residual convolutional neural networks (CNNs) with improved accuracy. In: International Conference on Intelligence Science. Cham: Springer (2023). p. 349–364.
45. Peiris H, Chen Z, Egan G, Harandi M. Duo-SegNet: adversarial dual-views for semi-supervised medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg: Springer (2021). p. 428–438.
46. Peng H, Xue C, Shao Y, Chen K, Xiong J, Xie Z, et al. Semantic segmentation of litchi branches using DeepLabV3+ model. IEEE Access. (2020) 8:164546–55. doi: 10.1109/ACCESS.2020.3021739
47. Li R, Zheng S, Duan C, Su J, Zhang C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci Remote Sens Letters. (2021) 19:1–5. doi: 10.1109/LGRS.2021.3063381
48. Wu H, Zhao Z, Wang Z. META-Unet: Multi-scale efficient transformer attention Unet for fast and high-accuracy polyp segmentation. IEEE Trans Automat Sci Eng. (2023). doi: 10.1109/TASE.2023.3292373
Keywords: medical robot vision, image fusion, interdisciplinary physics, DFRS, quality improvement
Citation: Chen L, Wu S and Leung SCH (2025) Interdisciplinary approaches to image processing for medical robotics. Front. Med. 12:1564678. doi: 10.3389/fmed.2025.1564678
Received: 26 January 2025; Accepted: 24 April 2025;
Published: 02 June 2025.
Edited by:
Shuqiang Wang, Chinese Academy of Sciences (CAS), China
Reviewed by:
Hathal Haddad, University of Tübingen, Germany
Changhong Jing, Hong Kong Polytechnic University, Hong Kong SAR, China
Copyright © 2025 Chen, Wu and Leung. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Shiwen Wu, wwqag6@163.com