Optimizing RGB-D semantic segmentation through multi-modal interaction and pooling attention

Semantic segmentation of RGB-D images involves understanding the appearance and spatial relationships of objects within a scene, which requires careful consideration of various factors. However, in indoor environments, the simple input of RGB and depth images often results in a relatively limited acquisition of semantic and spatial information, leading to suboptimal segmentation outcomes. To address this, we propose the Multi-modal Interaction and Pooling Attention Network (MIPANet), a novel approach designed to harness the interactive synergy between RGB and depth modalities, optimizing the utilization of complementary information. Specifically, we incorporate a Multi-modal Interaction Fusion Module (MIM) into the deepest layers of the network. This module is engineered to facilitate the fusion of RGB and depth information, allowing for mutual enhancement and correction. Additionally, we introduce a Pooling Attention Module (PAM) at various stages of the encoder. This module serves to amplify the features extracted by the network and integrates the module's output into the decoder in a targeted manner, significantly improving semantic segmentation performance. Our experimental results demonstrate that MIPANet outperforms existing methods on two indoor scene datasets, NYUDv2 and SUN-RGBD, underscoring its effectiveness in enhancing RGB-D semantic segmentation.


Introduction
In recent years, Convolutional Neural Networks (CNNs) have been widely used in image semantic segmentation, and high-performance models have gradually replaced traditional semantic segmentation methods. Since the introduction of Fully Convolutional Neural Networks (FCN) [1,2], which show great potential in semantic segmentation tasks, many researchers have proposed improved semantic segmentation models based on this design. Nevertheless, semantic segmentation remains a formidable challenge in some indoor environments, given intricacies such as variations in ...

With the widespread application of depth sensors and depth cameras [3], image research is no longer limited to RGB color images but extends to RGB-Depth (RGB-D) images that contain depth information. RGB images provide appearance information such as the color and texture of objects, whereas depth images provide three-dimensional geometric information about objects, which is missing in RGB images and is valuable for indoor scenes.

References [4,5] simply concatenate RGB features and depth features to form a four-channel input, improving the accuracy of semantic segmentation. Reference [6] converts depth images into three distinct channels (horizontal disparity, height above ground, and angle of surface normals) to obtain the HHA image, then feeds the RGB features and HHA features into two parallel CNNs that each predict a semantic segmentation probability map, and fuses the two maps in the last layer of the network as the final segmentation result. Though the above methods have achieved good results in RGB-D semantic segmentation, most RGB-D semantic segmentation methods [7,8,9,10] simply merge RGB features and depth features by concatenation or summation. As a result, the information differences between the two modalities are not handled effectively, which prevents the CNN from fully using their complementary information and results in confusion between objects and the background. For example, the printer and trash bin in Fig. 1 (a) are prone to be inaccurately assimilated into the background.
To solve the above problems, we propose MIPANet, an RGB-D semantic segmentation network for indoor scenes. Fig. 2 illustrates the overall structure of the network. The network is an encoder-decoder architecture that integrates two feature fusion modules: the Multi-modal Interaction Module (MIM) and the Pooling Attention Module (PAM). The encoder is composed of two identical CNN branches, one extracting RGB features and the other extracting depth features. RGB and depth features are extracted and fused incrementally across various network levels, and the segmentation results are optimized by utilizing spatial disparities and semantic interdependencies among the multimodal features. In the PAM, we use adaptive average pooling instead of global average pooling, which not only allows flexible adjustment of the output size but also preserves more spatial information, facilitating enhanced extraction of depth features. In the MIM, we obtain two sets of Q, K, V for the two modalities and perform calculations using the Q, K from one set and the V from the other. This achieves information interaction between the RGB and depth modalities. This paper's main contributions can be summarized as follows:
• We introduce an end-to-end multi-modal fusion network, MIPANet, incorporating multi-modal interaction and pooling attention. This approach optimizes the integration of complementary information from RGB and depth features, tackling the challenge posed by insufficient cross-modal feature fusion in RGB-D semantic segmentation.
• We present two cross-modal feature fusion methods. Within the MIM, a cross-modal feature interaction and fusion mechanism is developed: RGB and depth features are collaboratively optimized using attention masks to extract partially detailed features. In addition, the PAM integrates intermediate-layer features into the decoder, enhancing feature extraction and supporting the decoder in upsampling and recovery.
• Experimental results confirm the effectiveness of our proposed RGB-D semantic segmentation network in accurately handling indoor images in complex scenarios. The model demonstrated superior semantic segmentation performance compared to other methods on the publicly available NYUv2 and SUN RGB-D datasets.

Related Work
In this section, we provide a comprehensive review of three parts: (1) RGB-D Semantic Segmentation, (2) Attention Mechanism, and (3) Cross-modal Interaction.

RGB-D Semantic Segmentation
With the widespread application of depth sensors and depth cameras in the field of depth estimation [11,9,12,13], the depth information of a scene can be obtained more conveniently, and image research is no longer limited to a single RGB image. The RGB-D semantic segmentation task is to efficiently integrate RGB features and depth features to improve segmentation accuracy, especially in indoor scenes. Couprie et al. [4] proposed an early fusion approach, which simply concatenates an image's RGB and depth channels as a four-channel input to the convolutional neural network. Wang et al. [6] separately input RGB features and HHA features into two CNNs for prediction and perform fusion in the final stage of the network, and [14] introduced an encoder-decoder network that employs a dual-branch encoder to extract features separately from RGB images and depth images. The studies mentioned above employed equal-weight concatenation or summation to fuse RGB and depth features without fully leveraging the complementary information between the modalities. In recent years, some research has proposed more effective strategies for RGB-D feature fusion. Hu et al. [15] utilised a three-branch encoder that includes RGB, Depth, and Fusion branches, efficiently collecting features without breaking the original RGB and depth inference branches. Seichter et al. [16] presented an efficient RGB-D segmentation approach, characterised by two enhanced ResNet-based encoders that use attention-based fusion to incorporate depth information. However, these methods did not fully exploit the differential information between the two modalities and the intermediate-level features extracted by the convolutional network.
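As a minimal illustration of the early-fusion input described above, the following NumPy sketch stacks an RGB image and a depth map into one four-channel array. The function name `early_fusion` and the toy array sizes are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def early_fusion(rgb, depth):
    """Concatenate an (H, W, 3) RGB image and an (H, W) depth map into (H, W, 4)."""
    if rgb.shape[:2] != depth.shape:
        raise ValueError("RGB and depth must share the same spatial size")
    # Add a trailing channel axis to the depth map, then stack along channels.
    return np.concatenate([rgb, depth[..., np.newaxis]], axis=-1)

# Toy inputs (hypothetical sizes): a black RGB image and a constant depth map.
rgb = np.zeros((4, 6, 3), dtype=np.float32)
depth = np.ones((4, 6), dtype=np.float32)
fused = early_fusion(rgb, depth)
```

Because both modalities enter the network through the same first convolution, this scheme gives the network no explicit mechanism to weigh one modality against the other, which is the limitation the later fusion strategies try to address.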

Attention Mechanism
In recent years, attention [17,18,19,20,21,22] has been widely used in computer vision and other fields. Vaswani et al. [17] proposed the self-attention mechanism, which has had a profound impact on the design of deep learning models. Fu et al. [19] proposed DANet, which can adaptively integrate local features and their global dependencies. Wang et al. [23] utilised spatial attention in an image classification model. Through the backpropagation of a convolutional neural network, they adaptively learned spatial attention masks, allowing the model to focus on the significant regions of the image. SENet [24] proposed channel attention, which adaptively learns the importance of each feature channel through a neural network. Woo et al. [22] incorporate two attention modules that concurrently capture channel-wise and spatial relationships. ECA-Net [25] introduces a straightforward and efficient "local" channel attention mechanism to minimize computational overhead. MFC [26] introduced a multi-frequency domain attention module to capture information across different frequency domains. Similarly, CAMNet [2] proposed a contrastive attention module designed to amplify local saliency. Building upon this foundation, Huang et al. [27] proposed a cross-attention module that consolidates contextual information both horizontally and vertically and can thus gather contextual information from all pixels. These methods have demonstrated significant potential in single-modality feature extraction. To effectively leverage the complementary information between different modalities, this paper introduces a Pooling Attention Module that learns the differential information between two distinct modalities and fully exploits the intermediate-level features in the convolutional network and the long-range semantic dependencies between modalities.

Cross-modal Interaction
With the development of sensor technology, different types of sensors can provide a variety of modal information for semantic segmentation tasks, enabling information interaction [28,29,30,31,32] between the RGB modality and other modalities. The interaction between RGB and infrared modalities has enhanced the effectiveness of semantic segmentation in RGB-T scenarios. Xiang et al. [33] used a single-shot polarization sensor to build the first RGB-P dataset, incorporated polarization sensing to obtain supplementary information, and improved the segmentation accuracy for many categories, especially those with polarization characteristics, such as glass. HPGN [34] proposes a novel pyramid graph network targeting features, which is closely connected behind the backbone network to explore multiscale spatial structural features. GiT [35] proposes a structure where graphs and transformers interact constantly, enabling close collaboration between global and local features for vehicle re-identification. Zhuang et al. [36] propose a network consisting of two streams (a LiDAR stream and a camera stream), which extract features from the two modalities respectively to realize information interaction between the RGB and LiDAR modalities. These works show that improving semantic segmentation through information interaction between the RGB modality and other modalities is feasible.

Overview
Fig. 2 depicts the overall structure of the network. The architecture follows an encoder-decoder design, employing skip connections to facilitate information flow between encoding and decoding layers. The encoder comprises a dual-branch convolutional network, with one branch extracting RGB features and the other extracting depth features. We utilize two pre-trained ResNet50 models as the backbone, excluding the final global average pooling layer and fully connected layers. Subsequently, a decoder upsamples the features, progressively restoring image resolution.

The initial features $F^0_{RGB}$ and $F^0_{Dep}$ are obtained by applying $\mathrm{Conv}_{3\times 3}$ to the RGB and depth inputs, respectively, where $\mathrm{Conv}_{3\times 3}$ denotes 3 × 3 convolution. The network mainly consists of a four-layer encoder-decoder and introduces two feature fusion modules: the MIM and the PAM. Each layer of the encoder consists of a ResNetLayer. Passing $F^0_i$ through the ResNetLayers yields $F^n_i$; the n-th layer of the encoder can be expressed as:

$$F^n_i = H^n_i\left(F^{n-1}_i\right),$$

where $H^n_i$ (n = 1, 2, 3, 4) represents the n-th ResNetLayer and $i \in \{RGB, Depth\}$ denotes the RGB feature or depth feature. Specifically, the first three levels of RGB features and depth features (ResNetLayer1-ResNetLayer3) of the ResNet encoder are fed into the PAM. Pooling attention weighting is performed on the RGB features and depth features separately to obtain $\tilde{F}^n_{RGB}$ and $\tilde{F}^n_{Dep}$, where n = 1, 2, 3.
Subsequently, the two features are combined by element-wise addition to obtain $\tilde{F}^n_{Con}$, which contains rich spatial location information. Furthermore, the final RGB and depth features from ResNetLayer4 are fed into the MIM to capture complementary information within the two modalities. The output features of the MIM are then fed into the decoder, where each upsampling layer consists of two 3 × 3 convolutional layers, each followed by batch normalization (BN) and ReLU activation. Each upsampling layer doubles the spatial dimensions of the features while halving the number of channels.
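The doubling and halving rule of the decoder can be checked with simple shape arithmetic. The sketch below is only illustrative: the starting shape (2048 channels at 15 × 15) assumes a standard ResNet50 with an output stride of 32 applied to a 480 × 480 input, and the helper name `decoder_shapes` is hypothetical.

```python
def decoder_shapes(channels, height, width, num_layers):
    """Return the (C, H, W) shape before and after each upsampling layer."""
    shapes = [(channels, height, width)]
    for _ in range(num_layers):
        # Each upsampling layer halves the channels and doubles H and W.
        channels, height, width = channels // 2, height * 2, width * 2
        shapes.append((channels, height, width))
    return shapes

# Assumed encoder output for a 480 x 480 input: 2048 channels at 15 x 15.
shapes = decoder_shapes(2048, 15, 15, num_layers=3)
for level, shape in enumerate(shapes):
    print(level, shape)
```

Under these assumptions the three upsampled shapes, (1024, 30, 30), (512, 60, 60) and (256, 120, 120), coincide with the output shapes of ResNetLayer3, ResNetLayer2 and ResNetLayer1 in a standard ResNet50, which is what allows the PAM outputs of those layers to be merged into the decoder at matching resolutions.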

Mathematical Biosciences and Engineering
Volume 19, Issue x, xxx-xxx

Pooling Attention Module
The low-level features extracted by the convolutional neural network capture the fundamental attributes of the input image. These low-level features are critical in modelling the image's foundational characteristics. However, they lack the semantic information of the high-level layers, such as object shapes and categories. At the same time, during the upsampling process in the decoding layer, there is a risk of losing certain semantic information as the image resolution increases. We introduce the Pooling Attention Module (PAM) to address this issue. The PAM enhances the representation of these features by using an attention mechanism to focus on critical areas in the low-level feature map. In the decoding layer, we integrate the PAM's output with the upsampling layer's input, compensating for information loss during the upsampling process. This strategy improves the accuracy of segmentation results and maintains the integrity of semantic information, as shown in Fig. 3.

The input feature $F^n_i \in \mathbb{R}^{h\times w\times c}$, where $i \in \{RGB, Depth\}$ denotes the RGB feature or depth feature, passes through adaptive average pooling to reduce the feature map to a smaller dimension:

$$A = H_{ada}\left(F^n_i\right),$$

where $A \in \mathbb{R}^{h'\times w'\times c}$ represents the feature map resized by adaptive average pooling and $H_{ada}$ denotes the adaptive average pooling operation. $h'$ and $w'$ are the height and width of the output feature map, which we set to $h' = 2$ and $w' = 2$. Then we obtain the output features $A'$ by max pooling the reduced features:

$$A' = H_{max}(A),$$

where $A' \in \mathbb{R}^{1\times 1\times c}$ represents the pooling result and $H_{max}$ denotes the max pooling operation. $A'$ then undergoes a 1 × 1 convolution followed by a sigmoid activation, yielding a weight vector $V \in \mathbb{R}^{1\times 1\times c}$ with values between 0 and 1. Finally, we perform an element-wise product of $F^n_i$ and $V$, and the result $\tilde{F}^n_i$ can be expressed as:

$$V = \sigma\left(\Phi(A')\right), \qquad \tilde{F}^n_i = F^n_i \otimes V,$$

where $\otimes$ denotes the element-wise product, $\Phi$ denotes 1 × 1 convolution, $\sigma$ denotes the sigmoid function, and the feature map $\tilde{F}^n_i$ represents the output feature $\tilde{F}^n_{RGB}$ or $\tilde{F}^n_{Dep}$ in Fig. 2.

We employ a two-step pooling operation instead of conventional global average pooling. First, the input features $F^n_i$ pass through adaptive average pooling to obtain the intermediate feature $A$ with a specified output size. Then, $A$ undergoes max pooling to yield the final result $A'$. This modification makes the network pay more attention to local regions in the image, such as objects near the background in the scene. Meanwhile, adaptive average pooling enhances the module's flexibility, accommodating diverse input feature map dimensions and retaining spatial position information in the depth features; the visualization results in Fig. 5 show the module's effectiveness. The final output $\tilde{F}^n_{Con}$ of the PAM in Fig. 2 is:

$$\tilde{F}^n_{Con} = \tilde{F}^n_{RGB} + \tilde{F}^n_{Dep}.$$

During the upsampling process, $\tilde{F}^n_{Con}$ (n = 1, 2, 3) is used in the three-level decoder (decoder1-decoder3).
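The two-step pooling and channel re-weighting described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the actual implementation: features are stored channel-first as (C, H, W), the 1 × 1 convolution on a 1 × 1 × c input is written as an equivalent matrix-vector product, the adaptive pooling bins follow the common floor/ceil convention, and all names (`adaptive_avg_pool2d`, `pooling_attention`, the random weights) are hypothetical.

```python
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a (C, H, W) array to (C, out_h, out_w) with adaptive bins."""
    c, h, w = x.shape
    out = np.empty((c, out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        h0, h1 = (i * h) // out_h, -((-(i + 1) * h) // out_h)  # floor, ceil
        for j in range(out_w):
            w0, w1 = (j * w) // out_w, -((-(j + 1) * w) // out_w)
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pooling_attention(feature, weight, bias):
    """feature: (C, H, W); weight (C, C) and bias (C,) stand in for the 1x1 conv."""
    pooled = adaptive_avg_pool2d(feature, 2, 2)   # A,  shape (C, 2, 2)
    peak = pooled.max(axis=(1, 2))                # A', shape (C,)
    gate = sigmoid(weight @ peak + bias)          # V,  values in (0, 1)
    return feature * gate[:, None, None]          # channel-wise re-weighting

# Toy features and unshared (per-modality) weights, all randomly generated.
rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((8, 15, 20))
f_dep = rng.standard_normal((8, 15, 20))
w_rgb, w_dep = rng.standard_normal((2, 8, 8))
zero_bias = np.zeros(8)
f_con = (pooling_attention(f_rgb, w_rgb, zero_bias)
         + pooling_attention(f_dep, w_dep, zero_bias))
```

Note that the gate has one value per channel, so the module rescales whole channels and leaves the spatial layout of each feature map unchanged; the output keeps the input shape and can be added across modalities directly.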

Multi-modal Interaction Module
When adjacent objects in an image share similar appearances, distinguishing their categories becomes challenging. Factors such as lighting variations and object occlusion, especially in corners, can cause objects to blend with the background. This makes it difficult to identify object edges precisely, leading to misclassification of an object as part of the background. Depth information is unaffected by lighting conditions and can differentiate between objects and the background based on depth values. Therefore, we designed the MIM to supplement RGB information with depth features. Meanwhile, it utilizes RGB features to strengthen the correlation between RGB and depth features.

The Multi-modal Interaction Module achieves dual-modality feature fusion, as depicted in Fig. 4. Here, $F^4_{RGB} \in \mathbb{R}^{h\times w\times c}$ and $F^4_{Dep} \in \mathbb{R}^{h\times w\times c}$ correspond to the RGB feature and depth feature from ResNetLayer4. The number of feature channels is denoted $c$, and the spatial dimensions are $h \times w$. First, the two feature maps are linearly mapped to generate multi-head query (Q), key (K), and value (V) vectors, where the subscripts 'rgb' and 'dep' mark the vectors derived from the RGB and depth features. These linear mappings are accomplished via fully connected layers, where each attention head possesses its own weight matrix. For each attention head, we calculate the scaled dot product between a set of Q and K and then normalize the result to a range between 0 and 1 using the softmax function to obtain the cross-modal attention masks $W_{rgb}$ and $W_{dep}$, each of the form $\mathrm{softmax}\left(QK^{\top}/\sqrt{d_k}\right)$, where $W_{rgb}$ and $W_{dep}$ represent the RGB attention mask and the depth attention mask, and $d_k$ is the dimension of the vectors.

Then we calculate the RGB weighted feature $\hat{F}_{RGB}$ and the depth weighted feature $\hat{F}_{Dep}$, and obtain the final output features $\tilde{F}^4_{RGB}$ and $\tilde{F}^4_{Dep}$ through a residual connection. $\hat{F}_{RGB}$ is produced by multiplying the value vector $V_{rgb}$ from the RGB feature with the weight matrix $W_{rgb}$, and $\tilde{F}^4_{RGB}$ represents the RGB feature after fusion with depth. Likewise, $\hat{F}_{Dep}$ is produced by multiplying the value vector $V_{dep}$ from the depth feature with the weight matrix $W_{dep}$, and $\tilde{F}^4_{Dep}$ represents the depth feature after fusion with RGB. Finally, we obtain the MIM output through an element-wise sum, which can be formulated as:

$$\tilde{F}^4_{Con} = \tilde{F}^4_{RGB} + \tilde{F}^4_{Dep}.$$
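To make the data flow of the module concrete, here is a rough NumPy sketch of cross-modal multi-head attention with residual connections. Several points are assumptions rather than details confirmed by the text: the feature maps are flattened to (N, C) token matrices with N = h × w, the mask applied to one modality's values is computed from the other modality's Q and K (following the description in the Introduction), the mask and values are combined by a matrix product, and the projection matrices are random placeholders. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, heads):
    """(N, C) -> (heads, N, C // heads)."""
    n, c = x.shape
    return x.reshape(n, heads, c // heads).transpose(1, 0, 2)

def merge_heads(x):
    """(heads, N, d) -> (N, heads * d)."""
    heads, n, d = x.shape
    return x.transpose(1, 0, 2).reshape(n, heads * d)

def attention_mask(q, k, heads):
    """Scaled dot-product attention mask of shape (heads, N, N)."""
    qh, kh = split_heads(q, heads), split_heads(k, heads)
    d_k = qh.shape[-1]
    return softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(d_k))

def mim(f_rgb, f_dep, proj, heads=8):
    """f_rgb, f_dep: (N, C) tokens. proj maps 'rgb'/'dep' to a dict of
    (C, C) projection matrices stored under the keys 'q', 'k' and 'v'."""
    q_r, k_r, v_r = (f_rgb @ proj["rgb"][name] for name in ("q", "k", "v"))
    q_d, k_d, v_d = (f_dep @ proj["dep"][name] for name in ("q", "k", "v"))
    # ASSUMPTION: the mask that weights one modality's values is built from
    # the other modality's Q and K, so information crosses between branches.
    w_rgb = attention_mask(q_d, k_d, heads)   # applied to the RGB values
    w_dep = attention_mask(q_r, k_r, heads)   # applied to the depth values
    hat_rgb = merge_heads(w_rgb @ split_heads(v_r, heads))
    hat_dep = merge_heads(w_dep @ split_heads(v_d, heads))
    out_rgb = hat_rgb + f_rgb                 # residual connection
    out_dep = hat_dep + f_dep
    return out_rgb + out_dep                  # element-wise sum of both branches

# Toy example with random tokens and small random projections.
rng = np.random.default_rng(0)
n, c = 12, 16
f_rgb, f_dep = rng.standard_normal((2, n, c))
proj = {m: {name: 0.1 * rng.standard_normal((c, c)) for name in ("q", "k", "v")}
        for m in ("rgb", "dep")}
fused = mim(f_rgb, f_dep, proj, heads=8)
```

The residual connections mean that even when the attention branch contributes nothing (for example, with all-zero projections), the module still passes the sum of the two input features to the decoder, so the attention terms act as a learned correction on top of plain summation.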

Loss Function
In this paper, the network performs supervised learning on four different levels of decoding features. We employ nearest-neighbor interpolation to reduce the resolution of the semantic labels. Additionally, 1 × 1 convolutions and softmax functions are used to compute the classification probability for each pixel within the output features of the four upsampling layers. The loss function $L_i$ of layer $i$ is the pixel-level cross-entropy loss:

$$L_i = -\frac{1}{N_i}\sum_{p,q} Y(p,q)\,\log Y'(p,q),$$

where $N_i$ denotes the number of pixels in layer $i$, $(p, q)$ is the pixel position, $Y'$ is the classification probability of the output, and $Y$ is the label category. The final loss function $L$ of the network is obtained by summing the pixel-level loss functions of the four decoding layers:

$$L = \sum_{i=1}^{4} L_i.$$

By optimizing the above loss function, the network obtains the final segmentation result in a single training stage.
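The deep-supervision loss can be sketched as follows. This is a minimal NumPy illustration, assuming channel-first logits of shape (K, H, W), integer label maps, and no special handling of unlabeled pixels; the function names and toy sizes are hypothetical.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean cross entropy; logits: (K, H, W) scores, labels: (H, W) class ids."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    # Log-softmax over the class axis.
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(labels.shape)
    # Pick the log-probability of the labelled class at every pixel.
    return -log_prob[labels, rows, cols].mean()

def deep_supervision_loss(logits_per_level, labels_per_level):
    """Sum of the per-level losses over all decoder levels."""
    return sum(pixel_cross_entropy(lg, lb)
               for lg, lb in zip(logits_per_level, labels_per_level))

# Toy example: 5 classes, four levels at 8x8, 4x4, 2x2 and 1x1 resolution.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=(8, 8))
labels_per_level = [labels[::s, ::s] for s in (1, 2, 4, 8)]  # nearest neighbour
logits_per_level = [np.zeros((5,) + lb.shape) for lb in labels_per_level]
total = deep_supervision_loss(logits_per_level, labels_per_level)
```

With all-zero logits every class is equally likely, so each level contributes log 5 and the total equals 4 log 5; this gives a quick sanity check that the four levels are weighted equally.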

Datasets and Evaluation Measures
The NYU-Depth V2 dataset [37] is a widely used indoor scene understanding dataset for computer vision and deep learning research. It is an aggregation of video sequences from various indoor scenes recorded by the RGB-D cameras of the Microsoft Kinect and is an updated version of the NYU-Depth dataset published by Nathan Silberman and Rob Fergus in 2011. It contains 1449 RGB-D images with semantic labels captured in indoor environments. The dataset includes different indoor scenes, scene types, and unlabeled frames, and each object can be represented by a class and an instance number.

The SUN RGB-D dataset [38] contains image samples from multiple scenes, covering various indoor scenes such as offices, bedrooms, and living rooms. It has 37 categories and contains 10335 RGB-D images with pixel-level annotations, of which 5285 are used as training images and 5050 are used as test images. The dataset is captured by four different sensors: Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2. Besides, this densely annotated dataset includes 146,617 2D polygons, 64,595 3D bounding boxes with accurate object orientations, and a 3D room layout as well as an image-based scene category.

We evaluate the results using two standard metrics, Pixel Accuracy (Pixel Acc) and Mean Intersection over Union (mIoU).

mIoU: Intersection over Union (IoU) is a standard measure for semantic segmentation. The IoU of a class is the ratio of the intersection to the union of its ground-truth labels and predicted values, while mIoU is the average IoU over all classes in the dataset:

$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k}p_{ij}+\sum_{j=1}^{k}p_{ji}-p_{ii}},$$

where $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$, $p_{ji}$ represents the number of pixels of class $j$ predicted as class $i$, $p_{ii}$ is the number of correctly predicted pixels, and $k$ represents the number of categories.

Pixel Acc: Pixel accuracy is the simplest metric, representing the proportion of correctly labelled pixels to the total number of pixels:

$$\mathrm{Pixel\ Acc} = \frac{\sum_{i=1}^{k}p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k}p_{ij}},$$

where $p_{ii}$ is the number of correctly predicted pixels, $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and $k$ represents the number of categories.
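Both metrics follow directly from a confusion matrix whose entry (i, j) counts pixels of class i predicted as class j. The NumPy sketch below is one common way to compute them; skipping classes that appear in neither the labels nor the predictions when averaging is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true, pred, k):
    """Entry [i, j] counts pixels of true class i predicted as class j."""
    idx = true.ravel() * k + pred.ravel()
    return np.bincount(idx, minlength=k * k).reshape(k, k)

def pixel_accuracy(cm):
    """Correctly labelled pixels divided by all pixels."""
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    """Average over classes of intersection / union."""
    tp = np.diag(cm)
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp
    valid = union > 0  # ASSUMPTION: ignore classes absent from both maps
    return (tp[valid] / union[valid]).mean()

# Toy 2 x 2 label map with two classes and one misclassified pixel.
true = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(true, pred, k=2)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
```

In this toy case three of four pixels are correct, so the pixel accuracy is 0.75, while the per-class IoU values are 1/2 and 2/3, giving an mIoU of 7/12. The gap between the two numbers shows why mIoU is the stricter metric: one wrong pixel penalizes both the class it belongs to and the class it was assigned to.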

Implementation Details
We implemented and trained our proposed network model using the PyTorch framework. To enhance the diversity of the training data, we applied random scaling and mirroring. Subsequently, all RGB and depth images were resized to 480 × 480 as network inputs, and semantic labels were resized to 480 × 480, 240 × 240, 120 × 120, and 60 × 60 for deep supervision training. As the backbone for our encoder, we utilized a ResNet50 [39] pre-trained on the ImageNet classification dataset [40]. To refine the network structure, following [41,42,43], we replaced the 7 × 7 convolution in the input stem with three consecutive 3 × 3 convolutions. Training was conducted on an NVIDIA GeForce RTX 3090 GPU using stochastic gradient descent. We used a batch size of 6, an initial learning rate of 0.003, 500 epochs, a momentum of 0.9, and a weight decay of 0.0005.
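Two of the preprocessing steps above are easy to get subtly wrong: mirroring must be applied to the RGB image, the depth map and the label together, and the label pyramid must use nearest-neighbour sampling so that no new class ids are invented. The NumPy sketch below illustrates both under stated assumptions: arrays are stored height-first, the flip probability of 0.5 is assumed, and the function names are hypothetical.

```python
import numpy as np

def mirror(rgb, depth, label):
    """Flip all three arrays along the width axis so they stay aligned."""
    return rgb[:, ::-1], depth[:, ::-1], label[:, ::-1]

def label_pyramid(label, factors=(1, 2, 4, 8)):
    """Nearest-neighbour downsampling by integer factors via strided slicing."""
    return [label[::f, ::f] for f in factors]

# Toy sample at the stated input size of 480 x 480 with 40 assumed class ids.
rng = np.random.default_rng(0)
rgb = rng.random((480, 480, 3))
depth = rng.random((480, 480))
label = rng.integers(0, 40, size=(480, 480))

if rng.random() < 0.5:  # ASSUMPTION: mirror with probability 0.5
    rgb, depth, label = mirror(rgb, depth, label)

pyramid_shapes = [lvl.shape for lvl in label_pyramid(label)]
```

Strided slicing keeps every output value equal to one of the original labels, which would not hold for bilinear resizing; the resulting shapes match the four label sizes listed above.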

Quantitative Results on NYUv2 and SUN RGB-D
First, we compare the proposed method against existing approaches on the NYUv2 dataset. Table 1 illustrates our superior performance regarding the mIoU and Acc metrics compared to other methods. Specifically, with ResNet50 serving as the encoder in our network, the pixel accuracy and mean intersection over union (mIoU) for semantic segmentation on the NYUv2 test set reached 77.2% and 51.9%, respectively. For example, compared with RDFNet, which also employs ResNet50, our approach shows an improvement of 2.4% in accuracy (Acc) and 3.2% in mIoU. This underscores the enhancement in segmentation accuracy achieved by MIPANet with the identical ResNet50 architecture. Compared to SGNet, which utilizes ResNet101, our model demonstrates an improvement of 1.6% and 2.3% in Acc and mIoU, respectively. Notably, our ResNet50-based model outperforms this ResNet101-based model, showing the effectiveness of our network structure and multi-modal feature fusion modules. These improvements in segmentation results are achieved without the need for complex networks, leading to reduced training time. Here, "R" represents ResNet, and the symbol '-' signifies that the corresponding metric was not reported. We further compared different network structures across various methods, noting that ESANet incorporates two ResNet18s as the backbone, while ACNet utilizes three ResNet50s as the backbone.

Then, we compared our proposed algorithm with existing methods on the SUN RGB-D dataset. As depicted in Table 2, our approach consistently achieves higher mIoU scores on the SUN RGB-D dataset than all other methods. For instance, MIPANet outperforms SGNet, exhibiting an improvement of 1.3% and 1.7% in Acc and mIoU, respectively. This observation underscores our modules' ability to maintain superior segmentation accuracy, even when dealing with the extensive SUN RGB-D dataset. Across backbone architectures, ResNet101 generally demonstrates better performance than ResNet50, while ResNet50, in turn, outperforms ResNet18. We opted for ResNet50 as our backbone to achieve commendable performance with reduced training time compared to ResNet101. Notably, our method exhibits an increase of 4.5% and 2.1% in mIoU and Acc on both datasets, respectively, compared to the baseline, as highlighted in the red section of the tables.

Visualization results on NYUv2
To visually highlight the advancements made by our method in RGB-D semantic segmentation, we provide visualization results of the network on the NYUv2 dataset. Fig. 5 illustrates the visualization results of the proposed algorithm on the NYUv2 dataset. From left to right, the columns depict the RGB image, the depth image, the results of the baseline model with a ResNet50 backbone, ACNet, ESANet, MIPANet (Ours), and the ground truth. Compared to the baseline, our method yields clearly improved segmentation results. Notably, the dashed boxes in the figure show that our network, enriched with depth information, accurately distinguishes objects from the background. For instance, in the fourth image, the baseline erroneously categorizes the mirror on the wall as part of the background, and in the second image, ACNet and ESANet mistake the carpet for a part of the floor. In contrast, leveraging depth information, our network discerns the distinct distance information of the mirror from the background, leading to a correct classification of the mirror. The algorithm presented in this paper achieves precise segmentation in diverse and intricate indoor scenes. Moreover, it excels in segmenting challenging objects like "carpets" and "books" while delivering finer-edge segmentation results.

Ablation Study on PAM and MIM on NYUv2
We conducted ablation experiments comparing PAM and MIM configurations on the NYUv2 dataset, as shown in Fig. 6. Specifically, the RGB feature and depth feature are input to the PAM to obtain $\tilde{F}^n_{RGB}$ and $\tilde{F}^n_{Dep}$. Given the modality differences, we addressed the parameter-sharing issue in PAM. Moreover, considering the impact of network depth on information interaction, we applied MIM in both Layer 3 and Layer 4 of the encoder. Fig. 6 presents the results of ablation studies on PAM and MIM using different configurations (B1-B4) on the NYUv2 dataset: B1 (PAM without shared parameters and MIM used on ResNetLayer4), B2 (PAM with shared parameters and MIM used on ResNetLayer4), B3 (PAM without shared parameters and MIM used on ResNetLayer3 and ResNetLayer4), B4 (PAM with shared parameters and MIM used on ResNetLayer3 and ResNetLayer4).

Ablation Study on NYUv2 and SUN-RGBD
To investigate the impact of different modules on segmentation performance, we conducted ablation experiments on the NYUv2 and SUN-RGBD datasets, as depicted in Table 3. The marks in Table 3 indicate whether a particular module is used or not. For instance, our PAM module exceeds the baseline by 1.5% and 0.9% in mIoU and Acc, respectively. Similarly, our MIM module exceeds the baseline by 3.7% and 1.9% in mIoU and Acc, respectively. This result suggests that each proposed module can independently enhance segmentation accuracy. Our modules surpass the baseline in fusing cross-modal features, yielding superior results on both datasets. Using both the PAM and MIM modules, we achieved the highest mIoU of 51.9% on the NYUv2 dataset and the highest mIoU of 48.8% on the SUN RGB-D dataset. This result highlights that our two modules can be jointly optimized to enhance segmentation accuracy.

Conclusions
In this paper, we tackle a fundamental challenge in RGB-D semantic segmentation: efficiently fusing features from two distinct modalities. We designed a Multi-modal Interaction and Pooling Attention network, which uses a small and flexible PAM module in the shallow layers of the network to enhance its feature extraction capability and uses a MIM module in the last layer of the network to integrate RGB features and depth features effectively. We use the complementary information between the RGB and depth modalities to improve the accuracy of semantic segmentation in indoor scenes. In future work, we will extend our method to enhance its generalization ability in RGB-D semantic segmentation. Furthermore, we anticipate performance improvements from integrating tasks like depth estimation into the existing framework, facilitating collaborative network interactions.

Limitation. Our method's effectiveness has been validated exclusively on CNN architectures; we have not yet verified it on other network architectures, such as Transformers. In addition, the requirement to input both RGB and depth images at test time limits the network's generalization ability. Consequently, the network may not achieve optimal segmentation results for datasets lacking depth information.

Figure 1. Improved segmentation accuracy by leveraging depth features within our MIPANet. The prediction result can accurately distinguish the trash can and printer from the background.

Figure 2. Multi-modal Interaction And Pooling Attention (MIPA) Network architecture. Each PAM at different network levels generates two weight-unshared features: RGB features denoted as $\tilde{F}^n_{RGB}$ and depth features denoted as $\tilde{F}^n_{Dep}$. Following an element-wise sum, we obtain $\tilde{F}^n_{Con}$, where n denotes the network level. MIM receives RGB and depth features from ResNetLayer4 and integrates the fusion result $\tilde{F}^4_{Con}$ into the decoder.

Figure 3. The details of the Pooling Attention Module. After a two-step pooling operation, we obtain the pooling result $A'$. Subsequently, a 1 × 1 convolution and a sigmoid activation function constrain the values of the weight vector $V$ (e.g., yellow) between 0 and 1. The output feature $\tilde{F}^n_i$ is obtained by weighting the input feature $F^n_i$ with $V$.

Figure 4. Multi-modal Interaction Module. The RGB feature and the depth feature undergo linear transformations to generate two sets of Q, K, V (e.g., blue line) for multi-head attention, where h denotes the number of attention heads, set to 8. The weighted summation of the input features $F^4_{RGB}$ and $F^4_{Dep}$ yields $\tilde{F}^4_{RGB}$ and $\tilde{F}^4_{Dep}$, which are then added element-wise to obtain the output result $\tilde{F}^4_{Con}$.

Figure 5. Visual results of MIPANet on the NYUv2 dataset. The optimization effect is particularly notable within the red dotted boxes.

Figure 6. Ablation Study on PAM and MIM. When set to B1, the best segmentation result is 51.9%.

Table 1. MIPANet compared to the state-of-the-art methods on the NYUDv2 dataset.

Table 2. MIPANet compared to the state-of-the-art methods on the SUN RGB-D dataset.

Table 3. Ablation studies on the NYUDv2 and SUN-RGBD datasets for PAM and MIM.