ORIGINAL RESEARCH article

Front. Remote Sens., 12 January 2026

Sec. Data Fusion and Assimilation

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1744950

This article is part of the Research Topic: Advanced Artificial Intelligence for Remote Sensing: Methods and Applications in Earth and Environmental Monitoring.

RoCD: leveraging foundation vision models with refine-and-fuse framework for robust change detection

Mengwei Li, Tongle Zhao, Yingchao Liu, Ruteng Yu and Lukun Wang*
  • School of Intelligent Equipment, Shandong University of Science and Technology, Tai’an, Shandong, China

In recent years, Foundation Vision Models (FVMs) have opened new technical avenues for change detection and understanding in remote sensing images, owing to their strong generalization and multi-scale representation capabilities. However, in complex spatiotemporal scenarios, existing methods still face two major challenges: insufficient feature interaction and an imbalance between global and detail representation. To address these challenges, this paper proposes RoCD, which introduces the Refine-and-Align Framework (RAF) and the FusionR-Decoder on top of a frozen foundation vision model, the FastSAM encoder. First, RAF introduces a pairwise difference refinement (PDR) mechanism to enhance feature interaction and effectively suppress spurious changes caused by inter-domain differences. Second, the FusionR-Decoder embeds a three-branch RBlock based on state space models (SSMs) in the multi-scale decoding stage to achieve long-range dependency modeling and global consistency constraints. Experimental results on the three public datasets LEVIR-CD, LEVIR-CD+, and WHU-CD show that RoCD achieves F1/IoU scores of 92.11/85.38, 87.68/76.90, and 95.95/91.04, respectively.

1 Introduction

Remote sensing change detection (RSCD) refers to the process of identifying surface changes by analyzing multi-temporal remote sensing images (RSIs) acquired over the same geographical area. Change detection (CD) techniques are widely applied in land-use monitoring (Shi et al., 2020), urban expansion analysis (Buch et al., 2011), natural disaster assessment (Brunner et al., 2010), and military applications (Gong et al., 2015). With the large-scale acquisition of high-resolution RSIs, efficiently and accurately detecting changes in massive data volumes has become a research hotspot in remote sensing image analysis. However, due to complex factors such as sensor variations, illumination and seasonal differences, geometric misregistration, and scale inconsistency, CD tasks still face significant challenges (Cheng et al., 2024).

Early RSCD methods evolved from pixel-based processing to object-based change detection (OBCD) (Jung et al., 2020; Jiang et al., 2024). Pixel-based methods are highly susceptible to noise, registration errors, and illumination variations (Sundaresan et al., 2007). Building upon OBCD, machine learning methods (Huo et al., 2016; Xie et al., 2019) were subsequently introduced to improve the classification accuracy of changed regions and reduce noise interference. However, these methods rely on manually designed features, and their generalization and feature representation capabilities remain limited when processing large-scale remote sensing data (Cheng et al., 2024). To overcome this bottleneck, researchers began to explore the potential of deep neural networks such as CNNs, attention mechanisms, and Transformers in change detection.

In recent years, Gu et al. presented a region-based convolutional neural network that combines the U-Net model with a region change detection strategy to address the insufficient feature representation of traditional CD methods (Gu, 2024). Lu et al. combined CNNs with a combined attention mechanism to propose CANet (Lu et al., 2021), which further refines feature representation and focuses on correctly detecting small change regions and delineating the boundaries of changed areas. Bandara and Patel proposed ChangeFormer (Bandara and Patel, 2022), and Chen et al. proposed BIT (Hao Chen and Shi, 2021), which made significant progress in multi-scale modeling and bi-temporal semantic modeling, respectively. Moreover, Xiong et al. introduced a hierarchical Transformer that provides global context modeling and multi-scale features to better handle building boundary information (Xiong et al., 2024). In 2024, the Mamba state-space model was introduced into change detection, achieving more efficient long-range dependency modeling through selective state-space modeling (Chen et al., 2024). Owing to Mamba's excellent global feature capture capability, researchers began to explore hybrid architectures that combine Mamba with traditional deep learning networks. Liu et al. used CNNs to capture local details and Mamba to model global dependencies, thereby improving the representation of both local and global features (Liu et al., 2025).

The aforementioned methods represent a significant research line in RSCD feature modeling. Beyond this line, another noteworthy development has emerged. Foundation Vision Models (FVMs) (Bommasani, 2021) provide highly generalized multi-scale semantic representations for downstream tasks and exhibit good transferability in remote sensing (Lu et al., 2025). However, when FVMs are directly applied to cross-temporal CD, complex spatiotemporal scenarios still pose two core challenges (Huo et al., 2025). First, insufficient interaction between bi-temporal features allows change information to be easily submerged by noise, leading to missed and false detections. Second, a large receptive field benefits context modeling but can easily degrade boundaries and small targets, while relying solely on shallow details lacks global consistency and makes predictions difficult to stabilize.

To address these two issues, we propose the RoCD framework, which introduces two key modules, the Refine-and-Align Framework (RAF) and the FusionR-Decoder, on top of frozen FastSAM-encoded features to form a closed loop of "refine and align first, then progressively fuse". RAF unifies the bi-temporal feature space through a multi-scale adapter and introduces a PDR mechanism to effectively suppress spurious changes caused by inter-domain differences while enhancing real structural changes. The FusionR-Decoder combines long-range dependency modeling with cross-temporal, cross-scale, and feedback coupling in the multi-scale decoding stage, balancing global consistency and fine-grained boundary fidelity. In summary, our contributions are as follows:

• We propose RAF to address the insufficient interaction between features in two temporal phases. This module enhances feature correlation between two temporal phases through structured differencing and adaptive alignment, enabling the model to capture real-world variations more effectively.

• To address the imbalance between global consistency and detailed representation, we design FusionR-Decoder. This module employs a three-branch refinement structure, progressively recovering details while modeling long-range dependencies, achieving a balance between global and local representations in the decoding stage.

• Without altering the underlying visual model, RoCD is a method for cross-temporal scenarios that combines differential refinement, alignment and recalibration, and feedback-enhanced decoding. Furthermore, we experimentally verified the accuracy and robustness of RoCD under complex spatiotemporal conditions on three commonly used CD datasets.

2 Related works

2.1 CNN/Siamese and attention-based methods

Early and mainstream CD approaches often adopt a dual-branch or weight-sharing encoder–decoder framework. FC-Siam-conc and FC-Siam-diff extract bi-temporal features with weight-shared Siamese encoders and then fuse them by concatenation or differencing before decoding, serving as lightweight baselines. SNUNet-CD (Fang et al., 2022) further introduces nested and dense connections on top of U-Net to strengthen cross-layer feature reuse, enhancing the representation of small objects and boundaries. These approaches mainly emphasize multi-scale fusion and explicit/implicit differencing.

Subsequent studies incorporated attention mechanisms across spatial, channel, and spatiotemporal dimensions. STANet (Chen and Shi, 2020) employs spatiotemporal attention to align salient regions across temporal inputs, highlighting real changes while suppressing background noise. HANet (Han et al., 2023) integrates hybrid spatial–channel attention to improve semantic consistency and emphasize key regions, thereby reducing false alarms. Attention-based methods enhance focus on change-related regions and improve background suppression while maintaining stable training and computational efficiency, and have thus become important baselines in RSCD. However, these methods still share common bottlenecks: most perform feature differencing only at shallow layers or through simple concatenation. This makes them susceptible to spurious changes and boundary tearing under domain shifts caused by seasonal variations, lighting differences, or sensor inconsistencies, as well as under slight registration errors between temporal images. This work extends the two-branch Siamese paradigm, focusing on more structured pairwise feature modeling in the difference domain and robust fusion in the multi-scale decoding stage.

2.2 State space models

In recent years, temporal modeling with global context has become an active research direction. Unlike conventional CNN-based architectures, Transformer and hybrid models leverage self-attention to enhance global dependencies across temporal inputs. For example, BIT introduces Transformer-based temporal interaction to capture long-range dependencies, while ChangeFormer combines Transformer encoders with convolutional decoders to balance global semantics and local boundary details. However, Transformers suffer from quadratic computational cost, limiting their scalability to high-resolution RS imagery. To address this, SSMs (Gu et al., 2022) provide an alternative by formulating sequence modeling as a linear recurrence with gating. SSMs can efficiently capture long-range dependencies with linear complexity, offering better scalability. The Mamba family (Gu and Dao, 2023; Dao and Gu, 2024) further improves efficiency through selective state updates, reducing memory usage and latency while maintaining strong representational power. Several variants have been adapted for 2D vision tasks.
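
To make the linear-recurrence view concrete, the sketch below shows a minimal (non-selective) state-space scan, h_t = A h_{t-1} + B x_t, y_t = C h_t, in plain PyTorch with hypothetical dimensions; the Mamba family additionally makes the parameters input-dependent and relies on hardware-aware scan implementations.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    x: (T, d_in) input sequence; A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    Runs in O(T) time, unlike the O(T^2) cost of self-attention."""
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]      # state update (linear recurrence)
        ys.append(C @ h)          # readout
    return torch.stack(ys)

# toy usage with hypothetical sizes: sequence length 128, 16 input dims, 8 states, 4 outputs
y = ssm_scan(torch.randn(128, 16), 0.9 * torch.eye(8), torch.randn(8, 16), torch.randn(4, 8))
print(y.shape)  # torch.Size([128, 4])
```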

In RSCD, applications of SSMs are rapidly expanding. ChangeMamba (Chen et al., 2024) first introduced Mamba into the CD task and verified the advantages of SSMs in cross-temporal consistency and complex backgrounds. RS-Mamba (Zhao et al., 2024) proposed remote-sensing-specific backbones for dense high-resolution prediction, employing cross/diagonal scanning strategies to enlarge the receptive field without sacrificing resolution. CDMamba (Zhang et al., 2025) emphasized global–local fusion, combining long-range SSMs with boundary refinement modules such as selective residuals and context gating. These efforts highlight the capability of SSMs to improve both large-scale context and fine-grained detail recognition. Following this line of research, the Mamba variant used in this paper applies three-branch (T1, T2, and feedback) long-range dependency coupling in the multi-scale decoding stage and collaborates with pairwise refined differential modeling to balance global consistency and detail fidelity under complex conditions.

2.3 Foundation vision models

The rise of FVMs has greatly advanced computer vision. Early FVMs such as the Masked Autoencoder (MAE) (He et al., 2022) achieved robust feature representations through large-scale self-supervised learning, improving generalization for downstream tasks. Later, the Segment Anything Model (SAM) (Kirillov et al., 2023) brought FVMs to a new level. SAM introduced a promptable framework (points, boxes, or text) combined with a powerful encoder to enable segmentation of arbitrary objects. Its variants, such as FastSAM (Zhao et al., 2023) and MobileSAM (Zhang et al., 2023), further optimized inference speed and lightweight deployment. These models demonstrated unprecedented generalization and transferability, making them valuable for high-resolution and low-annotation tasks.

In the RSCD domain, researchers have begun adapting FVMs to cross-temporal scenarios. MAE-CD (Song et al., 2024) explored MAE-pretrained features for CD to alleviate annotation scarcity. SAM-CD (Ding et al., 2024) was the first to employ SAM as a backbone for CD, freezing the encoder and introducing task-specific decoders to extract and fuse cross-temporal features, achieving strong performance across multiple datasets. These studies show that FVMs provide powerful general-purpose visual features, but direct adoption often limits their role to backbone extraction. Without task-specific decoding and alignment, temporal differences cannot be fully exploited. In this work, we adopt SAM-CD as our baseline and introduce targeted refinements to improve robustness and detection accuracy. Meanwhile, we continue to use FastSAM from SAM-CD instead of the heavier SAM as the frozen backbone. FastSAM inherits the general segmentation semantics and multi-scale representation capabilities of the SAM series, while having higher inference efficiency and lower computational overhead, enabling us to focus our research on structured solutions to the two key bottlenecks of cross-temporal feature interaction and global-detail balance.

3 Proposed methods

3.1 Overview

The overall framework of RoCD is illustrated in Figure 1; it mainly consists of three components: the FastSAM encoder, the Refine-and-Align Framework (RAF), and the FusionR-Decoder. First, the model adopts the lightweight FastSAM as the backbone to extract features from the bi-temporal remote sensing images, producing multi-scale representations. These representations contain both spatial details and high-level semantics, providing a rich basis for subsequent differencing and decoding.

Figure 1. Overall framework of RoCD, where a denotes spatial attention, pdr denotes pairwise difference refinement, and head denotes the channel-reduction head.

Next, the extracted features are passed into the RAF module, which performs PDR and multi-scale adaptation in a joint manner. Details of RAF will be presented in Section 3.2. This module explicitly aligns cross-temporal and cross-domain features, thereby suppressing false changes caused by illumination, seasonal, and sensor variations while enhancing real structural changes. The refined multi-scale features from RAF are then progressively fed into the FusionR-Decoder. This decoder employs the RBlock as its core unit to facilitate spatiotemporal interactions across scales, while a feedback branch is introduced to strengthen long-range dependency modeling. More details on the FusionR-Decoder are provided in Section 3.3. Finally, the decoder restores fine-grained boundaries while maintaining global consistency, and the prediction head (resCD Head) produces the binary change map.
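
The following sketch illustrates the overall data flow of Figure 1 as we read it (frozen encoder, RAF, FusionR-Decoder, prediction head). The class and the injected sub-modules are placeholders for illustration, not the authors' released code, and the smoke test uses trivial single-scale stand-ins.

```python
import torch
import torch.nn as nn

class RoCD(nn.Module):
    """Schematic forward pass mirroring Figure 1: frozen encoder -> RAF -> FusionR-Decoder -> head."""
    def __init__(self, encoder, raf, decoder, head):
        super().__init__()
        self.encoder = encoder               # frozen FastSAM backbone (stand-in here)
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # keep the foundation model frozen
        self.raf = raf                       # Refine-and-Align Framework
        self.decoder = decoder               # FusionR-Decoder (RBlocks + feedback branch)
        self.head = head                     # channel-reduction head -> change logits

    def forward(self, img_t1, img_t2):
        f1 = self.encoder(img_t1)            # features of T1 (multi-scale in the real model)
        f2 = self.encoder(img_t2)            # features of T2
        r1, r2 = self.raf(f1, f2)            # pairwise difference refinement + alignment
        fused = self.decoder(r1, r2)         # progressive coarse-to-fine fusion
        return self.head(fused)              # logits of the binary change map

# shape-only smoke test with trivial stand-ins for the real components
enc = nn.Conv2d(3, 8, 3, padding=1)
model = RoCD(enc, raf=lambda a, b: (a, b), decoder=lambda a, b: a - b, head=nn.Conv2d(8, 1, 1))
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```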

3.2 Refine-and-Align Framework (RAF)

For the s-th scale feature map $F_s^{(t)} \in \mathbb{R}^{C_s \times H_s \times W_s}$ output by the backbone, the adapter $A_s$ applies a 1×1 convolution (Conv1×1), batch normalization (BN), ReLU activation, and dropout to project the input into a unified target channel dimension $\tilde{C}_s$. Typical values are $\tilde{C}_{32} = 160$, $\tilde{C}_{16} = 160$, $\tilde{C}_{8} = 80$, and $\tilde{C}_{4} = 40$. The overall structure of RAF is shown in Figure 2.

Figure 2. Structure of the RAF module; in the diagram, − represents the signed difference.

At the s-th scale, a signed difference of the projected bi-temporal features $\tilde{F}_s^{(1)}$ and $\tilde{F}_s^{(2)}$ is first computed. The resulting differential feature $D_s$ is then structured and modeled through three branches. First, $D_s$ is differentially refined: after a series of convolution, normalization, and related operations, the refined feature $U_s$ is obtained. The calculation is given in Equation 1.

$D_s = \tilde{F}_s^{(1)} - \tilde{F}_s^{(2)}, \quad D_s \in \mathbb{R}^{\tilde{C}_s \times H_s \times W_s}$  (1)

Simultaneously, another branch forms the differential auxiliary feature $W_s$, which can be regarded as a deactivation of $U_s$. Finally, $U_s$ and $W_s$ are concatenated along the channel dimension and projected to form $V_s$.

From the obtained $W_s$ and $V_s$, a softmax-based attention map is constructed and grouped convolution is applied to obtain $G_s$. Global average pooling is then performed to obtain a scale-level gating vector that generates the residual back-injection coefficient. A learnable scaling parameter $\alpha$ is used to construct the gate, which is finally back-injected into the original projected features as a residual to obtain the refined bi-temporal features. The overall operation is given in Equations 2–4.

$G_s = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{\mathrm{groups}}\big(\mathrm{softmax}(V_s) \odot W_s\big)\big)\big)$  (2)
$\gamma_s = \sigma\big(\mathrm{AvgPool}(G_s)\big)$  (3)
$\hat{F}_s^{(t)} = \big(1 + \alpha\,\gamma_s\big) \odot \tilde{F}_s^{(t)}, \quad t \in \{1, 2\}$  (4)

Here ⊙ denotes element-wise multiplication, $\mathrm{Conv}_{\mathrm{groups}}$ denotes grouped convolution, softmax denotes channel/spatial-wise normalization, $\sigma$ denotes the sigmoid activation function, and AvgPool denotes global average pooling.
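
A minimal PyTorch sketch of the PDR step described by Equations 1–4 is given below. The exact convolution stacks for $U_s$, the form of the auxiliary feature $W_s$, and the spatial softmax are our assumptions where the text leaves them unspecified; the sketch only illustrates the signed-difference, gating, and residual back-injection pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDR(nn.Module):
    """Sketch of pairwise difference refinement (Eqs. 1-4) under assumed layer choices."""
    def __init__(self, channels, groups=4, alpha_init=0.1):
        super().__init__()
        # refinement branch producing U_s from the signed difference D_s
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # auxiliary branch producing W_s (interpreted here as a complementary view of U_s)
        self.aux = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.proj = nn.Conv2d(2 * channels, channels, 1)                       # cat(U_s, W_s) -> V_s
        self.group_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)  # Eq. 2
        self.bn = nn.BatchNorm2d(channels)
        self.alpha = nn.Parameter(torch.tensor(alpha_init))                    # learnable scaling alpha

    def forward(self, f1, f2):
        d = f1 - f2                                                  # Eq. 1: signed difference D_s
        u = self.refine(d)                                           # refined difference U_s
        w = self.aux(-u)                                             # auxiliary feature W_s (assumed form)
        v = self.proj(torch.cat([u, w], dim=1))                      # V_s
        attn = torch.softmax(v.flatten(2), dim=-1).view_as(v)        # spatial softmax over V_s (assumption)
        g = F.relu(self.bn(self.group_conv(attn * w)))               # Eq. 2: G_s
        gamma = torch.sigmoid(F.adaptive_avg_pool2d(g, 1))           # Eq. 3: scale-level gate gamma_s
        gate = 1.0 + self.alpha * gamma                              # Eq. 4: residual back-injection
        return gate * f1, gate * f2                                  # refined bi-temporal features

# toy usage at the stride-4 scale (40 channels)
pdr = PDR(channels=40)
a, b = pdr(torch.randn(2, 40, 64, 64), torch.randn(2, 40, 64, 64))
print(a.shape)  # torch.Size([2, 40, 64, 64])
```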

3.3 FusionR-decoder

As shown in Figure 3, we design a multi-scale fusion and refinement decoder named FusionR-Decoder, which aims to progressively integrate dual-temporal features through hierarchical decoding and multi-branch refinement. At each scale $s \in \{4, 8, 16, 32\}$, the decoder receives the projected feature maps $\hat{F}_s^{(1)}$ and $\hat{F}_s^{(2)}$ for t1 and t2 from the RAF module. An upsampling–concatenation–convolution structure is then applied to recover spatial resolution and fuse temporal semantics.

Figure 3. Structure of the FusionR-Decoder. In the diagram, C represents channel-wise concatenation, ↑ represents upsampling, and other symbols are consistent with those in the corresponding formulas.

Formally, for the s-th scale, the decoded feature $D_s^{(i)}$ of branch $i \in \{1, 2\}$ is obtained by Equation 5, where $\mathrm{Up}(\cdot)$ denotes bilinear upsampling and $\mathrm{Dec}_s(\cdot)$ represents a decoding block consisting of two 3×3 convolutions, each followed by batch normalization and ReLU activation.

$D_{16}^{(i)} = \mathrm{Dec}_{16}\big(\mathrm{cat}\big(\mathrm{Up}(\hat{F}_{32}^{(i)}),\, \hat{F}_{16}^{(i)}\big)\big), \quad s = 16$
$D_{s}^{(i)} = \mathrm{Dec}_{s}\big(\mathrm{cat}\big(\mathrm{Up}(D_{2s}^{(i)}),\, \hat{F}_{s}^{(i)}\big)\big), \quad s = 8, 4$  (5)

To effectively model cross-temporal dependencies and refine discriminative change information, we further introduce a three-branch refinement block at each decoding stage. As shown in Equation 6, the features from the t1 branch $D_s^{(1)}$, the t2 branch $D_s^{(2)}$, and the feedback feature $F_{fb}$ are respectively processed through the state-space operator $\mathcal{M}(\cdot)$ and then concatenated:

$R_{16} = \mathrm{cat}\big(\mathcal{M}(D_{16}^{(i)}),\, \mathcal{M}\big(\mathrm{cat}(D_{16}^{(1)}, D_{16}^{(2)})\big)\big), \quad s = 16$
$R_{s} = \mathrm{cat}\big(\mathcal{M}(R_{2s}),\, \mathcal{M}(D_{s}^{(i)})\big), \quad s = 8, 4$  (6)

Finally, the refined multi-branch feature $F_s^{\mathrm{ref}}$ is propagated to the next decoding stage, enabling progressive fusion from coarse to fine scales. After the final decoding level, spatial attention is applied to enhance the changed regions, and the final prediction mask is generated through a 1×1 convolution.
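
The sketch below illustrates one decoding stage of the FusionR-Decoder as described by Equations 5 and 6, with a depthwise convolution standing in for the state-space operator $\mathcal{M}(\cdot)$ and an assumed form for the feedback projection; a faithful implementation would substitute a Mamba/SSM block and the authors' exact channel configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dec_block(in_ch, out_ch):
    """Decoding block Dec_s: two 3x3 conv + BN + ReLU, applied after upsample-and-concat (Eq. 5)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FusionRStage(nn.Module):
    """One FusionR-Decoder stage (Eqs. 5-6); the depthwise conv is a placeholder for M(.)."""
    def __init__(self, skip_ch, prev_ch, out_ch):
        super().__init__()
        self.dec = dec_block(prev_ch + skip_ch, out_ch)                         # Dec_s(cat(Up(.), skip))
        self.ssm = nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch)       # stand-in for M(.)
        self.fb_first = nn.Conv2d(2 * out_ch, out_ch, 1)   # feedback at the coarsest stage: cat(D1, D2)
        self.fb_next = nn.Conv2d(prev_ch, out_ch, 1)       # feedback at finer stages: previous R
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)       # merge the three refined branches

    def forward(self, prev_t1, prev_t2, skip_t1, skip_t2, prev_r=None):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        d1 = self.dec(torch.cat([up(prev_t1), skip_t1], dim=1))                 # Eq. 5, branch t1
        d2 = self.dec(torch.cat([up(prev_t2), skip_t2], dim=1))                 # Eq. 5, branch t2
        if prev_r is None:
            fb = self.fb_first(torch.cat([d1, d2], dim=1))                      # Eq. 6, s = 16 case
        else:
            fb = self.fb_next(up(prev_r))                                       # Eq. 6, s = 8, 4 case
        r = self.fuse(torch.cat([self.ssm(d1), self.ssm(d2), self.ssm(fb)], dim=1))
        return d1, d2, r                                    # refined features for the next stage

# toy usage: two consecutive stages with assumed channel widths (160 -> 80 -> 40)
stage16 = FusionRStage(skip_ch=160, prev_ch=160, out_ch=80)
stage8 = FusionRStage(skip_ch=80, prev_ch=80, out_ch=40)
f32, f16, f8 = torch.randn(1, 160, 16, 16), torch.randn(1, 160, 32, 32), torch.randn(1, 80, 64, 64)
d1, d2, r = stage16(f32, f32, f16, f16)                     # s = 16
d1, d2, r = stage8(d1, d2, f8, f8, prev_r=r)                # s = 8
print(d1.shape, r.shape)  # torch.Size([1, 40, 64, 64]) torch.Size([1, 40, 64, 64])
```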

4 Experimental results

4.1 Data description and implementation details

We conducted a comprehensive comparison of RoCD against several state-of-the-art (SOTA) methods on three widely used change detection datasets: LEVIR-CD (Chen and Shi, 2020), LEVIR-CD+ (Chen and Shi, 2020), and WHU-CD (Ji et al., 2019). The LEVIR-CD dataset contains 637 pairs of high-resolution (0.5 m) remote sensing images of 1024 × 1024 pixels, annotated with more than 31,000 building change instances, primarily covering the construction and demolition of urban buildings. LEVIR-CD+ is an extended version of LEVIR-CD with a larger volume of data and more diverse scenarios. WHU-CD mainly focuses on factory buildings and urban areas, which feature more complex backgrounds and textures.

In terms of implementation, RoCD is developed with the PyTorch framework, and all experiments are conducted on an NVIDIA RTX 4090 GPU. The AdamW optimizer is employed with an initial learning rate of 1×10⁻³. A cosine annealing strategy with a power exponent of 1.5 is used to dynamically adjust the learning rate, ensuring smooth and stable convergence during training. The weight decay coefficient is set to 0.01, and gradients are clipped to a maximum norm of 1.0 to prevent gradient explosion. The batch size is set to 4, and the model is trained for a total of 50 epochs. The loss function combines a binary cross-entropy loss and a latent similarity loss to jointly optimize the discriminability of change regions and the temporal consistency of features. During training, input images are randomly cropped to 512×512 patches and randomly flipped to enhance robustness, while a sliding-window strategy is adopted during validation to ensure spatial completeness of the predictions. Throughout training, the model checkpoint with the highest F1 score on the validation set is saved for subsequent visualization and comparative analysis.
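
For reference, the snippet below mirrors the reported optimization settings (AdamW, initial learning rate 1×10⁻³, weight decay 0.01, gradient clipping at 1.0) in PyTorch; the scheduler is sketched with vanilla cosine annealing, and the latent similarity term is omitted because its exact form is not specified here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training(model, epochs=50, steps_per_epoch=1000):
    """Optimizer, scheduler, and base loss mirroring the reported hyper-parameters (sketch only)."""
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
    bce = torch.nn.BCEWithLogitsLoss()
    return optimizer, scheduler, bce

def train_step(model, optimizer, scheduler, bce, img_t1, img_t2, target):
    optimizer.zero_grad()
    logits = model(img_t1, img_t2)
    loss = bce(logits, target)   # the latent similarity term would be added here (not shown)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    return loss.item()
```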

4.2 Quantitative comparison and visualization analysis

We selected six representative SOTA change detection methods for comparison, including STANet (Chen and Shi, 2020), SNUNet (Fang et al., 2022), BIT (Hao Chen and Shi, 2021), ChangeFormer (Bandara and Patel, 2022), ChangeMamba (Chen et al., 2024), and SAM-CD (Ding et al., 2024). These methods span different technical paradigms, from CNN-based approaches to Transformer architectures, Mamba state-space models, and foundation vision model (FVM) transfer, representing the mainstream evolution of change detection research. To ensure fairness, all baseline results were retrained using the same protocol as RoCD. We compared RoCD with the aforementioned SOTA models on three publicly available datasets. The datasets cover typical application scenarios such as urban building demolition and reconstruction, complex backgrounds, and multiple scene changes, and can comprehensively reflect the generalization and robustness of the models.

The experimental results are shown in Tables 1–3. RoCD consistently outperforms the SOTA methods listed in the tables, except for the precision (Pre) metric on LEVIR-CD and LEVIR-CD+. Using F1 and IoU as the main indicators, on LEVIR-CD the F1 reaches 92.11% and the IoU reaches 85.38%. Compared with the baseline SAM-CD, RoCD improves F1 and IoU by approximately 1.18% and 2.01%, respectively. This indicates that the RAF and FusionR-Decoder introduced by RoCD not only exploit the representational capabilities of the pre-trained FVM but also ensure sufficient feature interaction and effectively mitigate the interference of pseudo-changes on prediction accuracy. On the LEVIR-CD+ and WHU-CD datasets, RoCD also maintains a leading position, which further demonstrates its ability to balance global consistency and fine-grained fidelity.
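
The F1 and IoU scores reported in Tables 1–3 follow the standard confusion-matrix definitions over binary change masks; the helper below is our own illustration of those formulas, not the evaluation code used in the paper.

```python
import numpy as np

def binary_cd_metrics(pred, gt):
    """Precision, recall, F1 and IoU for binary change masks (pred, gt: 0/1 arrays of equal shape)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # changed pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()       # false alarms
    fn = np.logical_and(~pred, gt).sum()       # missed changes
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    iou = tp / (tp + fp + fn + 1e-12)
    return precision, recall, f1, iou
```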

Table 1. Quantitative comparison of different methods on the Levir-CD dataset. The best results are highlighted in bold.

Table 2. Quantitative comparison of different methods on the Levir-CD + dataset. The best results are highlighted in bold.

Table 3. Quantitative comparison of different methods on the WHU-CD dataset. The best results are highlighted in bold.

Furthermore, to more intuitively analyze the model’s detection performance, we visualized and compared typical samples, as shown in Figure 4. RoCD, through the cross-temporal information refined by differential decoding, forms a more coherent response pattern within the target, reducing fragmented predictions caused by insufficient interaction. Simultaneously, the global consistency constraint and progressive detail recovery of fusion decoding ensure consistency between global semantics and local boundaries, resulting in more complete and continuous boundary reconstruction and clearer small-scale structures. These visualization results further demonstrate that RoCD can simultaneously enhance cross-temporal feature interaction and balance global and detailed representations, thereby improving the discriminability and consistency of changing regions.

Figure 4. Visualization results on LEVIR-CD, LEVIR-CD+, and WHU-CD. (A) Results on LEVIR-CD. (B) Results on LEVIR-CD+. (C) Results on WHU-CD. Columns (a–f) correspond to the comparison models STANet, SNUNet, BIT, ChangeFormer, ChangeMamba, and SAM-CD. True negatives are shown in black, true positives in white, false positives in red, and false negatives in blue.

4.3 Ablation experiments

To evaluate the contribution of each module to the overall performance, we conducted a step-by-step ablation study on the LEVIR-CD dataset. As shown in Table 4, the baseline model, which consists only of the SAM encoder and a basic decoder structure, achieves an F1 score of 90.93% and an IoU of 83.37%. When only the FusionR-Decoder is introduced, the model can better fuse multi-scale information during the decoding stage, improving the F1 score to 91.73% and the IoU to 84.72%. When only the RAF module is added, the model effectively suppresses pseudo-changes through differential refinement and domain adaptation, resulting in an F1 score of 91.84% and an IoU of 85.08%. When both modules are applied simultaneously, the performance is further enhanced, achieving an F1 score of 92.11% and an IoU of 85.38%, which are the best results across multiple metrics.

Table 4. Ablation experiments on the Levir-CD dataset. The best results are highlighted in bold. In the table, F stands for FusionR-Decoder and R stands for RAF.

To intuitively demonstrate the contribution of each module, we further provide difference visualizations computed from binarized change masks on the LEVIR-CD dataset. As shown in Figure 5, panel A shows the difference between the baseline and the model with only the FusionR-Decoder, illustrating where it alters predictions on fine-grained boundaries and local structures. Panel B presents the difference between the baseline and the model equipped with both RAF and the FusionR-Decoder, reflecting the overall effect of their collaboration. Panel C shows the difference between the model with both modules and the model with only the FusionR-Decoder, thereby isolating and visualizing the incremental contribution of RAF in cross-temporal feature refinement and suppression of spurious changes. These observations are consistent with the quantitative ablation results reported in Table 4.
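
A difference map of this kind can be obtained, for example, as the pixel-wise disagreement (XOR) between two binarized change masks; the helper below is a plausible reconstruction of that computation, while the green-outline rendering in Figure 5 is purely cosmetic.

```python
import numpy as np

def mask_difference(mask_a, mask_b):
    """Pixels where two binarized change masks disagree (XOR), as used for the Figure 5 panels."""
    return np.logical_xor(mask_a.astype(bool), mask_b.astype(bool))

# e.g., panel A: disagreement between the baseline and the FusionR-Decoder-only prediction
# diff_a = mask_difference(pred_baseline, pred_fusionr_only)
```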

Figure 5. Qualitative ablation using difference maps on the LEVIR-CD dataset; disagreements are outlined in green. (A) Difference map between the baseline model and the model using only the FusionR-Decoder. (B) Difference map between the baseline model and the model using both RAF and the FusionR-Decoder. (C) Difference map between the model using both RAF and the FusionR-Decoder and the model using only the FusionR-Decoder.

These results demonstrate that RAF and FusionR-Decoder play complementary roles in feature refinement and multi-scale decoding, respectively. Their combination enhances fine-grained boundary discrimination while maintaining global consistency, leading to more robust change detection.

5 Conclusion and future work

This paper focuses on two major challenges faced by FVMs in the RSCD task: insufficient inter-temporal feature interaction and the difficulty in balancing global consistency and detailed representation. To address these shortcomings, we propose a Feature Refinement and Fusion Decoding Framework (RoCD). This framework introduces a feature refinement module after the frozen FVM backbone. Through differential modeling and feature recalibration mechanisms, it enhances the feature interaction between the two temporal phases and effectively suppresses spurious changes caused by inter-domain differences. Simultaneously, a multi-scale fusion decoding structure is designed to recover detailed features stepwise under global consistency constraints, thereby achieving a balanced representation of global semantics and local boundaries.

Experimental results on multiple publicly available CD datasets demonstrate that RoCD outperforms mainstream baseline models in key metrics such as accuracy, F1 score, and IoU, validating its effectiveness in improving feature interaction and optimizing multi-scale representations. Overall, RoCD provides an effective approach to fully leveraging the potential of FVMs in modeling spatiotemporal variations. However, RoCD still relies on a frozen FVM backbone network; future work will explore dynamic adaptation and multimodal fusion. We plan to further explore its integration with large-scale prior knowledge to expand its application potential in multimodal and cross-regional scenarios.

Data availability statement

Publicly available datasets were analyzed in this study. The datasets analyzed are publicly available benchmark datasets for remote sensing change detection. Specifically: LEVIR-CD is available at https://justchenhao.github.io/LEVIR/ and WHU-CD is available at http://gpcv.whu.edu.cn/data/WHU-CD/. These datasets were used for experimental evaluation only; no new data were generated for this study.

Author contributions

ML: Writing – original draft, Writing – review and editing. TZ: Writing – original draft. YL: Writing – review and editing. RY: Writing – original draft. LW: Writing – review and editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bandara, W. G. C., and Patel, V. M. (2022). "A transformer-based siamese network for change detection," in IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 207–210. doi:10.1109/IGARSS46834.2022.9883686

Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Brunner, D., Bruzzone, L., and Lemoine, G. (2010). "Change detection for earthquake damage assessment in built-up areas using very high resolution optical and SAR imagery," in 2010 IEEE International Geoscience and Remote Sensing Symposium (IEEE), 3210–3213.

Buch, N., Velastin, S., and Orwell, J. (2011). A review of computer vision techniques for the analysis of urban traffic. IEEE Trans. Intelligent Transportation Systems 12, 920–939. doi:10.1109/TITS.2011.2119372

Chen, H., and Shi, Z. (2020). A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 12, 1662. doi:10.3390/rs12101662

Chen, H., Song, J., Han, C., Xia, J., and Yokoya, N. (2024). ChangeMamba: remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geoscience Remote Sens. 62, 1–20. doi:10.1109/TGRS.2024.3417253

Cheng, G., Huang, Y., Li, X., Lyu, S., Xu, Z., Zhao, H., et al. (2024). Change detection methods for remote sensing in the last decade: a comprehensive review. Remote Sens. 16, 2355. doi:10.3390/rs16132355

Dao, T., and Gu, A. (2024). "Transformers are SSMs: generalized models and efficient algorithms through structured state space duality," in International Conference on Machine Learning (ICML).

Ding, L., Zhu, K., Peng, D., Tang, H., Yang, K., and Bruzzone, L. (2024). Adapting segment anything model for change detection in HR remote sensing images. IEEE Trans. Geoscience Remote Sens. 62, 1–11. doi:10.1109/TGRS.2024.3368168

Fang, S., Li, K., Shao, J., and Li, Z. (2022). SNUNet-CD: a densely connected siamese network for change detection of VHR images. IEEE Geoscience Remote Sens. Lett. 19, 1–5. doi:10.1109/LGRS.2021.3056416

Gong, M., Zhao, J., Liu, J., Miao, Q., and Jiao, L. (2015). Change detection in synthetic aperture radar images based on deep neural networks. IEEE Trans. Neural Networks Learning Systems 27, 125–138. doi:10.1109/TNNLS.2015.2435783

Gu, L. (2024). Region change detection model for remote sensing images based on U-Net. Adv. Eng. Technol. Res. 10, 554. doi:10.56028/aetr.10.1.554.2024

Gu, A., and Dao, T. (2023). Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

Gu, A., Goel, K., and Ré, C. (2022). "Efficiently modeling long sequences with structured state spaces," in International Conference on Learning Representations (ICLR).

Han, C., Wu, C., Guo, H., Hu, M., and Chen, H. (2023). HANet: a hierarchical attention network for change detection with bi-temporal very-high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 16, 1–17. doi:10.1109/JSTARS.2023.3264802

Hao Chen, Z. Q., and Shi, Z. (2021). Remote sensing image change detection with transformers. IEEE Trans. Geoscience Remote Sens., 1–14. doi:10.1109/TGRS.2021.3095166

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16000–16009.

Huo, C., Chen, K., Ding, K., Zhou, Z., and Pan, C. (2016). Learning relationship for very high resolution image change detection. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 9, 3384–3394. doi:10.1109/JSTARS.2016.2569598

Huo, C., Chen, K., Zhang, S., Wang, Z., Yan, H., Shen, J., et al. (2025). When remote sensing meets foundation model: a survey and beyond. Remote Sens. 17, 179. doi:10.3390/rs17020179

Ji, S., Wei, S., and Lu, M. (2019). Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geoscience Remote Sens. 57, 574–586. doi:10.1109/TGRS.2018.2858817

Jiang, W., Sun, Y., Lei, L., Kuang, G., and Ji, K. (2024). Change detection of multisource remote sensing images: a review. Int. J. Digital Earth 17, 2398051. doi:10.1080/17538947.2024.2398051

Jung, S., Lee, W. H., and Han, Y. (2020). "Object-based change detection of VHR imagery based on extension of various pixel-based methods," in 40th Asian Conference on Remote Sensing: Progress of Remote Sensing Technology for Smart Future, ACRS 2019.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., et al. (2023). "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3992–4003. doi:10.1109/iccv51070.2023.00371

Liu, Y., Cheng, G., Sun, Q., Tian, C., and Wang, L. (2025). CWMamba: leveraging CNN-Mamba fusion for enhanced change detection in remote sensing images. IEEE Geoscience Remote Sens. Lett. 22, 1–5. doi:10.1109/LGRS.2025.3548145

Lu, D., Wang, L., Cheng, S., Li, Y., and Du, A. (2021). CANet: a combined attention network for remote sensing image change detection. Information 12, 364. doi:10.3390/info12090364

Lu, S., Guo, J., Zimmer-Dauphinee, J. R., Nieusma, J. M., Wang, X., Wernke, S. A., et al. (2025). Vision foundation models in remote sensing: a survey. IEEE Geoscience Remote Sens. Mag. 13, 190–215. doi:10.1109/mgrs.2025.3541952

Shi, S., Zhong, Y., Zhao, J., Lv, P., and Zhang, L. (2020). Land-use/land-cover change detection based on class-prior object-oriented conditional random field framework for high spatial resolution remote sensing imagery. IEEE Trans. Geoscience Remote Sens. PP, 1–16. doi:10.1109/tgrs.2020.3034373

Song, B., Chen, J., Shi, S., Yang, J., Chen, C., Qiao, K., et al. (2024). CD-MAE: contrastive dual-masked autoencoder pre-training model for PCB CT image element segmentation. Electronics 13, 1006. doi:10.3390/electronics13061006

Sundaresan, A., Varshney, P. K., and Arora, M. K. (2007). Robustness of change detection algorithms in the presence of registration errors. Photogrammetric Eng. Remote Sens. 73, 375–383. doi:10.14358/pers.73.4.375

Xie, Z., Wang, M., Han, Y., and Yang, D. (2019). "Hierarchical decision tree for change detection using high resolution remote sensing images," in Geo-informatics in Sustainable Ecosystem and Society. Editors Y. Xie, A. Zhang, H. Liu, and L. Feng (Singapore: Springer Singapore), 176–184.

Xiong, J., Liu, F., Wang, X., and Yang, C. (2024). Siamese transformer-based building change detection in remote sensing images. Sensors 24, 1268. doi:10.3390/s24041268

Zhang, C., Han, D., Qiao, Y., Kim, J. U., Bae, S.-H., Lee, S., et al. (2023). Faster segment anything: towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289.

Zhang, H., Chen, K., Liu, C., Chen, H., Zou, Z., and Shi, Z. (2025). CDMamba: incorporating local clues into Mamba for remote sensing image binary change detection. IEEE Trans. Geoscience Remote Sens. 63, 1–16. doi:10.1109/TGRS.2025.3545012

Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., et al. (2023). Fast segment anything. arXiv preprint arXiv:2306.12156.

Zhao, S., Chen, H., Zhang, X., Xiao, P., Bai, L., and Ouyang, W. (2024). RS-Mamba for large remote sensing image dense prediction. IEEE Trans. Geosci. Remote Sens. 62, 1–14. doi:10.1109/tgrs.2024.3425540

Keywords: remote sensing change detection, VFMs, SSMs, RAF, FusionR-Decoder

Citation: Li M, Zhao T, Liu Y, Yu R and Wang L (2026) RoCD: leveraging foundation vision models with refine-and-fuse framework for robust change detection. Front. Remote Sens. 6:1744950. doi: 10.3389/frsen.2025.1744950

Received: 12 November 2025; Accepted: 22 December 2025;
Published: 12 January 2026.

Edited by:

Guangliang Cheng, University of Liverpool, United Kingdom

Reviewed by:

Shuchang Lyu, Beihang University, China
Ye Yuan, Northeast Forestry University, China

Copyright © 2026 Li, Zhao, Liu, Yu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Lukun Wang, wanglukun@sdust.edu.cn
