
ORIGINAL RESEARCH article

Front. Med., 12 January 2026

Sec. Nuclear Medicine

Volume 12 - 2025 | https://doi.org/10.3389/fmed.2025.1744097

This article is part of the Research Topic: Fostering Trustworthy Artificial Intelligence in Medical Imaging.

A foundation model-driven multi-view collaborative framework for semi-supervised 3D medical image segmentation

  • 1Department of Radiology, First Hospital of Shanxi Medical University, Taiyuan, China
  • 2Department of Medical Imaging, Shanxi Medical University, Taiyuan, China
  • 3Department of Nuclear Medicine and Medical PET Center, The Second Hospital of Zhejiang University School of Medicine, Hangzhou, China

Background: 3D medical image segmentation is a cornerstone for quantitative analysis and clinical decision-making in various modalities. However, acquiring high-quality voxel-level annotations is both time-consuming and labor-intensive. Semi-supervised learning (SSL) provides an appealing solution by effectively utilizing limited labeled data along with abundant unlabeled data to enhance segmentation performance under clinical data constraints.

Methods: We propose a foundation model-driven multi-view collaborative learning framework that exploits zero-shot capabilities of SAM-like foundation models to jointly learn from axial, sagittal, and coronal planes. A collaborative fusion module integrates complementary representations across views, enhancing 3D structural understanding and improving the performance with limited annotation cost.

Results: Extensive experiments on two evaluation datasets including MRI brain tumor segmentation and whole-body PET heart segmentation demonstrate that our proposed method consistently outperforms existing SAM-based semi-supervised approaches. The multi-view collaborative design not only refines boundary precision for organ and tumor delineation but also shows strong transferability across imaging modalities.

Conclusion: This study presents a foundation model-driven, multi-view collaborative learning paradigm that efficiently advances semi-supervised 3D medical image segmentation, providing a scalable and clinically meaningful solution that reduces annotation dependency while maintaining high segmentation accuracy across diverse medical imaging modalities.

1 Introduction

Three-dimensional (3D) medical imaging is widely used to create cross-sectional volumetric images of various body regions and plays a fundamental role in quantitative analysis, computer-assisted diagnosis, and treatment planning across diverse imaging modalities (1, 2). It enables the delineation of anatomical structures and pathological regions within volumetric data, facilitating accurate disease localization and progression monitoring (3, 4). For example, brain tumor segmentation from magnetic resonance imaging (MRI) is a crucial step in determining the location and size of tumor regions, and plays a vital role in surgical planning, treatment design, efficacy evaluation, and longitudinal monitoring (5, 6). The complexity of brain tumors, characterized by significant heterogeneity in invasiveness, prognosis, and tissue characteristics, poses a challenge to accurate segmentation (7). In addition to MRI, positron emission tomography (PET) has emerged as another important 3D imaging modality that provides quantitative insights into physiological and metabolic processes (8, 9). PET heart segmentation plays a key role in cardiac function assessment, lesion detection, and treatment response evaluation, enabling clinicians to localize metabolic abnormalities and measure tissue activity noninvasively.

Existing clinical practice relies on the manual delineation of tumor or lesion boundaries by expert radiologists and radiation oncologists, which is not only exceptionally time-consuming and labor-intensive but also subject to significant inter- and intra-observer variability, potentially affecting the consistency of patient care. The clear need for automated, objective, and efficient segmentation tools has driven extensive research into deep learning-based solutions. Recently, deep learning-based segmentation methods have shown significant improvements on many medical image segmentation tasks (10–12). However, the development of robust and generalizable deep learning models is fundamentally constrained by a persistent data bottleneck: these models typically require vast amounts of high-quality, pixel-wise annotated data for training (13). Manually annotating medical images at the pixel level is a costly and time-consuming process that requires experienced clinical experts, which significantly hinders the practical deployment of medical image segmentation models in real-world scenarios (14).

To address this data scarcity challenge, semi-supervised medical image segmentation (SSMIS) has emerged as a compelling paradigm (15). SSMIS aims to build accurate segmentation models by leveraging a small amount of labeled data in conjunction with a large amount of unlabeled data.

Most state-of-the-art SSMIS methods rely on two foundational schemes: pseudo labeling and unsupervised regularization. For pseudo labeling, the core idea is to generate reliable pseudo labels for unlabeled data from pre-trained or dynamically updated models and use them as weakly supervised data (16–18), with quality-control strategies such as confidence thresholding (19, 20) and adaptive refinement (21, 22). In contrast, unsupervised regularization does not generate pseudo labels but imposes invariance constraints to learn robust features from unlabeled data (23–26). One commonly used strategy is consistency learning, which enforces consistent predictions for perturbed unlabeled images (27–30). However, the performance of these methods inherently depends on knowledge transfer from labeled to unlabeled data. As such, most methods struggle to achieve satisfactory outcomes when faced with extremely limited labeling budgets.

Concurrently, the field of artificial intelligence has been revolutionized by the advent of large-scale, pre-trained foundation models. In computer vision, the Segment Anything Model (SAM) (31) represents a landmark achievement. Trained on an unprecedented scale of over one billion masks from 11 million natural images, SAM exhibits remarkable zero-shot and few-shot generalization across a diverse range of segmentation tasks. Recently, some approaches have attempted to integrate SAM (31) into SSMIS frameworks. Several medical SAM adaptations have also emerged to enhance performance on medical images, given the distribution gap between natural and medical images (32, 33). Although SAM requires prompts for interactive segmentation, integrating it into SSMIS frameworks can build a robust automatic segmentation pipeline that effectively exploits unlabeled data using foundation models (34, 35). Existing approaches mainly generate point prompts from the prediction regions of the SSMIS model to enable automatic segmentation with SAM. Despite these efforts, pseudo labels often contain noisy regions, and directly generating prompts from such pseudo labels can introduce inaccuracies into the learning procedure (36).

Unlike 2D images, 3D medical image data are inherently multi-dimensional and can capture anatomical structures from multiple perspectives. Leveraging this multi-view information is crucial for several reasons. First, the axial, coronal, and sagittal views provide complementary information about the volumetric relationships and morphological characteristics of anatomical structures; fusing these views leads to a more comprehensive understanding of the underlying anatomy, which is essential for accurate segmentation (37). In addition, multi-view fusion can mitigate the impact of noise and artifacts that may be present in one view but not in others. By integrating information from multiple perspectives, the model becomes more robust to such imperfections, leading to more reliable segmentation and analysis (38).

In this paper, we propose a foundation model-driven multi-view collaborative learning framework that extends SAM-like foundation models to jointly learn from the axial, sagittal, and coronal planes and fuse their complementary features to enhance the learning process. We aim to improve the quality of pseudo labels and thereby the overall performance of the learning procedure. The core innovation lies in the first extension of SAM-like foundation models to multi-view (axial, sagittal, and coronal) collaborative learning for 3D medical images. We validate the effectiveness of our framework on two evaluation datasets: MRI brain tumor segmentation (39) and whole-body PET heart segmentation (9). The experimental results demonstrate that our proposed method significantly improves the 3D medical segmentation performance of existing SAM-based semi-supervised methods, highlighting its potential to advance treatment planning with minimal manual annotation effort for developing artificial intelligence algorithms.

2 Materials and methods

2.1 Task definition

In the semi-supervised 3D medical image segmentation (SSMIS) task, the dataset consists of two parts: a small set of labeled images and a large set of unlabeled images. The labeled subset provides voxel-wise annotations that identify specific anatomical or pathological structures, such as brain tumors in MRI scans. These annotations are essential for guiding supervised learning but are limited in number due to the high cost and expertise required for manual labeling. In contrast, the unlabeled subset contains a large number of medical images without annotations, which still carry rich spatial and contextual information that can be exploited to improve model generalization.

The goal of the SSMIS task is to develop a segmentation model that can accurately identify target regions by jointly learning from both labeled and unlabeled data. Let x denote a 3D medical image and ŷ = fθ(x) represent the predicted segmentation map generated by a model fθ parameterized by θ. The model is optimized by combining the supervised learning objective on labeled data with a regularization or consistency term on unlabeled data, formulated as:

Ltotal=Lsup+λLunsup,    (1)

where Lsup measures the segmentation error against available annotations, Lunsup enforces prediction consistency or feature regularization on unlabeled data, and λ controls the balance between the two losses.
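For concreteness, the following PyTorch-style sketch shows one way Equation 1 can be assembled, assuming a voxel-wise cross-entropy supervised term and a mean-squared-error consistency term on unlabeled data; the tensor shapes and the specific loss choices are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def ssmis_total_loss(student_logits_lab, labels, student_logits_unl,
                     teacher_logits_unl, lam):
    """Minimal sketch of L_total = L_sup + lambda * L_unsup (Equation 1).

    Logits have shape (B, C, D, H, W); `labels` holds voxel-wise class indices
    of shape (B, D, H, W) for the labeled mini-batch.
    """
    # Supervised term: segmentation error against the available annotations.
    l_sup = F.cross_entropy(student_logits_lab, labels)
    # Unsupervised term: consistency between student and teacher predictions
    # on unlabeled data (one common choice of L_unsup).
    l_unsup = F.mse_loss(torch.softmax(student_logits_unl, dim=1),
                         torch.softmax(teacher_logits_unl, dim=1))
    return l_sup + lam * l_unsup
```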

In summary, the task aims to leverage the complementary strengths of limited labeled and abundant unlabeled 3D medical images to achieve accurate and robust segmentation performance. This semi-supervised strategy is particularly valuable for clinical applications, where annotated datasets are scarce, but large volumes of unlabeled MRI or PET images are readily available.

2.2 Overview of SemiSAM

Semi-supervised 3D medical image segmentation (SSMIS) aims to effectively exploit both limited labeled data and abundant unlabeled data to enhance segmentation performance. A representative paradigm in this domain is the Mean Teacher framework (40), which maintains two networks with identical architectures but different update strategies: a student model and a teacher model. The student model is directly optimized by gradient descent on the labeled data using a supervised loss Lsup, while the teacher model serves as a temporal ensemble of the student model, updated by an exponential moving average (EMA) of its weights (41). For each unlabeled image, the teacher model generates pseudo-labels ỹ, which are then used to enforce prediction consistency between the student and teacher networks. This strategy encourages smooth and stable decision boundaries in the feature space, effectively leveraging unlabeled data to improve model generalization. However, when labeled data are extremely scarce, the pseudo-labels generated by the teacher model can be inaccurate or incomplete, limiting performance gains (42). To address this issue, SemiSAM (34) introduces a foundation model-driven enhancement by integrating the Segment Anything Model (SAM) into the Mean Teacher framework. Specifically, the teacher model produces coarse segmentation maps on unlabeled data, which are used to generate point- or box-based prompts for SAM. Leveraging its strong zero-shot generalization capability, SAM produces refined pseudo-labels ỹSAM that provide more reliable supervision signals. These SAM-derived pseudo-labels are incorporated into the training pipeline through an additional regularization term that aligns the student model's prediction with SAM's output:

Ltotal=Lsup+λ1Lcon+λ2LSAM,    (2)

where LSAM enforces consistency between the segmentation network and SAM predictions, and λ1, λ2 are weighting coefficients.
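As an illustration of how the teacher is maintained and how Equation 2 combines the three terms, a minimal sketch is given below; the EMA decay value of 0.99 and the function signatures are assumptions for illustration, not taken from the SemiSAM code.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Update teacher weights as an exponential moving average of the student weights."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(alpha).add_(s_p.data, alpha=1.0 - alpha)

def semisam_total_loss(l_sup, l_con, l_sam, lambda1, lambda2):
    """L_total = L_sup + lambda1 * L_con + lambda2 * L_SAM (Equation 2)."""
    return l_sup + lambda1 * l_con + lambda2 * l_sam
```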

By combining the complementary strengths of teacher-student consistency learning and SAM's generalizable segmentation priors, SemiSAM enhances pseudo-label quality and stabilizes training under low-annotation regimes. This makes the framework particularly effective in challenging clinical scenarios, such as when only one or a few labeled 3D volumes are available, significantly improving both the accuracy and robustness of semi-supervised medical image segmentation.

2.3 Proposed multi-view collaborative framework

In 3D medical imaging, anatomical structures can be visualized from multiple orthogonal planes, typically the axial, sagittal, and coronal views. Each view provides complementary contextual and structural information that contributes to a more complete understanding of the underlying anatomy. To effectively exploit this multi-view information, we propose a multi-view collaborative SemiSAM (MVC-SemiSAM) framework, as illustrated in Figure 1.

Figure 1
Machine learning diagram illustrating a process involving an image encoder and mask decoder. Three MRI views—axial, sagittal, and coronal—undergo rotations and are encoded. A prompt encoder aids in guiding segmentation using a student-teacher model with exponential moving average (EMA). Final outputs show masked MRI structures, with fusion calculations enhancing results.

Figure 1. The overall architecture of the proposed multi-view collaborative SemiSAM (MVC-SemiSAM) framework. The framework applies a SAM-like foundation model to the axial, sagittal, and coronal planes and fuses their complementary features to enhance the learning process, with the aim of improving pseudo-label quality and thereby the overall performance of the learning procedure.

Building upon the SemiSAM paradigm, MVC-SemiSAM extends the segmentation process from a single-view setting to a collaborative multi-view learning scheme. Specifically, separate SemiSAM branches are established for each view, denoted as v∈{axial, sagittal, coronal}. Each branch performs independent inference with view-specific prompts and pseudo-label generation, enabling the model to capture unique textural and spatial cues present in that particular orientation. This design preserves the complementary nature of different projections while maintaining the independence necessary for robust feature learning.

After obtaining view-specific segmentation outputs ŷv, MVC-SemiSAM introduces a collaborative fusion mechanism to aggregate these predictions and generate unified supervision for the mean teacher framework. The multi-view fusion is expressed as:

ŷfuse = F({ŷv}) = ∑v∈V wv · ŷv,  where V = {axial, sagittal, coronal}    (3)

where F(·) denotes a view-collaboration function that integrates information across all views via averaging, confidence-weighted fusion, or attention-based refinement. This fused prediction is then used to guide both the student and teacher models through consistency constraints, encouraging the network to maintain cross-view coherence and spatial continuity. Accordingly, the total loss function for MVC-SemiSAM can be formulated as

Ltotal=Lsup(fθ(x),y)+λ1Lcon(fθ(x),fθ'(x))+λ2LSAM(y^fuse,fθ(x)),    (4)

where Lsup is the supervised loss on labeled data, Lcon is the consistency loss with the teacher model, LSAM enforces agreement between the student's prediction and the multi-view fused pseudo-label ŷfuse, and λ1, λ2 are weighting coefficients.
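A minimal sketch of the collaborative fusion in Equation 3 is shown below, assuming the view-specific probability maps have already been resampled back to a common orientation; the dictionary interface and function name are illustrative assumptions.

```python
def fuse_multiview(view_preds, weights=None):
    """Weighted fusion of view-specific predictions (Equation 3).

    `view_preds` maps view names ("axial", "sagittal", "coronal") to probability
    maps of identical shape, e.g. (B, C, D, H, W) tensors or numpy arrays.
    """
    views = list(view_preds.keys())
    if weights is None:
        # Default: simple averaging, w_v = 1 / |V| (the setting used in our experiments).
        weights = {v: 1.0 / len(views) for v in views}
    fused = sum(weights[v] * view_preds[v] for v in views)
    return fused

# Example usage with uniform weights over the three orthogonal planes:
# y_fuse = fuse_multiview({"axial": p_a, "sagittal": p_s, "coronal": p_c})
```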

By leveraging complementary information across orthogonal planes, MVC-SemiSAM effectively mitigates the influence of artifacts, noise, or partial occlusions that may occur in a single view. The collaborative learning strategy also enhances pseudo-label reliability and promotes volumetric consistency during semi-supervised training. In this way, the proposed framework achieves more accurate and stable segmentation results across 3D medical imaging modalities.

3 Results

3.1 Dataset and evaluation metrics

Our proposed method is evaluated on two different datasets. The first is the Brain Tumor Segmentation (BraTS) 2019 dataset (39), which contains multi-institutional preoperative MRI of 335 glioma patients, where each patient has four MRI modalities (T1, T1Gd, T2, and FLAIR) with neuroradiologist-examined labels. An example of the dataset is shown in Figure 2. For our research, we focus on the semi-supervised segmentation of whole tumors using FLAIR MRI images, as FLAIR is particularly effective in characterizing malignant tumors due to its ability to highlight areas of brain edema and tumor infiltration, making it a preferred modality for this task (43). The second dataset is the PET heart segmentation dataset derived from AutoPET-Organ (9), which contains 100 FDG-PET images with expert-examined annotations of the heart for quantitative evaluation of cardiac metabolism. All scans are resampled to the same resolution of 1 × 1 × 1 mm³ with intensity normalized to zero mean and unit variance.

In our experiments, we split the MRI dataset into 250 scans for training, 25 scans for validation, and the remaining 60 scans for testing, and the PET dataset into 40 scans for training, 10 scans for validation, and 50 scans for testing. Among the training scans, we follow the design of Zhang et al. (42), using the same 1/2/3/5 scans as labeled data and the remaining scans as unlabeled data.

To balance the supervised segmentation loss and the consistency regularization terms, we employ time-dependent weighting strategies: a sigmoid-like ramp-up coefficient λ1 = 0.1·e^(−5(1−t/tmax)) mitigates the disturbance of the consistency loss during the early training stages, and a ramp-down coefficient λ2 = 0.1·e^(−5·t/tmax) leverages the strong zero-shot capability of the foundation model while preventing potential negative transfer in later stages, where t denotes the current iteration and tmax the maximum number of iterations. We use average fusion of the results from the multiple views, i.e., wv = 1/3.

To quantitatively evaluate the segmentation results, we use four complementary evaluation metrics. The Dice similarity coefficient (Dice) and Jaccard index (Jaccard), two region-based metrics, measure region mismatch. The average surface distance (ASD) and 95% Hausdorff distance (95HD), two boundary-based metrics, evaluate boundary errors between the segmentation results and the ground truth.
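The time-dependent weighting coefficients can be computed as in the sketch below, which follows the expressions exactly as printed above; note that the classic sigmoid ramp-up of Laine and Aila squares the (1 − t/tmax) term, so the precise exponent should be treated as an assumption to verify against the released code.

```python
import math

def lambda1_rampup(t, t_max, base=0.1):
    """Ramp-up of the consistency weight: lambda1 = 0.1 * exp(-5 * (1 - t / t_max))."""
    return base * math.exp(-5.0 * (1.0 - t / t_max))

def lambda2_rampdown(t, t_max, base=0.1):
    """Ramp-down of the SAM-guidance weight: lambda2 = 0.1 * exp(-5 * t / t_max)."""
    return base * math.exp(-5.0 * t / t_max)
```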

Figure 2
Tiled images of brain MRIs in different modalities: T1 MRI, T2 MRI, T1Gd MRI, FLAIR MRI, and a Tumor Mask. The first four images show different contrasts and details of brain structures, while the Tumor Mask highlights a region in red, indicating the presence of a tumor.

Figure 2. Comparison of different MRI modalities with corresponding annotation mask for brain tumor segmentation.

3.2 Implementation details

To implement SSMIS for brain tumor segmentation, we use the official codebase of SemiSAM+ in our experiments. Following the same setting, we use SAM-Med3D (44) as the SAM backbone. We use the stochastic gradient descent (SGD) optimizer to update the network parameters with an initial learning rate of 0.01, decayed by 0.1 every 2,500 iterations. The maximum number of training iterations is set to 6,000. The batch size is set to 2, consisting of one labeled image and one unlabeled image in each mini-batch. We randomly crop 128 × 128 × 128 sub-volumes as the network input, and the final segmentation results are obtained using a sliding window strategy. We apply standard data augmentation on-the-fly to avoid overfitting during training, including random flipping and rotation by 90, 180, and 270 degrees in the axial plane (38).
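For illustration, a minimal numpy sketch of the on-the-fly augmentation (random flipping and axial-plane rotation by multiples of 90 degrees) is given below; the (D, H, W) axis convention and the flip probability are assumptions of this sketch, not taken from the released code.

```python
import numpy as np

def augment(volume, label, rng=None):
    """Random flip plus 90/180/270-degree rotation in the axial plane."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                              # random flip along one spatial axis
        axis = int(rng.integers(0, 3))
        volume = np.flip(volume, axis=axis).copy()
        label = np.flip(label, axis=axis).copy()
    k = int(rng.integers(0, 4))                         # number of 90-degree rotations
    volume = np.rot90(volume, k, axes=(1, 2)).copy()    # rotate within the (H, W) axial plane
    label = np.rot90(label, k, axes=(1, 2)).copy()
    return volume, label
```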

3.3 Experimental setup

To validate the effectiveness of our proposed framework, we compare it against different SSMIS implementations. All methods were evaluated under the same protocol using 1, 2, 3, and 5 labeled scans, respectively, with the remainder of the training scans serving as unlabeled data. The compared methods include (1) baseline: a standard supervised model trained using only the limited labeled data, without any SSL or SAM-based components. This represents the lower bound performance. (2) w/o SemiSAM: this is the classic Mean Teacher framework (40), a widely-used SSL method. It serves to quantify the performance gain from a standard SSL approach that leverages unlabeled data through consistency regularization. (3) SemiSAM (S/A/C): this represents the SemiSAM method (34), where the Mean Teacher framework is augmented with pseudo-labels generated by the Segment Anything Model (SAM). To establish strong baselines and for the purpose of ablation study, we apply SemiSAM independently on each of the three anatomical 3D views: Sagittal (S), Axial (A), and Coronal (C). (4) MVC-SemiSAM: our proposed Multi-view Collaborative SemiSAM framework. This method ensembles the outputs from the three independent view-specific SemiSAM branches to collaboratively guide the learning of the mean teacher framework, aiming to produce more robust and accurate segmentations.

3.4 Experimental results on MRI dataset

Table 1 summarizes the quantitative performance of all compared methods under varying numbers of labeled cases on the MRI brain tumor segmentation task. Overall, MVC-SemiSAM consistently outperforms the baseline and existing SemiSAM variants across all metrics, demonstrating the effectiveness of multi-view collaborative learning in semi-supervised 3D segmentation. With only one labeled case, the baseline model achieves a Dice score of 42.82% and a Jaccard index of 29.01%, indicating poor segmentation performance due to extremely limited supervision. Incorporating SemiSAM significantly improves performance, with Dice scores reaching 67%–68% and Jaccard indices around 53%–56%, reflecting the benefit of SAM-generated pseudo-labels in providing additional supervision. Among these, MVC-SemiSAM further improves the Dice score to 71.63% and the Jaccard index to 59.50%, while simultaneously reducing the 95HD to 33.63 voxels and the ASD to 13.52 voxels. These results suggest that multi-view fusion not only enhances overlap-based accuracy but also improves boundary delineation, mitigating the effects of noise or ambiguous structures present in a single view.

As the number of labeled cases increases, all methods exhibit gradual performance improvements. With two labeled cases, MVC-SemiSAM achieves a Dice of 74.35% and a Jaccard of 62.29%, outperforming the best SemiSAM variant by approximately 1%–2%. Notably, the boundary-based metrics (95HD of 23.45 voxels, ASD of 8.07 voxels) also indicate more precise contour delineation, highlighting the model's ability to maintain volumetric consistency across the three views. This trend continues with three and five labeled cases, where MVC-SemiSAM achieves Dice scores of 76.08% and 77.91%, respectively, while keeping 95HD and ASD consistently lower than all baselines.

Comparing the single-view SemiSAM variants (S, A, C) with MVC-SemiSAM, it is evident that multi-view collaboration contributes substantial gains. While the single-view variants already improve pseudo-label quality via SAM, they remain susceptible to view-specific artifacts or local ambiguities. The multi-view fusion in MVC-SemiSAM effectively aggregates complementary information from the axial, sagittal, and coronal views, resulting in more reliable and consistent segmentation predictions.

Figure 3 presents a visual comparison of brain tumor segmentation results obtained by different methods on representative MRI slices. From left to right are the input image, the results of 3D U-Net (fully supervised baseline), Mean Teacher without SAM (w/o SemiSAM), single-view SemiSAM from the sagittal (S), axial (A), and coronal (C) planes, our proposed MVC-SemiSAM, and the ground truth (GT) tumor mask. The highlighted red regions indicate the predicted tumor areas. MVC-SemiSAM produces more accurate and complete tumor boundaries, closely matching the ground truth, while reducing false positives and missing regions compared with the other models.

Table 1

Table 1. The quantitative performance of all compared methods on MRI brain tumor segmentation.

Figure 3
Comparison of brain tumor segmentation methods using different techniques. The images display MRI scans with red overlays indicating tumor predictions. From left to right: original image, 3D U-Net FS prediction, without SemiSAM, SemiSAM (S), SemiSAM (A), SemiSAM (C), MVC-SemiSAM, and ground truth tumor mask. Each method shows varying levels of accuracy in highlighting the tumor regions.

Figure 3. Visual comparison of segmentation results of different methods for MRI brain tumor segmentation.

3.5 Experimental results on PET dataset

Following the MRI-based experiments, we further evaluate the generalization capability of our proposed MVC-SemiSAM framework on the whole-body PET heart segmentation task. The quantitative results summarized in Table 2 reveal several consistent and noteworthy findings.

First, similar to the MRI results, both semi-supervised learning (SSL) and SAM-based supervision demonstrate clear advantages over the fully supervised baseline trained with limited labeled data. The incorporation of unlabeled PET scans enables the Mean Teacher framework (w/o SemiSAM) to achieve a substantial improvement in Dice and Jaccard scores, confirming that semi-supervised consistency regularization effectively alleviates overfitting in low-annotation regimes.

Second, introducing SAM-generated pseudo-labels through the SemiSAM variants further enhances segmentation performance, particularly in terms of region completeness and boundary accuracy. The improvement is especially pronounced in the extremely low-label settings (e.g., one or two labeled cases), where foundation model guidance helps the network capture organ boundaries that are difficult to infer from limited labeled examples alone.

Most importantly, our proposed MVC-SemiSAM achieves the highest performance across nearly all metrics and annotation levels. The collaborative fusion of multi-view predictions yields more stable and coherent segmentation results, as evidenced by consistently higher Dice and Jaccard scores and reduced boundary errors (95HD and ASD). This validates the robustness of the proposed framework in handling modality-specific noise and anatomical variability inherent in PET imaging.

Overall, the PET experiments further confirm the universality and transferability of MVC-SemiSAM across different 3D medical imaging modalities. By effectively leveraging unlabeled data and multi-view structural information, the framework provides a scalable solution applicable to both anatomical (MRI) and functional (PET) segmentation tasks.

Table 2

Table 2. The quantitative performance of all compared methods on PET heart segmentation.

Figure 4 illustrates the qualitative comparison of different methods on representative PET heart segmentation examples. Consistent with the quantitative results, the proposed MVC-SemiSAM produces segmentation maps that are visually closest to the ground truth, achieving more complete and coherent heart boundaries. These qualitative improvements highlight the effectiveness of the proposed multi-view collaborative mechanism in refining SAM-guided pseudo-labels and leveraging cross-view consistency to achieve anatomically precise PET segmentation. Together with the MRI findings, these results demonstrate the robustness and modality adaptability of MVC-SemiSAM for both structural and functional medical imaging tasks.

Figure 4
Four rows show heart segmentation comparisons using different techniques. The first row displays 3D heart models in red. The second and third rows show axial and coronal PET scan slices with red heart segmentation overlays. The fourth row lists the techniques: w/o SemiSAM, SemiSAM(A), MVC-SemiSAM, and GT Heart Mask.

Figure 4. Visual comparison of segmentation results of different methods for PET heart segmentation.

4 Discussion

This study investigates the integration of foundation model priors and semi-supervised learning for efficient 3D medical image segmentation under limited annotation scenarios. The experimental findings reveal several key insights into the advantages and generalizability of the proposed framework.

First, consistency-based SSL proves to be a reliable strategy for mitigating the dependence on extensive manual annotations (40). By jointly optimizing predictions from labeled and unlabeled data, the framework enhances the robustness and generalization of segmentation models, which is particularly valuable for clinical applications where annotation resources are scarce. The observed improvement over purely supervised baselines demonstrates the effectiveness of SSL in harnessing abundant unlabeled medical scans to improve 3D segmentation quality (15).

Integrating foundation models such as the Segment Anything Model (SAM) (31) into the semi-supervised learning pipeline introduces an additional layer of generalist prior knowledge. As foundation models are trained on large-scale annotated data, they provide shape priors, boundary cues, and prompt-driven flexibility that complement the limited domain-specific supervision available in medical datasets (44). Through pseudo-label generation and consistency constraints, SAM acts as an auxiliary teacher, effectively transferring its structural awareness into the medical domain. This cross-domain supervision leads to more anatomically coherent predictions, especially in regions with ambiguous intensity contrast or irregular boundaries (32).

The core innovation of our framework lies in its multi-view collaborative learning mechanism, which fully leverages the inherent 3D nature of volumetric imaging. By processing the sagittal, coronal, and axial planes independently and fusing their predictions, the framework captures complementary spatial information and mitigates artifacts or ambiguities that may arise from a single projection. This collaborative fusion substantially enhances model stability and improves boundary precision, demonstrating the value of multi-view reasoning for complex anatomical structures.

Importantly, the benefits of MVC-SemiSAM extend beyond a specific organ or imaging modality. The consistent performance across both anatomical MRI and functional PET modalities highlights its modality-agnostic and task-general nature, suggesting that MVC-SemiSAM is not limited to neuroimaging but can serve as a unified strategy for 3D medical image segmentation. Moreover, the framework's modular design allows seamless adaptation to other semi-supervised paradigms and foundation models. Future iterations could integrate domain-specific vision-language models or multi-modal priors to further enhance generalization and interpretability. By bridging the gap between foundation model priors and domain-aware multi-view reasoning, MVC-SemiSAM offers a promising direction toward universal, label-efficient segmentation solutions in medical imaging.

5 Conclusion

This study introduces a foundation model-driven, multi-view collaborative learning framework that advances semi-supervised 3D medical image segmentation under extremely limited annotation conditions. By synergistically combining the generalist prior knowledge of SAM, the data efficiency of consistency-based semi-supervised learning, and the inherent robustness of multi-view representation, MVC-SemiSAM provides a label-efficient yet highly accurate segmentation framework. Validated on both MRI and PET modalities, the proposed method demonstrates strong adaptability and clinical relevance, reducing annotation dependency while maintaining reliable segmentation quality across different imaging domains. Its modular and scalable design offers a practical foundation for future extensions, such as integrating multimodal data or domain-adaptive foundation models for universal medical image understanding.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found at: https://www.med.upenn.edu/cbica/brats-2019/.

Author contributions

LL: Conceptualization, Investigation, Writing – original draft. BW: Conceptualization, Investigation, Writing – review & editing. HZ: Conceptualization, Investigation, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This study was supported by the Shanxi Provincial Basic Research Program Fund (Grant no. 202203021222375).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Van Ginneken B, Schaefer-Prokop CM, Prokop M. Computer-aided diagnosis: how to move from the laboratory to the clinic. Radiology. (2011) 261:719–32. doi: 10.1148/radiol.11091710

PubMed Abstract | Crossref Full Text | Google Scholar

2. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. (2017) 42:60–88. doi: 10.1016/j.media.2017.07.005

PubMed Abstract | Crossref Full Text | Google Scholar

3. Ma J, Zhang Y, Gu S, Zhu C, Ge C, Zhang Y, et al. AbdomenCT-1K: is abdominal organ segmentation a solved problem? IEEE Trans Pattern Anal Mach Intell. (2022) 44:6695–714. doi: 10.1109/TPAMI.2021.3100536

PubMed Abstract | Crossref Full Text | Google Scholar

4. Lalande A, Chen Z, Pommier T, Decourselle T, Qayyum A, Salomon M, et al. Deep learning methods for automatic evaluation of delayed enhancement-MRI. The results of the EMIDEC challenge. Med Image Anal. (2022) 79:102428. doi: 10.1016/j.media.2022.102428

PubMed Abstract | Crossref Full Text | Google Scholar

5. LaBella D, Adewole M, Alonso-Basanta M, Altes T, Anwar SM, Baid U, et al. The ASNR-MICCAI brain tumor segmentation (BraTS) challenge 2023: intracranial meningioma. arXiv [preprint]. (2023). arXiv:2305.07642. doi: 10.48550/arXiv.2305.07642

PubMed Abstract | Crossref Full Text | Google Scholar

6. Moawad AW, Janas A, Baid U, Ramakrishnan D, Saluja R, Ashraf N, et al. The brain tumor segmentation-metastases (BraTS-METS) challenge 2023: brain metastasis segmentation on pre-treatment MRI. arXiv [preprint]. (2024) arXiv:2306.00838. doi: 10.48550/arXiv.2306.00838

Crossref Full Text | Google Scholar

7. Zaitout Z, Romanowski C, Karunasaagarar K, Connolly D, Batty R. A review of pathologies associated with high T1W signal intensity in the basal ganglia on magnetic resonance imaging. Pol J Radiol. (2014) 79:126. doi: 10.12659/PJR.890043

PubMed Abstract | Crossref Full Text | Google Scholar

8. Oreiller V, Andrearczyk V, Jreige M, Boughdad S, Elhalawani H, Castelli J, et al. Head and neck tumor segmentation in PET/CT: the HECKTOR challenge. Med Image Anal. (2022) 77:102336. doi: 10.1016/j.media.2021.102336

PubMed Abstract | Crossref Full Text | Google Scholar

9. Zhang Y, Xue L, Zhang W, Li L, Liu Y, Jiang C, et al. SegAnyPET: universal promptable segmentation from positron emission tomography images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025). p. 21107–16.

Google Scholar

10. Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, et al. Brain tumor segmentation with deep neural networks. Med Image Anal. (2017) 35:18–31. doi: 10.1016/j.media.2016.05.004

PubMed Abstract | Crossref Full Text | Google Scholar

11. Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging. (2016) 35:1240–51. doi: 10.1109/TMI.2016.2538465

PubMed Abstract | Crossref Full Text | Google Scholar

12. Yang J, Qiu P, Zhang Y, Marcus DS, Sotiras A. D-net: dynamic large kernel with dynamic feature fusion for volumetric medical image segmentation. Biomed Signal Process Control. (2026) 113:108837. doi: 10.1016/j.bspc.2025.108837

Crossref Full Text | Google Scholar

13. Tajbakhsh N, Jeyaseelan L, Li Q, Chiang JN, Wu Z, Ding X. Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med Image Anal. (2020) 63:101693. doi: 10.1016/j.media.2020.101693

PubMed Abstract | Crossref Full Text | Google Scholar

14. Shi Y, Ma J, Yang J, Wang S, Zhang Y. Beyond pixel-wise supervision for medical image segmentation: from traditional models to foundation models. arXiv [preprint]. (2024). arXiv:2404.13239. doi: 10.48550/arXiv.2404.13239

Crossref Full Text | Google Scholar

15. Jiao R, Zhang Y, Ding L, Xue B, Zhang J, Cai R, et al. Learning with limited annotations: a survey on deep semi-supervised learning for medical image segmentation. Comput Biol Med. (2024) 169:107840. doi: 10.1016/j.compbiomed.2023.107840

PubMed Abstract | Crossref Full Text | Google Scholar

16. Thompson BH, Di Caterina G, Voisey JP. Pseudo-label refinement using superpixels for semi-supervised brain tumor segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). Kolkata, India: IEEE (2022). p. 1–5. doi: 10.1109/ISBI52829.2022.9761681

Crossref Full Text | Google Scholar

17. Yao H, Hu X, Li X. Enhancing pseudo label quality for semi-supervised domain-generalized medical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36 (2022). p. 3099–107. doi: 10.1609/aaai.v36i3.20217

Crossref Full Text | Google Scholar

18. Zeng Q, Xie Y, Lu Z, Xia Y. PEFAT: boosting semi-supervised medical image classification via pseudo-loss estimation and feature adversarial training. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC (2023). p. 15671–80. doi: 10.1109/CVPR52729.2023.01504

Crossref Full Text | Google Scholar

19. Li Y, Chen J, Xie X, Ma K, Zheng Y. Self-loop uncertainty: a novel pseudo-label for semi-supervised medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science, Vol. 12261. Cham: Springer (2020). p. 614–23. doi: 10.1007/978-3-030-59710-8_60

Crossref Full Text | Google Scholar

20. Lu L, Yin M, Fu L, Yang F. Uncertainty-aware pseudo-label and consistency for semi-supervised medical image segmentation. Biomed Signal Process Control. (2023) 79:104203. doi: 10.1016/j.bspc.2022.104203

Crossref Full Text | Google Scholar

21. Zheng B, Zhao W, Liu W, He Z, Qin C, Yang H. Semi-supervised medical image segmentation via pseudo-labeling refinement and dual-adaptive adjustment schemes. Pattern Recognit. (2025) 171:112310. doi: 10.1016/j.patcog.2025.112310

Crossref Full Text | Google Scholar

22. Ma J, Nie Z, Wang C, Dong G, Zhu Q, He J, et al. Active contour regularized semi-supervised learning for COVID-19 CT infection segmentation with limited annotations. Phys Med Biol. (2020) 65:225034. doi: 10.1088/1361-6560/abc04e

PubMed Abstract | Crossref Full Text | Google Scholar

23. Zhang Y, Zhang J. Dual-task mutual learning for semi-supervised medical image segmentation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer (2021). p. 548–59. doi: 10.1007/978-3-030-88010-1_46

Crossref Full Text | Google Scholar

24. Luo X, Wang G, Liao W, Chen J, Song T, Chen Y, et al. Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency. Med Image Anal. (2022) 80:102517. doi: 10.1016/j.media.2022.102517

PubMed Abstract | Crossref Full Text | Google Scholar

25. Liao W, He J, Luo X, Wu M, Shen Y, Li C, et al. Automatic delineation of gross tumor volume based on magnetic resonance imaging by performing a novel semisupervised learning framework in nasopharyngeal carcinoma. Int J Radiat Oncol Biol Phys. (2022) 113:893–902. doi: 10.1016/j.ijrobp.2022.03.031

PubMed Abstract | Crossref Full Text | Google Scholar

26. Zhang Y, Jiao R, Liao Q, Li D, Zhang J. Uncertainty-guided mutual consistency learning for semi-supervised medical image segmentation. Artif Intell Med. (2023) 138:102476. doi: 10.1016/j.artmed.2022.102476

PubMed Abstract | Crossref Full Text | Google Scholar

27. Zeng Q, Xie Y, Lu Z, Lu M, Zhang J, Xia Y. Consistency-guided differential decoding for enhancing semi-supervised medical image segmentation. IEEE Trans Med Imaging. (2024) 44:44–56. doi: 10.1109/TMI.2024.3429340

PubMed Abstract | Crossref Full Text | Google Scholar

28. Zeng Q, Luo H, Ma X, Lu Z, Hu Y, Xia Y. Exploring text-enhanced mixture-of-experts for semi-supervised medical image segmentation with composite data. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2025. MICCAI 2025. Lecture Notes in Computer Science, Vol. 15965. Cham: Springer (2026). doi: 10.1007/978-3-032-04978-0_22

Crossref Full Text | Google Scholar

29. Zeng Q, Lu Z, Xie Y, Xia Y. PICK: predict and mask for semi-supervised medical image segmentation. Int J Comput Vision. (2025) 133:3296–311. doi: 10.1007/s11263-024-02328-9

Crossref Full Text | Google Scholar

30. Zeng Q, Xie Y, Lu Z, Lu M, Wu Y, Xia Y. Segment together: a versatile paradigm for semi-supervised medical image segmentation. IEEE Trans Med Imaging. (2025) 44:2948–59. doi: 10.1109/TMI.2025.3556310

PubMed Abstract | Crossref Full Text | Google Scholar

31. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023). p. 4015–26. doi: 10.1109/ICCV51070.2023.00371

Crossref Full Text | Google Scholar

32. Zhang Y, Shen Z, Jiao R. Segment anything model for medical image segmentation: current applications and future directions. Comput Biol Med. (2024) 171:108238. doi: 10.1016/j.compbiomed.2024.108238

PubMed Abstract | Crossref Full Text | Google Scholar

33. Ali M, Wu T, Hu H, Luo Q, Xu D, Zheng W, et al. A review of the segment anything model (SAM) for medical image analysis: accomplishments and perspectives. Comput Med Imaging Graph. (2025) 119:102473. doi: 10.1016/j.compmedimag.2024.102473

PubMed Abstract | Crossref Full Text | Google Scholar

34. Zhang Y, Yang J, Liu Y, Cheng Y, Qi Y. SemiSAM: enhancing semi-supervised medical image segmentation via sam-assisted consistency regularization. In: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Lisbon, Portugal: IEEE (2024). p. 3982–6. doi: 10.1109/BIBM62325.2024.10821951

Crossref Full Text | Google Scholar

35. Miao J, Chen C, Zhang K, Chuai J, Li Q, Heng PA. Cross prompting consistency with segment anything model for semi-supervised medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2024). p. 167–77. doi: 10.1007/978-3-031-72120-5_16

Crossref Full Text | Google Scholar

36. Yang D, Ji J, Ma Y, Guo T, Wang H, Sun X, et al. SAM as the guide: mastering pseudo-label refinement in semi-supervised referring expression segmentation. arXiv [preprint]. (2024). arXiv:2406.01451. doi: 10.48550/arXiv.2406.01451

Crossref Full Text | Google Scholar

37. Zhang Y, Liao Q, Ding L, Zhang J. Bridging 2D and 3D segmentation networks for computation-efficient volumetric medical image segmentation: an empirical study of 2.5D solutions. Comput Med Imaging Graph. (2022) 99:102088. doi: 10.1016/j.compmedimag.2022.102088

Crossref Full Text | Google Scholar

38. Yu L, Cheng JZ, Dou Q, Yang X, Chen H, Qin J, et al. Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convNets. In: Descoteaux M, Maier-Hein L, Franz A, Jannin P, Collins D, Duchesne S, editors. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2017. MICCAI 2017. Lecture Notes in Computer Science, Vol. 10434. Cham: Springer (2017). p. 287–95. doi: 10.1007/978-3-319-66185-8_33

Crossref Full Text | Google Scholar

39. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging. (2014) 34:1993–2024. doi: 10.1109/TMI.2014.2377694

PubMed Abstract | Crossref Full Text | Google Scholar

40. Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems (2017). arXiv:1703.01780. doi: 10.48550/arXiv.1703.01780

Crossref Full Text | Google Scholar

41. Haynes D, Corns S, Venayagamoorthy GK. An exponential moving average algorithm. In: 2012 IEEE Congress on Evolutionary Computation. Brisbane, QLD: IEEE (2012). p. 1–8. doi: 10.1109/CEC.2012.6252962

Crossref Full Text | Google Scholar

42. Zhang Y, Lv B, Xue L, Zhang W, Liu Y, Fu Y, et al. SemiSAM+: rethinking semi-supervised medical image segmentation in the era of foundation models. Med Image Anal. (2025) 106:103733. doi: 10.1016/j.media.2025.103733

PubMed Abstract | Crossref Full Text | Google Scholar

43. Zeineldin RA, Karar ME, Coburger J, Wirtz CR, Burgert O. DeepSeg: deep neural network framework for automatic brain tumor segmentation using magnetic resonance FLAIR images. Int J Comput Assist Radiol Surg. (2020) 15:909–20. doi: 10.1007/s11548-020-02186-z

PubMed Abstract | Crossref Full Text | Google Scholar

44. Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, et al. Sam-med3d: towards general-purpose segmentation models for volumetric medical images. In: European Conference on Computer Vision. Springer (2024). p. 51–67. doi: 10.1007/978-3-031-91721-9_4

Crossref Full Text | Google Scholar

Keywords: foundation model, medical image segmentation, multi-view learning, segment anything model, semi-supervised learning

Citation: Li L, Wang B and Zhang H (2026) A foundation model-driven multi-view collaborative framework for semi-supervised 3D medical image segmentation. Front. Med. 12:1744097. doi: 10.3389/fmed.2025.1744097

Received: 11 November 2025; Revised: 09 December 2025; Accepted: 16 December 2025;
Published: 12 January 2026.

Edited by:

Ke Zou, National University of Singapore, Singapore

Reviewed by:

Nan Zhou, Sichuan University, China
Qingjie Zeng, Northwestern Polytechnical University, China

Copyright © 2026 Li, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hong Zhang, hzhang21@zju.edu.cn
