Improving brain tumor segmentation with anatomical prior-informed pre-training

Introduction Precise delineation of glioblastoma in multi-parameter magnetic resonance images is pivotal for neurosurgery and subsequent treatment monitoring. Transformer models have shown promise in brain tumor segmentation, but their efficacy heavily depends on a substantial amount of annotated data. To address the scarcity of annotated data and improve model robustness, self-supervised learning methods using masked autoencoders have been devised. Nevertheless, these methods have not incorporated the anatomical priors of brain structures. Methods This study proposed an anatomical prior-informed masking strategy to enhance the pre-training of masked autoencoders, which combines data-driven reconstruction with anatomical knowledge. We investigate the likelihood of tumor presence in various brain structures, and this information is then utilized to guide the masking procedure. Results Compared with random masking, our method enables the pre-training to concentrate on regions that are more pertinent to downstream segmentation. Experiments conducted on the BraTS21 dataset demonstrate that our proposed method surpasses the performance of state-of-the-art self-supervised learning techniques. It enhances brain tumor segmentation in terms of both accuracy and data efficiency. Discussion Tailored mechanisms designed to extract valuable information from extensive data could enhance computational efficiency and performance, resulting in increased precision. It's still promising to integrate anatomical priors and vision approaches.


Introduction
Glioblastoma (GBM) is one of the most aggressive brain cancers among adults (1). Multi-parameter magnetic resonance imaging (MRI) provides valuable information for characterizing the size, invasiveness, and intrinsic heterogeneity of brain tumors (2, 3). Accurate delineation of GBM on multi-parameter MRI is crucial for clinical diagnosis and treatment, for example in planning surgery for maximal glioblastoma resection while preserving neurological function. However, the current clinical routine still relies on manual delineation, which is time-consuming and requires expert knowledge. There is a high demand for automatic brain tumor segmentation to enhance the efficiency of diagnostic procedures, facilitate surgical planning, and contribute to prognostic analyses (4).
In the last decade, there have been extensive studies on automatic brain tumor segmentation (5), and most of them are based on convolutional neural networks (CNNs) (6–8). However, due to their limited receptive field, CNNs often struggle to capture long-range dependencies and global context (9, 10), potentially leading to inaccurate segmentation predictions. The recent success of the transformer architecture in vision tasks (11, 12) has shown benefits in learning global contextual information. New network designs with vision transformers have emerged for medical image segmentation (13, 14) and achieved state-of-the-art (SOTA) performance in brain tumor segmentation (15–17). However, the supervised training of vision transformers typically requires a large amount of densely annotated images; otherwise, there is a high risk of overfitting.
To combat the challenge of data scarcity in medical image segmentation, self-supervised learning (SSL) has proven to be a promising solution (18). In general, a pretext SSL task is designed to pre-train the network using unannotated data, and the learned encoder weights are further optimized in the downstream segmentation task. Since no manual annotation is needed for SSL, it can be applied to large unannotated datasets. One of the most successful recent SSL frameworks is masked language modeling (MLM), which has achieved great success in numerous natural language processing tasks with transformer-based architectures (19–21). Motivated by MLM, masked image modeling (MIM) was proposed for pre-training vision transformers. In MIM, the model predicts masked image patches from unmasked patches. The prediction target can be either token features or raw pixel values of the masked patches. BEiT (22) utilizes a discrete variational autoencoder (dVAE) to transform all image patches into discrete tokens, which are then used to pre-train a vision transformer at the token level. However, tokenizing the image patches requires additional training of a dVAE. In contrast, He et al. (23) introduced the masked autoencoder (MAE), which randomly masks a subset of image patches and reconstructs the masked pixels from unmasked patches. The high masking ratio of MAE enables efficient pre-training of vision transformers with large unannotated datasets. The success of MAE has motivated a series of variants in vision tasks (24–27) and applications in medical image analysis using MIM techniques. For instance, Tang et al. (28) utilized masked inpainting for the pre-training of a Swin UNETR (Shifted-window UNet transformer) in abdominal segmentation tasks. Chen et al. (29) compared multiple MIM approaches in abdominal segmentation. Zhou et al. (30) applied MAE pre-training with UNETR (UNet Transformer) and obtained performance gains in both abdominal and brain tumor segmentation.
Building a masked image is a crucial step in MIM pre-training. As shown in Figure 1, the smallest masking unit of MLM methods such as BERT (19) is typically a word (vocabulary token), which preserves contextual information. Image patches have no counterpart to the words of MLM, so the random masking employed by MIM can disrupt the spatial context and break up regions with the same semantic meaning. This, in turn, makes it challenging for the representation learning process to obtain a high-quality pre-trained network, especially when the masking ratio is high. Recently, several studies demonstrated that the masking strategy has a substantial effect on model performance in downstream tasks (31, 32). Although random masking is widely used, recent advances have shown that appropriate masking strategies can achieve better performance, such as region-based masking (33), attention-based masking (34), and adaptive masking (AdaMAE) (31). These masking strategies take the patch context into account, leading to more effective and efficient pre-training.
In the context of medical images, anatomical knowledge could help improve the pre-training. Huang et al. (35) incorporated the symmetry characteristics of brain structures into the pre-training by constructing symmetric positional encodings. However, few studies have integrated the more precise brain atlas (36) into the masking strategy. Inspired by the performance gains achieved by weighted masking strategies, we propose an anatomical prior-informed masking strategy for MAE pre-training. We hypothesize that the tumor distribution among brain structures can guide the MAE pre-training, thereby improving the downstream brain tumor segmentation. To achieve this, we analyze the tumor occurrence in the SRI-24 space and establish an anatomical prior-informed probability map for image masking. This strategy allows us to select more informative patches for MAE pre-training. By combining the data-driven MAE with anatomical knowledge, we aim to improve the accuracy and data efficiency of brain tumor segmentation.
In this study, our contributions are as follows: (1) An anatomical prior-informed masking strategy is proposed to enhance the pre-training of the masked autoencoder. This strategy is designed to preserve contextual information in 3D medical images and allows the pre-training process to concentrate on regions that are more relevant to the downstream segmentation task. (2) By incorporating prior-informed weighted sampling, we construct an anatomical prior-informed masked autoencoder, referred to as API-MAE. This self-supervised pre-training approach utilizes 6,415 skull-stripped brain T1 MR images and combines data-driven reconstruction with anatomical priors. (3) Inheriting the pre-trained encoder weights, our method demonstrates superior performance in the downstream segmentation task on the BraTS21 dataset, outperforming several transformer models and surpassing state-of-the-art self-supervised learning methods. Subsequent experiments demonstrate that our method is more efficient than a regular masked autoencoder and maintains a satisfactory trade-off between segmentation accuracy and computational cost.

Methodology
Overview of proposed method
We propose a novel masking strategy for improved MAE pre-training and downstream brain tumor segmentation in

Registration to standard brain template
To represent the anatomical priors, we first align all images with the standard brain template. The DICOM image data are converted into NIfTI format, and the brain is extracted using FSL tools (37). After that, we transform each image into the SRI-24 standard space (36) via affine registration. Using the optimized affine transformation matrix $M^{*}$, all images are aligned in the SRI-24 space.
where $I_m$ represents the moving image, which corresponds to the MRI image of each sample. The fixed image, denoted as $I_f$, refers to the T1 template of the SRI-24 standard space. $C(I_m, I_f)$ is the cost function that quantifies the disparity between the moving image $I_m$ and the fixed image $I_f$ during the registration optimization; the correlation ratio is used (38). $\mathrm{Affine}(I; M)$ denotes the affine operation that maps the floating image $I$ to the fixed image using the affine matrix $M$, and $\hat{I}$ represents the output registered image.
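Written out under the definitions above, the registration can be stated as the following pair of equations. This is a reconstruction from the surrounding text, not necessarily the authors' exact notation; in particular, placing the transformed moving image as the first argument of the cost function is an assumption.

```latex
M^{*} = \arg\min_{M} \; C\big(\mathrm{Affine}(I_m; M),\, I_f\big),
\qquad
\hat{I} = \mathrm{Affine}(I_m; M^{*})
```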

Sampling weight map derived from brain tumor occurrence
We conduct a statistical analysis of the enhanced tumor (ET) across the BraTS21 dataset (39–41) and obtain a distribution map of ET occurrence in the SRI-24 standard space. To implement this analysis, we utilize a brain parcellation atlas built upon the parc116plus atlas (36). Some excessively small regions are merged into larger ones, resulting in 128 parcellation regions covering the entire skull-stripped brain. To obtain the sampling probability of each voxel, the average sampling probability for each parcellation is defined as follows: where $R_i$ represents the $i$-th brain parcellation, $P_{R_i}$ denotes the average sampling probability per volume of region $R_i$, $f_{i,j}$ is the occurrence frequency of the ET region in the $j$-th voxel within the $i$-th parcellation, $V_{R_i}$ represents the volume of $R_i$, and $N_{R_i}$ represents the number of voxels in $R_i$. Consequently, the sampling weight map $W$, depicted in Figure 3, can be generated by assigning every voxel within the parcellation region $R_i$ the identical probability value $P_{R_i}$.
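A minimal NumPy sketch of how such a region-wise weight map could be built is given below. It assumes that the per-region value is the plain mean of the voxel-wise ET frequency over the region, which may differ from the exact normalization used in the study; the function name `region_weight_map` and the toy arrays are illustrative only.

```python
import numpy as np

def region_weight_map(atlas, et_frequency):
    """Assign every voxel the mean ET frequency of its parcellation region.

    atlas        : integer label volume (0 = background, 1..K = regions)
    et_frequency : voxel-wise ET occurrence frequency, same shape as atlas
    """
    labels = atlas.ravel()
    freq = et_frequency.ravel().astype(np.float64)
    n_regions = int(labels.max()) + 1
    # Sum of f_{i,j} over the voxels j of each region i, and the voxel count.
    region_sum = np.bincount(labels, weights=freq, minlength=n_regions)
    region_count = np.bincount(labels, minlength=n_regions)
    # Labels with no voxels keep a value of 0 instead of dividing by zero.
    region_mean = np.divide(region_sum, region_count,
                            out=np.zeros(n_regions), where=region_count > 0)
    # Broadcast each region's value back to all of its voxels.
    return region_mean[labels].reshape(atlas.shape)

# Toy example: a 2x2x2 "atlas" with background 0 and two regions.
atlas = np.array([[[0, 1], [1, 1]], [[2, 2], [2, 0]]])
freq = np.array([[[0.0, 0.2], [0.4, 0.0]], [[0.9, 0.3], [0.0, 0.0]]])
weights = region_weight_map(atlas, freq)
```

In the toy example, all voxels of region 1 receive 0.2 and all voxels of region 2 receive 0.4, so the map is piecewise constant over the parcellation.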

Anatomical prior-informed masking strategy
Instead of the random masking strategy used in standard MAE pre-training, we propose a dedicated masking strategy that selects informative patches based on the derived sampling weight map. The input image $I$ and sampling weight map $W$ are center-cropped to a size of 128, i.e., $I \in \mathbb{R}^{128 \times 128 \times 128}$ and $W \in \mathbb{R}^{128 \times 128 \times 128}$. Subsequently, $I$ and $W$ are split into patches represented as $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and $\mathcal{W} = \{w_i\}_{i=1}^{n}$, respectively. Here, $n$ is the number of patches, and the patch size is set to 8, a choice consistent with previous studies (35). This configuration leads to $n = 16 \times 16 \times 16$, in line with the way vision transformers (12) split a 2D image into 16 × 16 tokens. The sampling probability of each patch is given by the probability vector $p = [p_1, p_2, \ldots, p_n]^{\top}$, where $p_i = \sum_{j} w_{i,j} / \sum_{i,j} w_{i,j}$, and $w_{i,j}$ denotes the sampling weight of the $j$-th voxel within the $i$-th patch, corresponding to the voxel $x_{i,j}$ of the image patch. Consequently, the visible patches that are fed into the encoder are sampled as $\mathcal{X}_{vis} = \mathrm{Sampling}(\mathcal{X}, p)$, where $\mathcal{X}_{vis} = \{x_i\}_{i=1}^{k}$ represents the visible patches sampled from the original image patches $\mathcal{X}$, $k = \eta \cdot n$ is the number of visible patches, and $\eta = 0.25$ is the sampling ratio, which corresponds to the 75% masking ratio of MAE. The $\mathrm{Sampling}(\mathcal{X}, p)$ operation draws tokens from $\mathcal{X}$ according to a multinomial distribution with the probability vector $p$; the drawn tokens constitute the visible tokens. The sampling procedure is implemented using the multinomial API from PyTorch. As depicted in Figure 4, the prior-informed sampling maintains superior structural consistency compared to random masking, which is advantageous for the calculation of region-based sampling weights.
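The patch-level probabilities and the weighted draw of visible patches can be sketched as follows. The study uses the PyTorch multinomial API; this sketch substitutes NumPy's `Generator.choice` without replacement as a stand-in, and the synthetic `weight_map` replaces the real sampling weight map. Function names are illustrative.

```python
import numpy as np

def patch_probabilities(weight_map, patch=8):
    """p_i = (sum of weights inside patch i) / (sum of all weights)."""
    d, h, w = weight_map.shape
    g = (d // patch, h // patch, w // patch)
    blocks = weight_map.reshape(g[0], patch, g[1], patch, g[2], patch)
    patch_sum = blocks.sum(axis=(1, 3, 5)).ravel()
    return patch_sum / patch_sum.sum()

def sample_visible(p, ratio=0.25, rng=None):
    """Draw k = ratio * n distinct patch indices with probabilities p."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(round(ratio * p.size))
    return np.sort(rng.choice(p.size, size=k, replace=False, p=p))

rng = np.random.default_rng(0)
# Synthetic stand-in for W; the small offset keeps every patch selectable.
weight_map = rng.random((128, 128, 128)) + 1e-6
p = patch_probabilities(weight_map)                 # n = 16 * 16 * 16 = 4096
visible = sample_visible(p, ratio=0.25, rng=rng)    # k = 1024 visible patches
```

Sampling without replacement needs at least k patches with non-zero probability. A real weight map that is zero over background patches may therefore require a small additive offset like the one used here; that offset is an assumption of this sketch, not a detail from the study.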

Patch embedding
The input visible patches in $\mathcal{X}_{vis}$ are first flattened into one-dimensional vectors, then mapped to the feature dimension $D$ via a learnable patch tokenizer $g(\cdot)$. The input of the transformer encoder $x_{enc}$ is calculated as follows: where
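The flattening and projection steps can be illustrated with a small NumPy sketch. The random matrix `W_embed` stands in for the learnable tokenizer g(·), and adding a positional table `pos` to the embedded tokens is an assumption of this sketch; the visible indices are drawn uniformly here only to keep the example short.

```python
import numpy as np

def patchify(volume, patch=8):
    """Split a volume into flattened, non-overlapping cubic patches."""
    d, h, w = volume.shape
    g = (d // patch, h // patch, w // patch)
    blocks = volume.reshape(g[0], patch, g[1], patch, g[2], patch)
    # Bring the three grid axes first, then flatten each patch to a vector.
    return blocks.transpose(0, 2, 4, 1, 3, 5).reshape(-1, patch ** 3)

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128, 128)).astype(np.float32)
patches = patchify(image)                                   # (4096, 512)

# Uniform draw for brevity; the prior-informed draw would go here instead.
visible_idx = np.sort(rng.choice(len(patches), size=1024, replace=False))

D = 384                                                     # feature dimension
W_embed = (rng.standard_normal((512, D)) * 0.02).astype(np.float32)        # stand-in for g(.)
pos = (rng.standard_normal((len(patches), D)) * 0.02).astype(np.float32)   # hypothetical positional table

x_enc = patches[visible_idx] @ W_embed + pos[visible_idx]   # (1024, 384)
```

Only the 1,024 visible tokens are embedded and passed on, which is what keeps the encoder cost low at a 75% masking ratio.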

Transformer encoder
We adopt a shifted window vision transformer, known as SW-ViT (35), as the transformer encoder in API-MAE. As shown in Figures 3C, D, the multi-head self-attention (MSA) in the original transformer block is replaced with linear window-based multi-head self-attention (LW-MSA) and shifted linear window-based multi-head self-attention (SLW-MSA) in the Swin transformer block. Both LW-MSA and SLW-MSA reduce the parameters and computations in each head, which improves the network efficiency without significant accuracy loss. The transformer encoder serves as the feature extractor in both API-MAE and the segmentation network. The output of the transformer encoder undergoes a linear projection to fit the higher feature dimension of the transformer decoder.

Transformer decoder
We use a shallow transformer decoder to reconstruct the original image in API-MAE. The inputs to the decoder consist of both visible tokens and masked tokens with positional encodings. The output of the decoder is the reconstructed image token $\hat{y}_i$ for each input patch. The reconstruction loss function is the standard L2 loss: where $x_i$ denotes the $i$-th image patch and $m$ represents the number of masked tokens. It should be noted that the reconstruction loss is computed over the masked tokens only.
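The masked-only reconstruction loss can be sketched in NumPy as below. The per-voxel averaging inside each patch is an assumption of this sketch (the study may sum instead of average); only the restriction to the m masked patches is taken from the text.

```python
import numpy as np

def masked_reconstruction_loss(patches, recon, masked_idx):
    """Mean squared error averaged over the m masked patches only."""
    diff = patches[masked_idx] - recon[masked_idx]
    per_patch = (diff ** 2).mean(axis=1)   # per-voxel mean within each patch
    return per_patch.mean()                # average over the masked patches

# Toy example with 4 patches of 8 voxels each.
patches = np.zeros((4, 8))
recon = np.zeros((4, 8))
recon[1] = 1.0     # masked patch with an error of 1 per voxel
recon[3] = 5.0     # visible patch, must not influence the loss
loss = masked_reconstruction_loss(patches, recon, masked_idx=np.array([0, 1]))
```

The large error on the visible patch 3 is ignored, so the toy loss is the mean of the two masked-patch errors, 0 and 1.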

Reconstruction discriminator
FIGURE Example of visualizing a T1 MR image using different masking strategies with a masking ratio of 0.75. The image is center-cropped to a shape of 128 × 128 × 128, and each token has a patch size of 8 × 8 × 8. Panels: Origin Image, Random Masking, Prior Informed Sampling, Probability Map of Sampled Regions.

Recent advancements in self-supervised learning, such as DiRA (42), have demonstrated that the collaborative learning of self-supervised and adversarial tasks can lead to a more generalizable representation, encompassing fine-grained semantic representation. Moreover, discriminators have been proven beneficial for the masked autoencoder (32, 43). In API-MAE, we introduce a reconstruction discriminator, expecting a synergistic effect when it is integrated with the MAE decoder. This combination aims to enhance the learned representation and improve the visual quality of the reconstructed output. The discriminator is constructed as a shallower convolutional neural network, comprising five convolutional layers tasked with distinguishing between the reconstructed and real images. The adversarial loss employed for the discriminator is represented as an L2 loss as follows: where $x_i$ is the $i$-th image patch, $\hat{y}_i$ is the corresponding reconstructed patch, and $n$ is the token number of the original image. Thus, the total loss of API-MAE is a combination of reconstruction loss and adversarial loss as follows:
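For the reconstruction discriminator, one common way to realize an L2 adversarial objective is the least-squares form sketched below. This is an assumed formulation, not the study's own formula: the targets of 1 for real and 0 for reconstructed inputs and the balancing factor `adv_weight` are hypothetical choices for illustration.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss: push D(real) toward 1 and D(reconstructed) toward 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    """The autoencoder is rewarded when D scores its reconstruction as real."""
    return np.mean((d_fake - 1.0) ** 2)

def total_loss(rec_loss, adv_loss, adv_weight=1.0):
    # adv_weight is a hypothetical balancing factor, not a value from the study.
    return rec_loss + adv_weight * adv_loss

# Toy discriminator scores for three real and three reconstructed patches.
d_real = np.array([0.9, 1.0, 0.8])
d_fake = np.array([0.1, 0.3, 0.0])
loss_d = lsgan_discriminator_loss(d_real, d_fake)
loss_g = total_loss(rec_loss=0.5, adv_loss=lsgan_generator_loss(d_fake))
```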

Segmentation network
After the pre-training of API-MAE, we discard the transformer decoder and keep the transformer encoder for the brain tumor segmentation task. The architecture of the segmentation network is shown in Figure 5. The segmentation network contains three parts: (1) the encoder, which contains the patch embedding and transformer blocks, (2) the encoder propagation, and (3) the decoder. The patch embedding layer maps the input multi-parameter MRI (i.e., T1, T1Gd, T2-FLAIR, and T2 image) patches to the embedding features. The transformer blocks share the same architecture as, and are initialized with the pre-trained weights of, the transformer encoder in API-MAE. The encoder propagation and decoder parts utilize features from the original image (i.e., $z_0$) and specific transformer layers (2nd, 4th, 6th, 8th, and last layer, i.e., $z_2$, $z_4$, $z_6$, $z_8$, $z_{12}$) to propagate features and segment the image into three target classes: whole tumor (WT), tumor core (TC), and enhanced tumor (ET). To obtain better segmentation, the segmentation network adopts cross-entropy and Dice loss with deep supervision as the segmentation loss as follows: where $i$ represents the stage of deep supervision, $\hat{S}_i$ denotes the prediction of stage $i$, and $S_i$ represents the ground truth resized to match the corresponding prediction.
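A compact NumPy sketch of a cross-entropy plus Dice loss with deep supervision follows. Several details are assumptions of this sketch rather than facts from the study: binary cross-entropy per region (WT, TC, and ET overlap), halving stage weights, and nearest-neighbour resizing of the ground truth by strided indexing.

```python
import numpy as np

def soft_dice_loss(prob, target, eps=1e-5):
    """1 - Dice, averaged over channels; prob and target are (C, D, H, W)."""
    axes = (1, 2, 3)
    inter = (prob * target).sum(axis=axes)
    denom = prob.sum(axis=axes) + target.sum(axis=axes)
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy per channel (assumed, since the regions overlap)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean()

def deep_supervision_loss(preds, target, stage_weights=None):
    """Sum CE + Dice over stages, resizing the target to each stage."""
    if stage_weights is None:
        # Hypothetical weights that halve at every coarser stage.
        stage_weights = [1.0 / (2 ** i) for i in range(len(preds))]
    total = 0.0
    for w, prob in zip(stage_weights, preds):
        stride = target.shape[1] // prob.shape[1]
        resized = target[:, ::stride, ::stride, ::stride]  # nearest-neighbour resize
        total += w * (bce_loss(prob, resized) + soft_dice_loss(prob, resized))
    return total

# Toy example: 3 region channels on an 8^3 grid, predictions at two stages.
target = np.zeros((3, 8, 8, 8))
target[:, 2:6, 2:6, 2:6] = 1.0
perfect = [target.copy(), target[:, ::2, ::2, ::2].copy()]
inverted = [1.0 - p for p in perfect]
loss_good = deep_supervision_loss(perfect, target)
loss_bad = deep_supervision_loss(inverted, target)
```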

Experiments
We pre-train the MAE model on an unannotated brain MRI dataset and evaluate the segmentation performance on an annotated brain tumor MRI dataset.
Datasets

ADNI dataset
The Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (44) is derived from a longitudinal multicenter study aimed at early detection and tracking of Alzheimer's disease (AD). In this study, we collected 7,945 skull-stripped T1 MR images and subsequently handpicked 6,415 images of superior visual quality for the pre-training dataset. This selection was made following a visual inspection of the registration results.

Evaluation metrics
Both the volumetric Dice similarity coefficient (DSC) and the surface-based Hausdorff distance (HD) are used for performance evaluation. DSC quantifies the overlap between segmentation results and annotations in voxel space, while the 95th percentile of the Hausdorff distance (HD95) measures the distance between the segmentation surface and the ground-truth surface. The calculation of HD95 is performed by the MedPy package using the analysis framework from nnFormer (45).
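As a small illustration of the volumetric metric, the Dice similarity coefficient for one binary region can be computed as below. Returning 1.0 when both masks are empty is a convention assumed for this sketch; HD95 is not reimplemented here and is left to MedPy as described above.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0   # both masks empty: convention assumed in this sketch
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel.
truth = np.zeros((8, 8, 8), dtype=np.uint8)
truth[2:6, 2:6, 2:6] = 1
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = 1
score = dice_score(pred, truth)   # overlap 48 voxels, 2 * 48 / (64 + 64) = 0.75
```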

Implementation details
Experimental settings: All the experiments are implemented using the PyTorch 1.2 framework. We use 4 NVIDIA A100 GPUs (40 GB VRAM) for MAE pre-training and an NVIDIA RTX3090 GPU (24 GB VRAM) for segmentation training and inference.
Data preprocessing: In the preprocessing stage, we employ affine registration to align individual images with the standard space. The cost function for the image registration optimization is the correlation ratio (38). To prevent the registration results from being flipped upside down, we defined the rotation search space for affine registration as follows: [−30 ] for Y-axis rotation, and [−180°, 180°] for Z-axis rotation. This configuration is intended to emphasize rotation in the X-Y plane and prevent upside-down flipping along the Z-axis. It performed effectively on our dataset of 6,415 pre-training samples. The registration optimization and transformation were executed using the FLIRT (46) toolbox from FSL. Trilinear interpolation was used to compute the intensity of new voxels during affine mapping. For the pre-training data, we employ the MONAI (47) library for data normalization and cropping. Additionally, we utilize the segmentation data preprocessing pipeline provided by nnUNet (7) to handle the multi-modality segmentation data.
Model architecture: In API-MAE, the transformer encoder contains 12 layers of linear Swin transformer blocks with a feature dimension of D = 384. The transformer decoder comprises 8 layers of vanilla transformer blocks with a feature dimension of 384. The discriminator consists of four convolution blocks with a kernel size of k = 3 and one convolution block with a kernel size of k = 1. In the segmentation network, the weights of the encoder propagation and decoder parts are initialized with He initialization (48).
Model training: For MAE pre-training, we use the AdamW optimizer with a batch size of 12 and train for 300 epochs. The initial learning rate is 1e-3. A weight decay of 5e-2 is adopted for model regularization. For the segmentation procedure, we apply the nnFormer (45) training framework with default parameters for 1,000 epochs.
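A registration call of this kind could be assembled as in the sketch below, which builds a FLIRT command line as an argument list. The file names are placeholders, the 12 degrees of freedom are an assumption for "affine", and the X- and Y-axis search ranges must be supplied by the caller; the values passed in the usage lines are illustrative placeholders, not settings taken from the study.

```python
def flirt_command(moving, template, out_image, out_matrix,
                  search_x, search_y, search_z=(-180, 180)):
    """Build a FLIRT affine registration command as an argument list."""
    cmd = ["flirt",
           "-in", moving, "-ref", template,
           "-out", out_image, "-omat", out_matrix,
           "-dof", "12",               # assumed: full affine transform
           "-cost", "corratio",        # correlation ratio cost function
           "-interp", "trilinear"]     # trilinear interpolation
    # Rotation search ranges in degrees around the X, Y, and Z axes.
    for flag, (lo, hi) in (("-searchrx", search_x),
                           ("-searchry", search_y),
                           ("-searchrz", search_z)):
        cmd += [flag, str(lo), str(hi)]
    return cmd

# Placeholder paths and placeholder X/Y ranges, for illustration only.
cmd = flirt_command("subject_T1.nii.gz", "sri24_T1.nii.gz",
                    "subject_T1_sri24.nii.gz", "subject_to_sri24.mat",
                    search_x=(-30, 30), search_y=(-30, 30))
# The list can be run with subprocess.run(cmd, check=True) where FSL is installed.
```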

Pre-training results of anatomical prior-informed MAE
As presented in Figure 6, we note distinct differences in the spatial distribution of tumor occurrence within the SRI-24 space. Specifically, gliomas are more frequently observed in the white matter regions of the middle and posterior sections of the brain, with comparatively lower frequencies in the brainstem and cerebellar regions. Table 1 shows the normalized probability of tumor occurrence among all 128 brain parcellations. Considering that ET is the most challenging region to segment, we employ the probability of the ET region for probabilistic masking.
The masking and reconstruction results are shown in Figure 7. It can be observed that random masking tends to distribute masked patches uniformly across the entire image, whereas our proposed weighted sampling strategy concentrates on more valuable and relatively contiguous regions. The disruption of contextual information in random masking makes the reconstruction task challenging and results in a blurry reconstructed image. In contrast, the proposed weighted sampling method can maintain the integrity of semantic regions, allowing for better reconstruction results.

Segmentation results on BraTS dataset
Segmentation performance on BraTS dataset
To validate the effectiveness of the proposed SSL pre-training approach in the downstream segmentation task, we conducted validation experiments using the BraTS21 dataset. The downstream brain tumor segmentation network is initialized with the pre-trained API-MAE encoder weights and subsequently fine-tuned using the BraTS21 dataset. We conducted a comparison of
The occurrence is expressed in permillage format. ET, enhanced tumor; TC, tumor core; WT, whole tumor.
As shown in Table 2, we observed that the pre-trained models demonstrate better performance, and our proposed API-MAE achieved the best performance in terms of the Dice similarity coefficient (DSC) metrics for whole tumor (WT) and tumor core (TC) and the best average performance of all three regions.

Ablation study on masking strategies
To evaluate the effectiveness of our proposed masking strategy, we conduct an ablation study on different MAE masking strategies. The compared methods include the baseline without pre-training, MAE pre-trained with random masking, and our proposed API-MAE pre-trained with the anatomical prior-informed masking strategy. Table 3 shows that API-MAE improves performance for all regions compared with vanilla MAE and the baseline. This demonstrates the effectiveness of our anatomical prior-informed masking compared with the random masking strategy. However, the improvement is marginal, which indicates that when enough annotated data are available (more than 1,000 cases in BraTS21), transformer-based models already achieve satisfactory performance and the benefit of pre-training is not substantial.

Data-efficiency analysis
To validate the data efficiency of our pre-trained model, we further train the segmentation model on a small subset of the whole training dataset. We randomly sampled 100 cases from the original training cases, while the validation and testing sets were kept the same as for the whole dataset. The compared methods include the baseline without pre-training, MAE pre-trained with random masking, and our proposed API-MAE pre-trained with the anatomical prior-informed masking strategy. The sampling process is repeated four times to mitigate selection bias.
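The repeated subset sampling can be sketched with the Python standard library as follows. The case identifiers, the total of 1,000 cases, and the seeding scheme are hypothetical choices made so that the four draws are reproducible; they are not details from the study.

```python
import random

def sample_training_subsets(case_ids, subset_size=100, repeats=4, base_seed=0):
    """Draw `repeats` independent subsets of the training cases."""
    subsets = []
    for r in range(repeats):
        rng = random.Random(base_seed + r)       # one fixed seed per repeat
        subsets.append(sorted(rng.sample(case_ids, subset_size)))
    return subsets

# Hypothetical case identifiers standing in for the training split.
case_ids = [f"case_{i:05d}" for i in range(1000)]
subsets = sample_training_subsets(case_ids)
```

Each subset would then be used to train the baseline, MAE, and API-MAE variants on identical cases, so that the reported mean and standard deviation reflect only the choice of training subset.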
The segmentation results on the small training set are shown in Table 4. MAE pre-training benefits the segmentation performance and improves the model robustness in most scenarios. The improvement from pre-training is more prominent in this small-dataset setting than with the whole dataset. The best segmentation performance for the ET and TC regions, in terms of DSC, is obtained by API-MAE, which matches the purpose of using the ET occurrence map for weighted sampling. As shown in Figure 8, training with the MAE paradigm tends to reduce falsely predicted regions and the prediction error of ET regions, particularly in difficult-to-segment regions. To further investigate the efficiency of the proposed method, we conducted an efficiency analysis of the segmentation phase, as shown in Table 5. Since the different SSL methods share the same segmentation network, specifically SW-ViT, the variations in performance arise from the encoder weights inherited from the different SSL pre-training tasks. This comparison involves distinct network architectures, namely, nnFormer, TransBTS, UNETR, and SW-ViT. All the methods were reproduced using the original code on a local server equipped with an AMD Ryzen 9 5900X CPU (3.7 GHz), 128 GB RAM (DDR4 2400 MT/s), and an NVIDIA RTX3090 GPU. For a fair comparison, we modified UNETR by adjusting its input channels to 4 and setting the patch size to 8 × 8 × 8, in alignment with SW-ViT. The computational cost was calculated using the thop package. This process entails feeding a tensor with dimensions of 2 × 128 × 128 × 128 into the network for computation and the standard segmentation procedure.
Combining the data from Tables 2, 4, we observe that nnFormer exhibits the best inference efficiency. This can be attributed to the dimensions of the embedding features in the Transformer module of the network, which are [96, 192, 384, 768]. In contrast, other Transformer models often have embedding feature dimensions of 384 or 768. This relatively shallower transformer architecture contributes to its enhanced computational efficiency. However, it may result in slightly lower segmentation performance. Higher segmentation accuracy can be achieved for both WT and TC in models with more transformer layers. However, when using a transformer encoder with many layers, such as UNETR, the number of floating point operations (FLOPs) and learnable parameters increases rapidly. In contrast, SW-ViT reduces the FLOPs and parameters with the help of shifted window-based linear transformer modules. Enhanced with SSL pre-training tasks, particularly our proposed API-MAE, the methods using SW-ViT obtain the best segmentation performance while maintaining a favorable balance in terms of segmentation time. Because certain operations within the network architecture do not parallelize efficiently on the GPU, the proposed method does not achieve optimal computational efficiency on the GPU. However, it attains decent CPU time, which maintains a reasonable balance between accuracy and efficiency.

Discussion
Recently, transformer-based models have emerged as state-of-the-art methods for 3D medical image segmentation, owing to their superiority over fully convolutional neural networks in modeling long-range dependencies and leveraging global contextual information. However, such methods often rely on a vast amount of training data for network optimization. A major challenge in training such models is the limited availability of annotated data. In this study, we address this challenge by utilizing 6,415 unannotated T1-weighted MR images from the ADNI dataset for pre-training. Our approach consistently improved the segmentation accuracy in scenarios with both large and small training sets. Although only T1-weighted images are used for pre-training, the learned weights benefit the downstream brain tumor segmentation on multi-parameter MRI. This highlights the potential of pre-training for improved medical image segmentation.
The MAE used in computer vision typically employs random masking with a high masking ratio of 0.75 and utilizes the 25% unmasked patches for encoder training. The high masking ratio can lead to the loss of contextual information in high-dimensional medical images, making image reconstruction challenging and potentially affecting the learning of generalizable features. Therefore, it is important to consider tailored sampling strategies that take into account the specific characteristics and requirements of the task at hand. In this study, we introduce an anatomical prior-informed masking strategy, where brain regions with higher tumor occurrence are more frequently sampled for pre-training. The experiments demonstrate that our proposed pre-training method enhances the performance of brain tumor segmentation and outperforms other self-supervised learning approaches. This indicates that incorporating anatomical priors into the pre-training stage leads to performance improvements in downstream tasks.
Additionally, our anatomical prior-informed sampling strategy can be considered as an attention mechanism for selecting valuable and task-related patches for MAE pre-training. In general, attention mechanisms help models filter high-value information out of a large amount of data, thereby improving computational efficiency and performance. Given the large number of image patches in the unannotated dataset, it is important to let the pre-training process attend to the informative patches. By incorporating the tumor occurrence rate and brain template into the construction of an attentive sampling strategy, our approach integrates anatomical priors with masked image modeling pre-training. This enables efficient sampling and makes the most of the unannotated data.
There are some limitations to this study. Our proposed method requires the pre-registration of the sampling weight map for each individual, a process typically executed on the CPU and incurring a time cost. In future studies, this procedure can be expedited through deep learning-based registration networks, enabling accurate and rapid registration. We showcase the advantage of integrating anatomical priors during the pre-training stage, leveraging only tumor occurrence information. In the future, more advanced anatomical priors, such as symmetric brain structure, or active learning strategies (50) hold potential for further investigation.

Conclusion
In this study, we introduce a novel pre-training technique for brain tumor segmentation transformer networks. This technique integrates an anatomical prior-informed masking strategy into the masked image modeling process. Informative image patches from brain parcellations with higher tumor occurrence are sampled more frequently, helping the masked autoencoder to focus on the regions of interest. The proposed approach demonstrates promising performance in the brain tumor segmentation task, surpassing the compared self-supervised learning methods.

FIGURE Illustration of masked modeling pre-training and masked image modeling (MIM). (A) Different masking results of masked modeling pre-training. The light white regions represent the masked regions, and the colored regions of the axial brain MR image represent different independent brain areas. (B) Illustration of masked autoencoder pre-training. (C) The whole procedure of MIM.

FIGURE Architecture of the Anatomical Prior-Informed Masked Autoencoder (API-MAE). (A) The architecture of the Anatomical Prior-Informed Masked Autoencoder. (B) The discriminator used for reconstruction. (C) The transformer block. (D) The linear Swin transformer block.

FIGURE Architecture of the baseline segmentation network. The network is made up of three parts, i.e., the Encoder part for feature extraction, the Encoder Propagation part used for channel and spatial normalization and skip connection, and the Decoder part used for upsampling and predicting the segmentation results. The convolution blocks with skip= in the Encoder Propagation part are used for downsampling, and the UNetrUpBlock in the Decoder part is used for upsampling; each block contains a deconvolution block and two residual convolution blocks.

FIGURE The occurrence frequency in the SRI-24 standard space among the cases from the BraTS21 dataset. The five columns represent the standard brain T1 MR image, the brain atlas (enhanced parc116plus) in SRI-24 space, Enhanced Tumor (ET) occurrence, Tumor Core (TC) occurrence, and Whole Tumor (WT) occurrence, respectively.

FIGURE Visual examples of the masked images and the reconstruction results of MAE and API-MAE. The five columns represent the original brain T1 MR image, the random masking strategy used in MAE, the masked image generated from API token sampling, and the reconstruction results of API-MAE, respectively.

FIGURE Example of tumor segmentation results from a testing image with training cases. The three rows are from the axial, coronal, and sagittal views. The green region represents the necrotic tumor core (NCR), the blue region represents the Gd-enhancing tumor (ET), and the red region represents the peritumoral edematous/invaded tissue (ED).

TABLE The normalized occurrence of tumor regions within different brain parcellations in the enhanced SRI-24 atlas, analyzed from the training cases of the BraTS21 dataset.
TABLE Efficiency analysis.
FLOPs stands for floating point operations, recorded in units of gigaflops. Params refers to the learnable parameters of the different network architectures, recorded in units of millions. Inference time is computed using an input tensor with dimensions of 2 × 128 × 128 × 128.
TABLE Ablation study on the segmentation performance trained on the BraTS21 dataset.
DSC means the Dice similarity coefficient, and HD95 means the 95th percentile Hausdorff distance. ↑ indicates higher is better and ↓ indicates lower is better. Bold indicates the best performance.
Results from four independent sampling processes are reported as mean ± std. ↑ indicates higher is better and ↓ indicates lower is better. Bold indicates the best performance.
DSC means the Dice similarity coefficient, and HD95 means the 95th percentile Hausdorff distance. ↑ indicates higher is better and ↓ indicates lower is better. Bold indicates the best performance, and the results of previous studies are adopted from (35).