
ORIGINAL RESEARCH article

Front. Neurol., 06 February 2026

Sec. Artificial Intelligence in Neurology

Volume 16 - 2025 | https://doi.org/10.3389/fneur.2025.1707481

Automatic and standardized reporting of perioperative MRIs in patients with central nervous system tumors

  • 1Department of Health Research, SINTEF Digital, Trondheim, Norway
  • 2Department of Clinical Neuroscience, University of Gothenburg, Gothenburg, Sweden
  • 3Department of Neurosurgery, Sahlgrenska University Hospital, Gothenburg, Sweden
  • 4Department of Radiology and Nuclear Medicine, Amsterdam University Medical Centers, Vrije Universiteit, Amsterdam, Netherlands
  • 5Institutes of Neurology and Healthcare Engineering, University College London, London, United Kingdom
  • 6Department of Neurosurgery, Elisabeth-TweeSteden Hospital, Tilburg, Netherlands
  • 7Neurosurgical Oncology Unit, Department of Oncology and Hemato-oncology, Humanitas Research Hospital, Milano, Italy
  • 8Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA, United States
  • 9Department of Biomedical Imaging and Image-guided Therapy, Medical University Vienna, Wien, Austria
  • 10Research Center for Medical Image Analysis and Artificial Intelligence, Faculty of Medicine and Dentistry, Krems, Austria
  • 11Department of Neurosurgery, Northwest Clinics, Alkmaar, Netherlands
  • 12Department of Neurosurgery, Medical University Vienna, Wien, Austria
  • 13Department of Neurosurgery, Haaglanden Medical Center, The Hague, Netherlands
  • 14Department of Neurological Surgery, Hôpital Lariboisiére, Paris, France
  • 15Department of Neurology and Neurosurgery, University Medical Center Utrecht, Utrecht, Netherlands
  • 16Department of Neurosurgery, University Medical Center Groningen, University of Groningen, Groningen, Netherlands
  • 17Department of Neurosurgery, Brigham and Women's Hospital, Boston, MA, United States
  • 18Harvard Medical School, Boston, MA, United States
  • 19Cancer Center Amsterdam, Brain Tumor Center, Amsterdam University Medical Centers, Amsterdam, Netherlands
  • 20Department of Neurosurgery, Amsterdam University Medical Centers, Vrije Universiteit, Amsterdam, Netherlands
  • 21Department of Neuromedicine and Movement Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
  • 22Department of Neurosurgery, St. Olavs Hospital, Trondheim University Hospital, Trondheim, Norway
  • 23Department of Circulation and Medical Imaging, NTNU, Trondheim, Norway

Introduction: Magnetic resonance (MR) imaging is essential for diagnosing central nervous system (CNS) tumors, guiding surgical planning, treatment decisions, and assessing postoperative outcomes and complications. While recent work has advanced automated tumor segmentation and report generation, most efforts have focused on preoperative data, with limited attention to postoperative imaging analysis.

Methods: This study introduces a comprehensive pipeline for standardized postsurgical reporting in CNS tumors. Using the Attention U-Net architecture, segmentation models were trained independently, targeting the preoperative tumor core, non-enhancing tumor core, postoperative contrast-enhancing residual tumor, and resection cavity. In the process, the influence of varying MR sequence combinations was assessed. Additionally, MR sequence classification and tumor type identification for contrast-enhancing lesions were explored using the DenseNet architecture. The models were seamlessly integrated into an automated and standardized reporting pipeline following the RANO 2.0 guidelines. Training was conducted on multicentric datasets comprising 2,000 to 7,000 patients, incorporating both private and public data, using 5-fold cross-validation.

Results: Evaluation included patient-, voxel-, and object-wise metrics, with benchmarking against the latest BraTS challenge results. The segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models reached 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification.

Discussion: The pipeline presented in this study enables robust, automated segmentation, MR sequence classification, and standardized report generation aligned with the RANO 2.0 guidelines, enhancing postoperative evaluation and clinical decision-making. The proposed models and methods were integrated into Raidionics, an open-source software platform for CNS tumor analysis, which now includes a dedicated module for postsurgical analysis.

1 Introduction

Brain tumors encompass a diverse range of neoplasms with highly variable prognoses, ranging from benign to highly aggressive forms. The World Health Organization (WHO) currently classifies over 100 subtypes based on molecular and histological profiles (1). While some patients experience prolonged survival, many face rapid neurological and cognitive decline (2). Accurate tumor characterization is crucial for prognosis, treatment planning, and surgical decision-making. However, the inherent biological complexity of CNS tumors presents significant challenges. CNS tumors are broadly categorized as primary or secondary. Primary tumors, such as gliomas and meningiomas, originate within the brain or its supporting tissues. Gliomas include the most aggressive and treatment-resistant forms (e.g., glioblastoma), as well as slowly progressive yet highly infiltrative forms ultimately undergoing malignant transformation (e.g., diffuse lower-grade gliomas) (3, 4). Secondary tumors result from metastatic spread to the brain. Magnetic Resonance (MR) imaging is essential for tumor diagnosis, prognosis estimation, and treatment planning. Imaging-derived features such as volume, location, and structural involvement guide surgical resection, estimate postoperative risks, and inform adjuvant therapy (5). These features also underpin predictive models for clinical research and personalized care (6–8). Despite advances in imaging, tumor characterization remains largely subjective. Manual segmentation, the current gold standard, remains labor-intensive and prone to intra- and inter-rater variability (9), and is thus rarely performed in routine practice. Instead, tumor attributes are often estimated visually or using crude methods (e.g., eyeballing or short-axis diameter measurements), introducing significant inconsistencies and reducing clinical utility (10). The lack of automated and standardized segmentation limits the integration of imaging-based biomarkers into clinical workflows, and represents a weakness for clinical trial assessment both at baseline and for estimating treatment responses. Robust computational methods are needed to improve precision, reproducibility, and efficiency, bridging the gap between imaging, molecular diagnostics, and personalized treatment strategies.

Post-operative MRI is critical for evaluating surgical outcomes, planning adjuvant therapy, and monitoring disease progression. However, altered anatomy, resection cavities, and postoperative blood products greatly complicate segmentation. Residual tumors are often small and very fragmented compared to preoperative tumor cores. The BraTS challenge, long focused on preoperative segmentation, was extended in 2024 to postoperative cases (11), introducing annotations for the enhancing tissue (ET), non-enhancing tumor core (NETC), surrounding non-enhancing FLAIR hyperintensity (SNFH), and resection cavity (RC). Top-performing methods employed CNN- and Transformer-based architectures [e.g., nnU-Net (12), Swin UNETR (13)], ensembles [e.g., STAPLE (14)], and synthetic data augmentation (15). Nonetheless, performance remains significantly lower than in the preoperative setting, with lesion-wise Dice scores reaching up to 78% for NETC, 76% for ET, and 71% for RC. Other studies focusing on residual tumor segmentation attempted training from scratch (16), through active learning (17–19), or transfer learning from a preoperative model (20). In those studies, leveraging local datasets not publicly available, lower Dice scores were reported, at around 50%–60% for ET. This highlights a considerable performance gap compared to typical BraTS challenge results, underscoring greater image and annotation variability in the postoperative setting. Postprocessing is often applied to remove artifacts and enforce anatomical consistency (e.g., NETC within TC). Resection cavity segmentation remains underexplored: studies on limited datasets reported 80% Dice scores using a DenseNet variation (21) or a 3D U-Net with longitudinal data (22). Alternatively, resection cavities were simulated before being segmented using self-supervised and semi-supervised learning (23). Overall, postoperative segmentation tasks remain unsolved, highlighting the need for more robust methods. Beyond segmentation, standardized imaging reports are essential. The lack of structured reporting systems for glioma patients was identified as a major limitation in 2018 (24). Although structured assessment has the potential to improve patient care and enhance both communication and decision-making, few solutions have been proposed. A preoperative reporting framework was introduced in the Raidionics platform (25), leveraging features from automatic segmentation for major CNS tumor types. Regarding postoperative assessment, prior works proposed new scoring methods and assessments (24, 26), but these were not derived from automated MR analysis. With growing interest in postoperative segmentation and emerging models for ET and RC, there is a clear opportunity to advance postoperative standardized and automatic reporting into a readily available solution.

This work introduces a comprehensive pipeline for automated postsurgical assessment in CNS tumors, with the following contributions: (i) development of robust segmentation models for tumor core, non-enhancing tumor core, residual tumor, and resection cavity, (ii) thorough validation and benchmarking against BraTS, including ablation studies on required MR sequences, (iii) extension and refinement of image-based standardized reporting in line with RANO 2.0 guidelines, and (iv) integration into the Raidionics software, providing access to the latest segmentation, classification, and reporting methods.

2 Data

All data used in this study were obtained from a previously described private dataset (27) and the publicly available datasets from the BraTS challenges (11, 28–30). Given the diverse segmentation and classification tasks addressed in this study, specific subsets were compiled, as detailed below. An overview of the segmentation subsets is provided in Table 1. First, for the MR sequence classification task, a total of 1,000 MR scans were randomly selected for each combination of MR sequence type and acquisition timestamp (i.e., preoperative and postoperative), resulting in a final subset of 8,000 MR scans. The MR sequence types considered are: gadolinium-enhanced T1-weighted (noted t1c), T1-weighted (noted t1w), FLAIR (noted flair), and T2-weighted (noted t2w). Second, for the contrast-enhancing tumor type classification, a total of 500 preoperative t1c MR scans were randomly selected for each class (i.e., glioblastoma, meningioma, and metastasis), resulting in a final subset of 1,500 MR scans. Finally, for the segmentation task, four distinct subsets were assembled using a mixture of patient data at different timepoints. Each subset targets a specific structure to segment: the tumor core for contrast-enhancing tumors (i.e., TC) in dataset A, the non-enhancing tumor core (i.e., NETC) in dataset B, the residual tumor (or enhancing tissue) in dataset C, and finally the resection cavity in dataset D (as illustrated in Figure 1). In Table 1, the total number of unique patients, as well as the total number of investigations from different timestamps, are summarized. Additionally, the incremental decrease in the number of investigations, arising from the inclusion of additional MR sequences, is also reported. Different cut-off values were applied to determine whether the segmentation target was sufficiently visible in a given MR scan for it to be considered a positive sample: 0.175 ml for postoperative residual tumor (31), 0.05 ml for NETC, 0.1 ml for resection cavity, and 0.1 ml for TC. For the private data, tumor cores, residual tumors, and resection cavities were manually segmented in 3D by trained raters. On preoperative t1c scans, the tumor core region was defined as gadolinium-enhancing tissue, necrosis, and cysts. On postoperative t1c scans, the enhancing tissue region was defined as gadolinium-enhancing tissue. Finally, the resection cavity was defined as regions with a signal isointense to cerebrospinal fluid, potentially including air, blood, or proteinaceous materials when recent. A more detailed description of each cohort and dataset is available in Supplementary Section 1.


Table 1. Overview of the datasets used for the different segmentation tasks.


Figure 1. Illustration of the different structures targeted for model segmentation training: tumor core (for contrast-enhancing tumors), non-enhancing tumor core, postoperative residual tumor (enhancing tissue), resection cavity in t1c MR scan, and resection cavity in FLAIR MR scans (from left to right).

3 Methods

3.1 Segmentation models training

Segmentation models were trained following the pipeline illustrated in Figure 2, with preprocessing variations based on the available input MR sequences. Each step of the pipeline is further detailed in the following paragraphs.


Figure 2. Illustration of the segmentation model training pipeline including the following four steps: MR sequence input selection, preprocessing, training using the Attention U-Net architecture, and finally post-processing.

3.1.1 Input selection and preprocessing

Given that one or multiple MR scans may be available for any given patient, incremental combinations of input MR sequences are supported. Additionally, pairs of input sequences can be subtracted to generate new input channels. The inclusion of such subtraction-based inputs is indicated by an additional marker (d) within the experiment name (e.g., t1d for the subtraction between t1c and t1w scans). The subtraction between t1c and t1w scans is inspired by the best-performing team for meningioma segmentation in the BraTS 2024 challenge and by the official annotation protocol of the BraTS challenge (11). Such an input was provided to help identify subtle areas of enhancement and to distinguish areas of intrinsic T1 hyperintensity from enhancement.

The following preprocessing steps were applied in the specified order: (i) resampling to an isotropic voxel spacing of 1 mm³ using first-order spline interpolation, (ii) tight cropping around the patient's head to exclude the background, (iii) subtraction of two input sequences to create a new difference input channel (when applicable), (iv) intensity clipping to the [0, 99.5] percentile range, and (v) zero-mean normalization of all nonzero values.
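As an illustration, this chain can be sketched with NumPy and SciPy. The function below is a minimal example assuming each scan arrives as a NumPy array together with its voxel spacing; all names are illustrative and not taken from the Raidionics code base.

```python
import numpy as np
from scipy import ndimage

def preprocess(volume: np.ndarray, spacing: tuple[float, float, float]) -> np.ndarray:
    # (i) Resample to 1 mm isotropic spacing using first-order spline interpolation.
    volume = ndimage.zoom(volume, zoom=spacing, order=1)
    # (ii) Tight crop around the head: bounding box of the non-background voxels.
    coords = np.argwhere(volume > volume.min())
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    volume = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    # (iv) Intensity clipping to the [0, 99.5] percentile range.
    low, high = np.percentile(volume, [0.0, 99.5])
    volume = np.clip(volume, low, high).astype(np.float32)
    # (v) Zero-mean normalization computed over the nonzero voxels only.
    nz = volume != 0
    volume[nz] = (volume[nz] - volume[nz].mean()) / (volume[nz].std() + 1e-8)
    return volume

def t1_subtraction(t1c: np.ndarray, t1w: np.ndarray) -> np.ndarray:
    # (iii) Difference channel (t1d), computed once both inputs are co-resampled.
    return t1c - t1w
```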

3.1.2 Architecture design and training strategy

The Attention U-Net architecture (32) has been used in this work with the following specifications: 5 levels, filter sizes of [16, 64, 128, 256, 512], instance normalization, a dropout rate of 0.2, and an input size of 128 × 128 × 144 voxels.

The loss function was the combination of Dice and Cross-Entropy, with a final sigmoid activation for single-class segmentation tasks (i.e., including the background). Model training was conducted using the AdamW optimizer with an initial learning rate of 5e-4, combined with an annealing scheduler. Training was performed over 800 epochs with a batch size of 16 elements, using a two-step gradient accumulation strategy to achieve an effective batch size of 32 elements.
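A sketch of this setup with MONAI, which was used in this work, is given below. Note that MONAI's AttentionUnet applies its own default normalization, so reproducing the instance normalization specified above may require adapting its convolution blocks; the `loader` argument stands for an assumed DataLoader yielding (image, label) patch pairs.

```python
import torch
from monai.networks.nets import AttentionUnet
from monai.losses import DiceCELoss

model = AttentionUnet(
    spatial_dims=3,
    in_channels=1,                       # one channel per input MR sequence
    out_channels=1,                      # single-class output; sigmoid applied in the loss
    channels=(16, 64, 128, 256, 512),    # five levels with the stated filter sizes
    strides=(2, 2, 2, 2),
    dropout=0.2,
)
loss_fn = DiceCELoss(sigmoid=True)       # combined Dice + Cross-Entropy loss
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)

def train(loader):
    """Training loop sketch; `loader` is an assumed DataLoader of (image, label) patches."""
    ACCUM_STEPS = 2                      # batch size 16 x 2 steps = effective batch of 32
    for epoch in range(800):
        for step, (x, y) in enumerate(loader):
            loss = loss_fn(model(x), y) / ACCUM_STEPS
            loss.backward()
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()
        scheduler.step()
```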

For data augmentation, a random crop of 128 × 128 × 144 voxels was applied to each input sample. Subsequently, a random combination of geometric transformations, each with a 50% probability of being applied over any given axis, and intensity-based transformations, each with a 50% probability, was performed. The geometric transformations include flipping, rotation within the range [−20°, 20°], translation of up to 20% of the axis dimension, and zoom scaling of up to 15%. The intensity-based transformations include intensity scaling and shifting (up to 10%), gamma contrast adjustments in the range [0.5, 2.0], Gaussian noise addition, and patch dropout or inversion with patch sizes of 10 × 10 × 10 voxels and up to 75 elements.
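A possible MONAI realization of this augmentation policy is sketched below; the translation range in voxels (about 20% of the crop size) and the use of RandCoarseDropout to approximate the patch dropout are our assumptions.

```python
import math
from monai.transforms import (
    Compose, RandSpatialCrop, RandFlip, RandRotate, RandAffine, RandZoom,
    RandScaleIntensity, RandShiftIntensity, RandAdjustContrast,
    RandGaussianNoise, RandCoarseDropout,
)

deg20 = math.radians(20)
train_aug = Compose([
    RandSpatialCrop(roi_size=(128, 128, 144), random_size=False),
    # Geometric transforms, each applied with 50% probability (per-axis for flips).
    RandFlip(prob=0.5, spatial_axis=0),
    RandFlip(prob=0.5, spatial_axis=1),
    RandFlip(prob=0.5, spatial_axis=2),
    RandRotate(range_x=deg20, range_y=deg20, range_z=deg20, prob=0.5),
    RandAffine(prob=0.5, translate_range=(25, 25, 29)),  # ~20% of each crop axis (assumed)
    RandZoom(prob=0.5, min_zoom=0.85, max_zoom=1.15),
    # Intensity-based transforms, each applied with 50% probability.
    RandScaleIntensity(factors=0.1, prob=0.5),
    RandShiftIntensity(offsets=0.1, prob=0.5),
    RandAdjustContrast(prob=0.5, gamma=(0.5, 2.0)),
    RandGaussianNoise(prob=0.5),
    RandCoarseDropout(holes=1, max_holes=75, spatial_size=(10, 10, 10), prob=0.5),
])
```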

3.1.3 Inference and postprocessing

For the inference step, a sliding-window approach with 50% overlap between consecutive patches along each spatial dimension was used. Unlike the patch size employed during training, the inference patch size was set to 160 × 160 × 160 voxels. No data augmentation was performed over the input samples during inference. Subsequently, a two-step postprocessing refinement was designed to clean the predictions. In the first step, only the prediction probabilities lying inside a binary mask of the brain were kept. In the second step, noise in the prediction probabilities was removed by discarding connected components with a volume lower than 0.05 ml or not visible in consecutive 2D slices.
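These inference and cleaning steps are sketched below, assuming `model` outputs logits for a single foreground channel, `volume` is a preprocessed tensor of shape (1, C, H, W, D), and `brain_mask` is a precomputed binary brain mask; the 0.5 binarization threshold and the slice-counting axis are assumptions.

```python
import numpy as np
import torch
from monai.inferers import sliding_window_inference
from scipy import ndimage

def segment(model, volume, brain_mask):
    with torch.no_grad():
        logits = sliding_window_inference(
            inputs=volume,                  # preprocessed tensor (1, C, H, W, D)
            roi_size=(160, 160, 160),       # inference patch size
            sw_batch_size=1,
            predictor=model,
            overlap=0.5,                    # 50% overlap between consecutive patches
        )
        probs = torch.sigmoid(logits)[0, 0].cpu().numpy()
    probs *= brain_mask                     # step 1: keep probabilities inside the brain
    binary = probs >= 0.5                   # assumed binarization threshold
    labels, n_comp = ndimage.label(binary)  # step 2: connected-component cleaning
    for i in range(1, n_comp + 1):
        comp = labels == i
        n_slices = np.unique(np.argwhere(comp)[:, 2]).size  # slices spanned (assumed axis)
        if comp.sum() < 50 or n_slices < 2:  # 50 voxels = 0.05 ml at 1 mm isotropic spacing
            binary[comp] = False
    return binary
```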

3.2 Single timepoint and surgical standardized reporting

The proposed pipeline for standardized surgical reporting, illustrated in Figure 3, starts with a classification step to automatically identify the MR sequence of all provided input scans and the type of contrast-enhancing tumor (if applicable). Then, segmentation of multiple structures is performed using the latest models, with the possibility to include an extra ensembling step to generate more robust results. Finally, a surgical report can be generated, complementing the standardized report computed for each timepoint. Each step is described in depth in the remainder of this section.


Figure 3. Pipeline illustration for surgical reporting using any given set of input MR scans. The different steps are: MR sequence classification, contrast-enhancing tumor type classification, input-agnostic structures segmentation, model ensembling, and finally reporting generation (per timepoint and post-surgical).

3.2.1 MR scan sequence and contrast-enhancing tumor type classification

In order to automatically assign the proper MR sequence to each input scan and identify the contrast-enhancing tumor type, 3D classification models were trained. The DenseNet121 architecture (33) was used with 64 filters in the first convolution layer, a growth rate of 32, batch normalization, and an input size of 128 × 128 × 144 voxels. A single input was provided to the DenseNet121 model; for tumor type classification, this was the original t1c MR scan.

The loss function was Cross-Entropy and multi-class accuracy was employed as the metric. Model training was conducted using the AdamW optimizer with an initial learning rate of 5e-4, combined with an annealing scheduler. Training was performed over 800 epochs with a batch size of 8 elements, using a four-step gradient accumulation strategy to achieve an effective batch size of 32 elements. The same data augmentation techniques as described in the previous section were used.
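Under the stated specifications, the classifier setup can be sketched as follows with MONAI's DenseNet121; the number of output classes is four for the MR sequence task and three for the tumor type task.

```python
import torch
from monai.networks.nets import DenseNet121

clf = DenseNet121(
    spatial_dims=3,
    in_channels=1,         # single input scan of 128 x 128 x 144 voxels
    out_channels=4,        # t1c / t1w / flair / t2w (3 classes for the tumor type task)
    init_features=64,      # filters in the first convolution layer
    growth_rate=32,
)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(clf.parameters(), lr=5e-4)
```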

3.2.2 Segmentation and model ensembling

The inference process is as previously described; by default, no runtime data augmentation or model ensembling is performed. Both options can be enabled by the user to improve the robustness of the segmentation results, at the expense of longer processing time. The available runtime data augmentation techniques, using the same parameters as during training, include axis flipping, rotation, and gamma contrast. For model ensembling, the segmentation results from one to five models can be combined by returning either the mean probability (i.e., average option) or the maximum probability (i.e., amax option) for each voxel.
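The two ensembling options reduce to a voxel-wise mean or maximum over the stacked probability maps, as in this minimal sketch:

```python
import numpy as np

def ensemble(prob_maps: list[np.ndarray], mode: str = "average") -> np.ndarray:
    # Stack the per-model probability maps and reduce voxel-wise.
    stacked = np.stack(prob_maps, axis=0)
    return stacked.mean(axis=0) if mode == "average" else stacked.max(axis=0)
```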

3.2.3 Global structures' segmentation refinement

As each structure segmentation model was trained independently, a refinement step leveraging global context is appropriate to ensure consistency across all structures. In addition to the models described in this study, the FLAIR hyperintensity (i.e., SNFH) segmentation models, trained following previously published protocols (34), were included in this step.

For contrast-enhancing tumors: Preoperatively, the tumor core mask is kept unchanged and used as reference. The NETC mask is refined by retaining only regions that overlap with the tumor core mask, and the SNFH mask is modified by subtracting the tumor core region. The non-overlapping tumor core and SNFH masks can be combined to form the whole tumor mask. Postoperatively, the residual tumor mask is kept unchanged and used as reference. The resection cavity and enhancing tissue masks are subtracted from the SNFH mask.

For non contrast-enhancing tumors: No preoperative tumor core or postoperative residual tumor masks are available. In both preoperative and postoperative settings, the SNFH mask serves as the whole tumor mask. Postoperatively, the resection cavity mask is additionally subtracted from the SNFH mask.
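For contrast-enhancing tumors, these refinement rules amount to simple boolean operations over co-registered binary masks, sketched below with illustrative names:

```python
import numpy as np

def refine_preop(tc: np.ndarray, netc: np.ndarray, snfh: np.ndarray):
    """Preoperative rules: the tumor core (tc) mask is the unchanged reference."""
    netc = netc & tc            # keep NETC only where it overlaps the tumor core
    snfh = snfh & ~tc           # subtract the tumor core region from SNFH
    whole_tumor = tc | snfh     # non-overlapping union forms the whole tumor mask
    return netc, snfh, whole_tumor

def refine_postop(et: np.ndarray, cavity: np.ndarray, snfh: np.ndarray):
    """Postoperative rules: the residual tumor (et) mask is the unchanged reference."""
    snfh = snfh & ~(cavity | et)  # subtract cavity and enhancing tissue from SNFH
    return snfh
```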

3.2.4 Standardized reporting

Standardized reports for single timepoints (i.e., preoperative and postoperative) are computed in the same way as described in our previous study (27). The major variation comes from the selection of segmentation models used for the different use-cases, whereas only a single tumor model was available previously. For contrast-enhancing tumors, the tumor core segmentation model operating over gadolinium-enhanced T1-weighted MR scans is used preoperatively and the residual tumor segmentation model is used postoperatively over up to four MR sequences. For non contrast-enhancing tumors, the unified SNFH segmentation model is used (34). For all tumor types, the unified resection cavity and necrosis segmentation models are used over up to four MR sequences. The set of computed features has been extended to include diameter characteristics (i.e., long-axis, short-axis, Feret, and equivalent area), tumor-to-brain ratio, and the Brain-Grid classification system for cerebral gliomas (35).

The standardized surgical report features the same distinction between contrast-enhancing and non contrast-enhancing tumors. Preoperative and postoperative volumes for all segmented structures are reported, when applicable. In addition, post-surgical volumetric evolution percentages are reported for each segmented structure, which is equivalent to the extent of resection when the considered structure is the tumor. Finally, the overall surgical assessment, i.e., complete, near total, or subtotal resection, is provided to complement the standardized surgical report, following the latest RANO guidelines (31).
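As an illustration of the report's surgical assessment, the sketch below computes the volumetric evolution percentage and a coarse resection category. The 1 ml near-total cut-off follows our reading of the RANO resect proposal and should be treated as an assumption, not as the exact rule implemented in Raidionics.

```python
def surgical_summary(preop_ml: float, residual_ml: float) -> dict:
    # Assumes preop_ml > 0; the volumetric evolution equals the extent of resection
    # when the considered structure is the tumor.
    change_pct = 100.0 * (preop_ml - residual_ml) / preop_ml
    if residual_ml <= 0.0:
        category = "complete resection"
    elif residual_ml <= 1.0:          # assumed near-total cut-off (RANO resect reading)
        category = "near total resection"
    else:
        category = "subtotal resection"
    return {"extent_of_resection_pct": change_pct, "category": category}

# Example usage with hypothetical volumes:
print(surgical_summary(preop_ml=32.4, residual_ml=0.8))
```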

4 Validation studies

In this work, the focus lies primarily on assessing the performance of the segmentation and classification models, and as such five experiments were conducted. A single training protocol was followed for all presented models, namely a 5-fold cross-validation. All datasets were randomly split into five folds, only enforcing that a given patient's data was not featured in multiple folds when MR scans were available at multiple timestamps. During training, three folds were used as the training set, one fold as the validation set, and the remaining fold as the test set, following an iterative process. For the validation studies, both postprocessing steps were used for the residual tumor structure, while only the first step was applied for all other structures.

4.1 Metrics

For quantifying and comparing the models' performances, the following patient-wise, voxel-wise, and object-wise metrics were computed. Each metric was computed between the ground truth volume and a binary representation of the probability map generated by a trained model. The binary representation is computed for ten equally spaced probability thresholds in the range [0, 1]. For the probability threshold providing the best results, pooled estimates computed from each fold's results are then reported for each metric (36) to provide overall results, reported as mean and standard deviation (indicated by ±) in the tables. Voxel-wise and object-wise results are only reported for the positive samples, following the definition provided below.
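The threshold selection can be sketched as follows, assuming `prob_map` and `gt` hold a model's probability map and the corresponding binary ground truth; both names are illustrative.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)

def best_threshold(prob_map: np.ndarray, gt: np.ndarray) -> float:
    # Binarize the probability map at ten equally spaced thresholds in [0, 1]
    # and keep the threshold yielding the highest Dice score.
    thresholds = np.linspace(0.0, 1.0, 10)
    scores = [dice(prob_map >= t, gt) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])
```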

4.1.1 Patient-wise

Patient-wise metrics assess the classification ability of a given segmentation or classification model. For segmentation models, the cut-off volume values presented in the Data section were used to separate positive cases (i.e., including the structure of interest) from negative cases (i.e., not including the structure of interest). Furthermore, a voxel-wise Dice overlap of at least 0.1% between the model prediction and the ground truth was required for a positive case to be considered a true positive (TP). From the identification of true/false positive/negative samples at the patient level, the following metrics were then computed: recall, precision, specificity, and balanced accuracy (noted bAcc).

4.1.2 Voxel-wise

Voxel-wise metrics assess the ability of a segmentation model by considering each voxel independently. The following metrics were computed between the ground truth volume and the binary model prediction: Dice score, recall, precision, and 95th percentile Hausdorff distance (noted HD95).
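These voxel-wise metrics can be sketched as follows, with Dice, recall, and precision computed from binary overlaps and HD95 delegated to MONAI's utility (inputs shaped batch × channel × spatial):

```python
import numpy as np
import torch
from monai.metrics import compute_hausdorff_distance

def voxelwise_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    # `pred` and `gt` are binary 3D masks of identical shape.
    tp = np.logical_and(pred, gt).sum()
    dice = 2.0 * tp / (pred.sum() + gt.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    precision = tp / (pred.sum() + 1e-8)
    hd95 = compute_hausdorff_distance(
        torch.as_tensor(pred[None, None]),   # add batch and channel dimensions
        torch.as_tensor(gt[None, None]),
        include_background=True,             # single-channel binary masks
        percentile=95,
    ).item()
    return {"dice": dice, "recall": recall, "precision": precision, "hd95": hd95}
```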

4.1.3 Object-wise

Object-wise metrics assess the ability of a segmentation model to detect all components of the structure to segment (i.e., multiple tumor components). Since the trained models are not instance segmentation models, a connected-components approach coupled with a pairing strategy was employed to associate ground truth and prediction components. A strict one-to-one assignment was performed using the Hungarian algorithm. A minimum component size of 75 voxels (down to 50 voxels for the NETC structure) was enforced, and objects below the threshold were discarded. The Dice score, recall, precision, and 95th percentile Hausdorff distance were computed.
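The pairing step can be sketched as follows: connected components are extracted on both sides, components under the minimum size are discarded, and the Hungarian algorithm (SciPy's linear_sum_assignment) matches components one-to-one by maximizing pairwise Dice.

```python
import numpy as np
from scipy import ndimage
from scipy.optimize import linear_sum_assignment

def pair_components(gt: np.ndarray, pred: np.ndarray, min_size: int = 75):
    # Extract connected components from ground truth and prediction masks.
    gt_lab, n_gt = ndimage.label(gt)
    pr_lab, n_pr = ndimage.label(pred)
    gt_objs = [gt_lab == i for i in range(1, n_gt + 1) if (gt_lab == i).sum() >= min_size]
    pr_objs = [pr_lab == i for i in range(1, n_pr + 1) if (pr_lab == i).sum() >= min_size]
    # Build a cost matrix of negated pairwise Dice scores (maximize Dice = minimize cost).
    cost = np.zeros((len(gt_objs), len(pr_objs)))
    for i, g in enumerate(gt_objs):
        for j, p in enumerate(pr_objs):
            cost[i, j] = -2.0 * np.logical_and(g, p).sum() / (g.sum() + p.sum())
    rows, cols = linear_sum_assignment(cost)   # strict one-to-one assignment
    return [(gt_objs[i], pr_objs[j]) for i, j in zip(rows, cols)]
```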

4.2 Experiments

First, (i) classification performance analyses for MR sequence and tumor type identification were executed. Next, (ii) a performance analysis of the contrast-enhancing tumor core and NETC segmentation models was performed. Then, (iii) a postoperative segmentation experiment was conducted to identify the impact of using a varying number of MR sequences as input and to present the best-performing models for the enhancing tissue and resection cavity categories. Subsequently, (iv) a detailed analysis was carried out to highlight performance differences across the different cohorts in our dataset, as well as in comparison with the BraTS challenge for external benchmarking. Finally, (v) a comprehensive investigation of generalizability was performed, with identification of hard cases and annotation inconsistencies, and a sensitivity analysis was conducted to evaluate the impact of the detection criteria.

5 Results

All experiments were performed using computers with the following specifications: Intel Core Processor (Broadwell, no TSX, IBRS) CPU with 16 cores, 64 GB of RAM, Tesla A40 (46 GB) or A100 (80 GB) dedicated GPUs, and NVMe hard drives. Training and inference processes were implemented in Python 3.11 using PyTorch v2.4.1, PyTorch Lightning v2.4.0, and MONAI v1.4.0 (37). The source code used for computing the metrics and performing the validation studies, all trained models, and the inference code are publicly available through our Raidionics platform (25).

5.1 Classification performances

Classification performances for both the MR sequence and tumor type tasks are summarized in Table 2. Owing to the large dataset of 8,000 samples, MR sequence classification performance is almost perfect, with a 99% balanced accuracy score. On the other hand, classification performance for the contrast-enhancing tumor type only reaches 85% balanced accuracy. These results can be explained by the relatively smaller dataset, compared to the MR sequence classification dataset, and by the more difficult nature of the task when only leveraging t1c MR scans.


Table 2. Multiclass-averaged classification performance for distinguishing between MR sequence type (Sequence) and between contrast-enhancing tumor type (Type).

To further investigate where the classification models failed, confusion matrices were computed for each task (cf. Figure 4). For the MR sequence task, frequent misclassifications occurred between gadolinium-enhanced T1-weighted and regular T1-weighted MR scans. Similarly, confusion was common between FLAIR and T2 MR sequences. These results are expected, as each of these pairs shares similar physical imaging characteristics. Upon detailed review of the 93 misclassified cases, it was found that in 61 instances the ground truth labels were incorrect, while the model predictions were actually accurate. Therefore, only 32 cases (i.e., 0.4%) were truly misclassified. These errors were primarily attributed to extreme cropping, imaging artifacts, noise, or blurriness (as illustrated in Figure 5). Regarding the tumor type task, both glioblastomas and meningiomas were usually correctly classified. The largest confusion came from the metastasis group, more often misclassified as either of the two other groups.


Figure 4. Confusion matrices for both classification tasks where the predicted class is shown in the x-axis and the ground truth class in the y-axis.


Figure 5. Examples of misclassified MR scans due to extreme motion artifacts, reconstruction artifacts, noise, or cropping. The predicted MR sequence is indicated first and the ground truth MR sequence is specified in parentheses.

5.2 Contrast-enhancing tumor core and NETC segmentation performances

Tumor core segmentation performances in preoperative MR scans for contrast-enhancing tumors are reported in Table 3. Overall, patient-wise metrics indicate an almost perfect detection rate with 99% recall and precision. However, owing to the heavy imbalance between positive and negative samples in dataset A, the balanced accuracy and specificity were negatively impacted. Both voxel-wise and object-wise Dice scores reached high values above 85%, indicating strong performance both on average and for each tumor type individually. The different metrics are quite homogeneous across the board. The lowest voxel-wise Dice score (85%) was obtained for the metastasis group while the highest (89%) was obtained for the glioma group. Supported by the average volumes reported in Supplementary Table S1, larger tumors exhibit higher Dice scores on average. Interestingly, the object-wise Dice score for the metastasis group is almost as high as for the other two groups. Despite metastases being on average smaller and more fragmented than meningiomas and gliomas, the results showcase high and stable segmentation performance. A deeper performance investigation based on tumor type and cohort is available in Supplementary Table S5 and Supplementary Figure S1.


Table 3. Preoperative contrast-enhancing tumor core segmentation performances, over t1c MR scans from dataset A.

Segmentation performances for the NETC category are reported in Table 4. The incremental addition of MR scans as input had only a marginal effect on the average performances. Both the average pixel-wise and object-wise Dice scores went from 65.6% with one input sequence to 66.8% with three. Given the higher negative-to-positive sample ratio in dataset B, patient-wise specificity and balanced accuracy are above 70% and 80%, respectively. The final addition of FLAIR slightly worsened overall performance, with Dice scores around 65.5%, yet provided a higher recall than using t1c sequences only. The very fragmented nature of the NETC structure is clearly visible from the large difference between pixel-wise and object-wise HD95 values. By removing all fragments below a predefined volume threshold, as conventionally done in the BraTS challenge, the distance between the largest paired components is drastically reduced. A deeper performance investigation based on tumor type and NETC volume is available in Supplementary Tables S6, S7 and Supplementary Figures S2, S6.


Table 4. Non-enhancing tumor core segmentation performances over dataset B, for different sets of MR sequences as input.

5.3 Postoperative segmentation performances

First, segmentation performances for the contrast-enhancing residual tumor category in postoperative MR scans are reported in Table 5. The best performances are achieved by the model trained using all four input MR sequences, with a 70% pixel-wise and object-wise average Dice score. An overall decrease of 7% Dice can be noticed when training a model using a single MR sequence as input, and each additional input sequence iteratively boosts performances by 1%–2% Dice. Looking at the predictive performances, the model using only t1c inputs exhibits oversegmentation and struggles to distinguish patients with residual tumor from patients without. A substantial improvement in patient-wise specificity and balanced accuracy is obtained by using both the t1c and t1w MR sequences as input, with respective increases of 13% and 6.5%. Having access to information from two different but highly correlated MR sequences helped the model to better differentiate between residual tumor and blood products. However, the best patient-wise specificity score reached only 70% with all four input MR sequences, highlighting the struggle not to oversegment given the very fragmented nature of residual tumor.


Table 5. Overall segmentation performance summary for contrast-enhancing residual tumor over dataset C, with incremental inclusion of MR sequences as input.

Segmentation performances for the resection cavity in postoperative MR scans are reported in Table 6. The resection cavity often comprises a single region, and as such the object-wise metrics closely match the pixel-wise metrics. The lowest object-wise Dice score is obtained when using a single FLAIR MR scan as input, reaching only 66.5%. The peculiarity of this use-case is the mixture of resection cavities in patients suffering from glioblastoma or diffuse lower-grade glioma, potentially rendering the task more difficult due to higher variability in brain structure and appearance. Regarding the sequential inclusion of MR sequences from t1w to t2w, performances are almost identical using a single input or all four MR sequences, hovering around 78% object-wise Dice. The specific addition of the t1d input did not provide any difference in voxel-wise or object-wise performances. A slight improvement in patient-wise metrics can be noticed for specificity and accuracy. From the patient-wise results, the classification ability is relatively poor, with around 25% specificity and 60% balanced accuracy. As highlighted by the dataset distribution in Supplementary Table S4, the positive rate lies at 90%. Hence, very few negative samples are provided to the model during training for mitigating such oversegmentation behavior. Yet, the patient-wise precision is relatively high at 90%, indicating that not many patients were misclassified regarding the presence or absence of a resection cavity. The specificity values are negatively impacted by the very shallow pool of negative samples in the dataset.


Table 6. Overall segmentation performance summary for the resection cavity in dataset D, for different sets of MR sequences as input.

An overview of the models' performance for each structure of interest is provided in Figure 6, showing examples of best to worst results from left to right. Supplementary Figure S5 is a magnified version focusing on the region of interest for each case. For large tumors with a clear contrast-enhancing rim, the NETC model segmented almost perfectly, with pixel-wise Dice scores above 95%. When the rim is not visible, potentially in the presence of low-/non-contrast-enhancing tumors, the model tended to struggle. Regarding the contrast-enhancing tumor core segmentation model, clearly defined CNS tumors were almost perfectly segmented regardless of size (i.e., glioblastoma or metastasis). Identified cases of struggle often exhibit CNS tumors in unusual locations (e.g., around the brain stem) insufficiently represented in the dataset. Next, the postoperative residual tumor model faced the challenge of segmenting small, fragmented, and not clearly confined structures. Oftentimes, only part of the residual tumor was correctly segmented, omitting other smaller components around the cavity. The larger pieces of non-resected tumor, similar to the contrast-enhancing tumor core, were more easily segmented. Finally, the resection cavity segmentation model was more deficient when presented with inhomogeneous cavities displaying varying intensity levels (cf. right-most example in the last row of the figure).


Figure 6. Illustration showing the ground truth (in blue) against the produced prediction (in red) for the NETC, tumor core, residual tumor, and resection cavity from top to bottom. The resulting Dice score and total volume to segment are given in green (image best viewed digitally and in color).

5.4 Segmentation performance analysis across cohorts and BraTS challenge benchmarking

The comparison of the contrast-enhancing tumor core segmentation performances against our previous baseline and the latest BraTS challenge results from 2024 is compiled in Table 7. In our previous study, a specific tumor core segmentation model was trained for each tumor type, while a unified model is proposed in this article for all contrast-enhancing tumors. For both the glioma and meningioma groups, the voxel-wise performances improved from training a unified model, with Dice score gains of 5% and 2%, respectively. The results over the glioma category are quite meaningful to interpret, given the relatively large size of both datasets, with more than 2,000 patients included. However, the difference in dataset size for both the meningioma and metastasis categories makes a direct comparison difficult, as tumor expression and variability might greatly differ.


Table 7. Preoperative contrast-enhancing tumor core segmentation performances, compared to the latest BraTS challenge performances and our previous baseline (25).

Segmentation performances of postoperative contrast-enhancing residual tumors across cohorts with more than 100 samples are reported in Table 8. For a fair comparison with the BraTS challenge, performances are reported for the model trained using all four MR sequences as input. A complete report over all cohorts is available in Supplementary Table S8. A direct comparison with the results from the BraTS 2024 challenge is not possible since the test set is not publicly available. As a result, the latest official leaderboard results over the validation set (available on Synapse) were used in the table for benchmarking purposes. An average lesion-wise Dice score of 76.3% was obtained by the best-performing team for the enhancing tissue category, computed over 188 samples. With our model, averaged over the 1,316 available samples from the BraTS challenge, an object-wise Dice score of 79.7% is obtained, hence on par with the state-of-the-art. However, large performance variations can be noticed between the main dataset's cohorts. The pixel-wise Dice score decreases to 61.6% for the STO cohort and goes further down to 47% for the SUH cohort. A similar trend is visible across the different cohorts for the patient-wise and object-wise metrics. The performance drop seems to be highly correlated with the average residual tumor volume of each cohort (cf. Supplementary Table S3), going from 15 ml for the BraTS cohort down to 3.7 ml for the SUH cohort.


Table 8. Contrast-enhancing tumor segmentation performances for cohorts with at least 100 patients, compared to the latest BraTS challenge performances, using all four input sequences.

For the resection cavity, performances over cohorts with at least 100 samples are reported in Table 9, using the same method as described above to benchmark against the BraTS 2024 challenge. A complete report over all cohorts is available in Supplementary Tables S9, S10. An average lesion-wise Dice score of 71.5% was obtained by the best-performing team for resection cavity segmentation. With our best model over the BraTS cohort, an object-wise Dice score of 76.33% is obtained. The inter-cohort variability is much lower, with average Dice scores around 80%. Interestingly, as reported in Supplementary Table S4, average resection cavity volumes are relatively similar across all cohorts, at around 20 ml.


Table 9. Resection cavity segmentation performances for cohorts with at least 100 patients, compared to the latest BraTS challenge performances, using all four input sequences.

5.5 Generalizability investigation

For the STO cohort, available metadata included MR scanner manufacturer, scanner type, acquisition protocol, and field strength (cf. Supplementary Tables S5, S6). Across these metadata categories, no significant differences in segmentation performance were observed between the main groups (see Supplementary Tables S13–S20). This suggests proper generalizability of the trained models despite variability in MRI acquisition protocols.

A visual inspection of cases where the different models provided poor metrics was performed and typical examples have been assembled in Figure 7. Cases A and B illustrate ground truth noise, likely due to semi-automatic annotations that were not fully corrected. Case C shows a postoperative annotation performed on a different tumor than the surgical target, leading to an erroneous residual tumor assessment. Case D depicts a correctly annotated residual tumor, yet another tumor region was also fully segmented (i.e., a mix of ET and NETC classes). Case E represents a scenario where the residual tumor was correctly annotated but other contrast-enhancing regions untargeted by surgery were omitted. Together, cases C, D, and E highlight inconsistencies common in multi-purpose datasets, where annotators and protocols vary across institutions and tasks. Regarding model limitations, cases F and G demonstrate pitfalls of skull-stripped inputs, where atypical brain borders caused confusion due to abnormal gradients and contrast. Finally, case H features RC model predictions over a likely old resection cavity not marked in the ground truth, while case I highlights RC model over-segmentation of NETC regions in tumor areas not targeted by surgery. Additional examples can be accessed in Supplementary Figures S6–S8.


Figure 7. Cases where the models exhibited poor performance metrics, each separated by light blue lines and assigned a blue letter (A-I). In the first two rows, the ground truth is represented in red. In the bottom row, the model predictions are represented with a color map.

For the patient-wise classification metrics, a Dice threshold of 0.1% is extremely permissive for categorizing true positive cases. As reported in Table 10, both patient-wise recall and precision decline progressively with stricter thresholds. At a threshold of 25%, a patient-wise recall of 91.51% and a precision of 82.72% remain relatively close to the values obtained with the most permissive threshold. Higher thresholds substantially reduce false positives, but at the cost of missing many true positives. Yet, using the Dice score as a detection threshold should be interpreted with care and in relation to object size. For small structures such as residual tumors averaging 5 ml (i.e., 5,000 voxels), a 50% Dice score does not necessarily indicate poor performance. For example, this could correspond to only 1,000 voxels being misclassified, often reflecting minor boundary discrepancies rather than clinically meaningful errors.


Table 10. Sensitivity analysis results showing the impact of varying detection thresholds on overall performances for the ET segmentation model using all input sequences.

6 Discussion

This study presents a comprehensive investigation into standardized pre- and post-surgical automatic assessment reporting for central nervous system tumors. First, unified segmentation models for both preoperative contrast-enhancing and non-enhancing tumor core structures were introduced, using the Attention U-Net architecture. Second, postoperative segmentation models for contrast-enhancing residual tumor and resection cavity were developed, leveraging the same architecture with variations in MR sequence inputs. Finally, classification models for MR sequence identification and tumor type differentiation were explored using DenseNet. The primary contribution of this study is achieving state-of-the-art segmentation performance, validated against the latest BraTS challenge results. Building on these models, an automated surgical reporting pipeline was proposed, aligned with the latest RANO 2.0 guidelines. Lastly, all models and methods are integrated into the open-access Raidionics software, providing a standardized reporting solution for clinical use.

The preoperative contrast-enhancing tumor core segmentation dataset includes a large and diverse patient population, covering all major CNS tumor subtypes. For contrast-enhancing tumors, the t1c MR sequence remains the most informative and critical modality. A key limitation, however, is the inconsistent availability of additional MR sequences across patients, except in the BraTS cohort. In clinical practice, multiple sequences (i.e., t1c, t1w, and flair) are essential for optimal treatment planning and prognosis. However, since all four standard sequences are not always available at every center, developing models able to perform well using a single sequence has high practical value. To create unified segmentation models applicable to all CNS tumor types, the tumor core and non-enhancing tumor core structures were considered separately in datasets A and B. This choice reflects the significant under-representation of necrotic and cystic regions in meningiomas and metastases, which would otherwise introduce class imbalance and complexity in a mixed model. For postoperative segmentation of contrast-enhancing residual tumor, incorporating multiple sequences becomes imperative. For example, separating blood products from residual tumor requires looking at both t1c and t1w scans. To enhance contrast differences and reduce confusion, using the t1 subtraction (i.e., t1d) as an additional input was experimented with but did not provide a meaningful impact. Alternatively, Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) maps could be considered as input sequences postoperatively in the future. Both sequences could help further disambiguate residual disease, exhibiting restricted diffusion, from hemorrhage and infarction. However, these sequences are not consistently acquired, and their integration would require robust handling of missing modalities. In dataset C, where sequence availability varies, increasing the number of required input modalities reduces the number of usable cases. Yet, sufficient data remains to train effective models across all configurations. Dataset D is an exception, containing a higher proportion of non contrast-enhancing cases. Mixing contrast-enhancing and non contrast-enhancing samples supports the goal of unified models. Finally, a resection cavity model operating primarily on FLAIR scans is essential for complete surgical reporting across all major CNS tumor types, especially if t1 sequences are unavailable. To expand training for this configuration, BraTS cases were included, originally segmented on t1c and then co-registered to FLAIR. Notably, resection cavities are often undefined for meningiomas, as these tumors lie outside of the brain.

MR sequence classification achieved near-perfect performance, with only 32 misclassifications out of 8,000 cases. Each error was linked to acquisition issues such as motion or illumination artifacts. Even when correctly classified, scans with severe artifacts may still be unsuitable for reliable segmentation. On the other hand, tumor type classification exhibited limitations, reflected by the lower balanced accuracy of 85%. The decision to use only t1c as input was driven by dataset constraints, restricting the model to morphological cues which are insufficient for robust differentiation. Integrating anatomy-guided inputs or leveraging the high-performing tumor core segmentation model could direct the network's attention to relevant regions and subtle intra-tumoral patterns (e.g., rim thickness, necrosis). Additionally, incorporating complementary sequences, such as FLAIR and T2 when available, could enhance classification performance and enable unified models for both contrast-enhancing and non-enhancing tumors.

For preoperative tumor core segmentation of contrast-enhancing CNS tumors, the unified model outperformed type-specific models, surpassing both our previous results and the latest BraTS challenge leaderboards. Using only the t1c input does not appear to be a limitation, as performance was on par with or higher than the BraTS results obtained with all four sequences. Postoperatively, the contrast-enhancing residual tumor segmentation models performed well, achieving an object-wise Dice score of 70% and an HD95 of 4.32 mm. Including t1w alongside t1c improved the object-wise Dice by 3%, and adding flair and t2 yielded another 3% gain. However, these gains must be interpreted with caution due to potential biases introduced by varying sample sizes across configurations. Specifically, a direct comparison of performance across input combinations is limited, as each was evaluated on a different subset of patients. Notably, when all four MR sequences were included, the BraTS cohort constituted 60% of dataset C, introducing a potential bias toward its distribution. Fortunately, including t1w reduced the sample size by only 30 patients, making this configuration both clinically meaningful and more representative, as t1w helps distinguish residual tumor from postoperative blood products. In a previous work (16), an ablation study assessing the isolated impact of each sequence on overall performance was performed over dataset C, excluding the BraTS cohort. The results indicated a similar trend, with improved performance from using t1w in addition to t1c as input, while flair and t2w inclusion had limited impact. For resection cavity segmentation, incremental sequence inclusion had minimal effect, with stable object-wise Dice scores around 78%. The specific inclusion of t1d inputs did not have a noticeable impact on overall performance. The use of FLAIR sequences only yielded lower performance, down to 66%. An explanation might come from the type of CNS tumor featured in the different cohorts. Dataset D differs from the others by including more non contrast-enhancing tumors, which may explain some of the variability. The models' classification ability, reflected in the patient-wise metrics, should be interpreted with care. Most cohorts have a highly skewed positive-to-negative ratio, whereby nearly all patients exhibit the structure of interest to be segmented. As such, low specificity and balanced accuracy values are almost inevitable, strongly influenced by the data distribution. In this context, patient-wise recall and precision provide a more meaningful assessment, particularly from a clinical perspective. Ensuring all patients with tumor are detected (i.e., recall) while minimizing false positives (i.e., precision) is critical, as missing a tumor region carries greater risk than over-segmentation. Specific model training techniques, known to better handle large variations in sample distribution or class imbalance (e.g., bootstrapping, hard-mining), could be investigated. Alternatively, architectures able to perform classification and segmentation at the same time (e.g., Mask R-CNN) could help mitigate this issue.

Postoperative imaging is inherently complex and many structures are fragmented, making reliance on a single metric potentially misleading. Multi-metric validation combining patient-wise, voxel-wise, and object-wise analyses provides a more comprehensive understanding of model performance, capturing both fine-grained segmentation accuracy and clinically relevant detection capabilities. In addition, ground truth inconsistencies are expected in large, multi-institutional datasets, making it challenging to achieve pixel-perfect annotations. While such noise is generally tolerable for training robust models, some voxel-wise metrics might be inflated (e.g., HD95, as highlighted in Supplementary Figure S6). Object-wise metrics, which discard clinically irrelevant fragments below a predefined threshold as done in the BraTS challenge, provide a more stable and clinically meaningful assessment. Adaptive ROI-based voxel-wise metric computation could be investigated, focusing on regions near the preoperative tumor site targeted for surgery.

Analysis of inter-cohort variability reveals several factors influencing residual tumor segmentation performance. Notably, as observed in the per-cohort average volumes, a substantial difference exists between BraTS and the other cohorts, with structures nearly twice as large. Such a volume discrepancy likely explains the 20% Dice score gap, as larger targets are generally easier to segment. Another key distinction is the timing of postoperative imaging during patient care. While all other cohorts include only early postoperative scans, BraTS comprises multiple postoperative time points. Scans acquired several months after surgery typically exhibit reduced cavity content and resolution of surgical blood products, making residual tumor boundaries more discernible. These differences underscore the significant variability in MR imaging across institutions and highlight the importance of diverse, multi-institutional datasets for training models with strong generalizability. Balancing data quality and quantity remains particularly challenging for residual tumor segmentation. Additional variability may stem from differences in surgical practices (e.g., intraoperative decisions, extent of resection) and treatment strategies (e.g., surgery, chemotherapy). This interpretation is reinforced by the more consistent performance observed across cohorts for resection cavity segmentation, suggesting that this task is less affected by clinical variability.

A direct and exact comparison with BraTS challenge performance is not feasible due to the unavailability of their test sets and potential differences in design choices. Specifically, the implementation details of metric computation can significantly impact the results, most notably the pairing strategy used to compute object-wise metrics (referred to as lesion-wise in BraTS). Nonetheless, we believe that our own implementation provides a fair and consistent assessment of model performance. A key source of variation lies in the threshold used to determine whether a model prediction is considered correct, either on a patient-wise or object-wise basis. For instance, adopting a very low Dice threshold (e.g., 0.1%) between ground truth and detection favors patient-wise recall but results in lower average Dice scores. Conversely, using a stricter threshold such as 50% Dice aligns more closely with computer vision conventions, yielding improved pixel-wise and object-wise performance but at the cost of reduced sensitivity. While this stricter threshold may reduce patient-wise recall, it arguably better reflects clinical relevance through qualitative visual agreement. For more robust statistical comparisons across models, limiting the analysis to patients with all four MR sequences would be a more rigorous choice. However, in dataset A, excluding several hundred cases would reduce tumor heterogeneity and introduce a bias toward the predominantly American BraTS cohort.
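A toy example makes this threshold sensitivity concrete. The sketch below, which is illustrative and independent of the actual validation code, scores a barely overlapping detection as a true positive under a lenient threshold but as a miss under a strict one.

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

# Toy 1D masks: 100-voxel ground truth, prediction overlapping 5 voxels.
gt = np.zeros(1000, dtype=bool)
gt[:100] = True
pred = np.zeros(1000, dtype=bool)
pred[95:195] = True

d = dice(gt, pred)  # 2 * 5 / (100 + 100) = 0.05
for threshold in (0.001, 0.5):
    verdict = "true positive" if d >= threshold else "missed"
    print(f"Dice={d:.2f}, threshold={threshold}: {verdict}")
# The 0.1% threshold accepts this detection (boosting patient-wise
# recall), whereas the 50% threshold rejects it.
```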

The proposed standardized surgical report offers an almost complete quantitative depiction of key structures for both contrast-enhancing and non contrast-enhancing tumors. Volumetric assessments of the brain, tumor core, non-enhancing tumor core, resection cavity, and FLAIR hyperintensity enable evaluations aligned with the RANO 2.0 guidelines for CNS tumors. The inclusion of MR sequence and tumor type classification models in the reporting pipeline also reduces the burden on the user: the reporting process requires only a folder containing MR acquisitions organized into preoperative and postoperative scans. Segmentation of the most relevant structures in contrast-enhancing and non contrast-enhancing CNS tumors is fully supported. Fine-grained tumor subtype classification or image-based biomarker prediction would be interesting to investigate. Furthermore, incorporating segmentation models for postoperative complications, such as hemorrhages or infarctions, remains an important direction for enhancing clinical utility.
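As a purely hypothetical sketch of the folder-based input described above (the exact conventions are documented in the Raidionics repository; the folder names and file patterns here are illustrative assumptions), a minimal pre-flight check might look as follows:

```python
from pathlib import Path

def check_patient_folder(root: str) -> None:
    """Verify that a patient folder contains pre- and postoperative scans.

    Hypothetical layout: <root>/preoperative/*.nii.gz and
    <root>/postoperative/*.nii.gz; names are illustrative only.
    """
    base = Path(root)
    for phase in ("preoperative", "postoperative"):
        scans = sorted((base / phase).glob("*.nii.gz"))
        if not scans:
            raise FileNotFoundError(f"No NIfTI scans found under {base / phase}")
        print(f"{phase}: {[s.name for s in scans]}")

check_patient_folder("./patient_001")
```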

Future work could explore the development of models capable of segmenting all relevant structures simultaneously, or incorporating global context refinement mechanisms. Although post-hoc refinement mitigates inconsistencies, it cannot fully resolve volumetric conflicts arising from independent predictions. A single unified multi-class model could intrinsically enforce anatomical relationships during training; however, it would likely require a highly curated and specific dataset, which remains challenging to assemble outside of the BraTS challenge. In addition, severe class imbalance is problematic, for example because the NETC structure is not consistently present across all tumor types. Current state-of-the-art strategies in BraTS adopt an additive approach by segmenting the enhancing tumor, tumor core, and whole tumor classes, hence bypassing the challenges posed by the NETC structure. Another priority is resolving the ambiguity between resection cavities and NETC regions, which often exhibit similar appearances on t1c. This issue becomes particularly problematic in re-operation scenarios, where preexisting cavities can lead to inaccurate volumetric measurements in standardized preoperative reports. Since preoperative models are not trained to distinguish such occurrences, employing a STAPLE-based fusion approach incorporating both preoperative and postoperative segmentation masks may improve overall accuracy. Another promising direction involves the development of multi-input segmentation models able to accommodate variable combinations of MR sequences (38). Data augmentation techniques, such as randomly masking one or more sequences during training (see the sketch below), may improve robustness. To further capitalize on the existing datasets, including patients with missing sequences would be beneficial; in this context, the use of generative diffusion models to synthesize absent MR scans also warrants exploration. Lastly, performing clinical validation of the standardized report is essential. In particular, assessing the predictive value of the automatically derived measurements in relation to clinical outcomes, such as survival and quality of life, will be key to ensuring the clinical utility and adoption of such tools. A similar approach to Majewska et al. (39), assessing the prognostic value of automatic extent-of-resection computation, will be followed.
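A minimal sketch of the sequence-masking augmentation mentioned above, assuming a multi-channel input tensor with one MR sequence per channel; the drop probability and PyTorch framing are illustrative assumptions, not the training setup used in this study:

```python
import torch

def random_sequence_masking(x: torch.Tensor, p: float = 0.25) -> torch.Tensor:
    """Randomly zero out whole MR sequences (channels) during training.

    x: tensor of shape (batch, channels, D, H, W), one channel per MR
    sequence (e.g., t1c, t1w, t2, flair). Each channel is dropped with
    probability p, while guaranteeing at least one kept channel per sample.
    """
    b, c = x.shape[:2]
    keep = torch.rand(b, c, device=x.device) >= p
    for i in range(b):
        if not keep[i].any():
            keep[i, torch.randint(c, (1,))] = True  # keep one at random
    mask = keep.view(b, c, *([1] * (x.dim() - 2))).to(x.dtype)
    return x * mask

x = torch.randn(2, 4, 8, 8, 8)  # batch of 2 patients, four MR sequences
x_aug = random_sequence_masking(x)
```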

Data availability statement

The datasets analyzed in this study cannot be made publicly available in accordance with the ethical approval for this study. Requests to access the datasets should be directed to david.bouget@sintef.no. Accession codes for the Raidionics environment, with all related information, are available at https://github.com/raidionics. More specifically, all trained models can be accessed at https://github.com/raidionics/Raidionics-models/releases/tag/v1.3.0-rc, and the Raidionics software can be found at https://github.com/raidionics/Raidionics. Finally, the source code used to compute the validation metrics is available at https://github.com/dbouget/validation_metrics_computation.

Ethics statement

The studies involving humans were approved by the Norwegian regional ethics committee (REK ref.: 2013/1348 and 2019/510), the Medical Ethics Review Committee of VU University Medical Center (IRB00002991, 2014.336), and the Boston IRB (number 2023P001681). The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

DB: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. MF: Methodology, Writing – review & editing. AJ: Data curation, Writing – review & editing. FB: Data curation, Writing – review & editing. HA: Data curation, Writing – review & editing. LB: Data curation, Writing – review & editing. MB: Data curation, Writing – review & editing. SH-J: Data curation, Writing – review & editing. JF: Data curation, Writing – review & editing. AI: Data curation, Writing – review & editing. BK: Data curation, Writing – review & editing. GW: Data curation, Writing – review & editing. RT: Data curation, Writing – review & editing. EM: Data curation, Writing – review & editing. PR: Data curation, Writing – review & editing. MW: Data curation, Writing – review & editing. TS: Data curation, Writing – review & editing. PD: Data curation, Funding acquisition, Project administration, Supervision, Writing – review & editing. OS: Funding acquisition, Project administration, Supervision, Writing – review & editing, Data curation, Methodology. IR: Funding acquisition, Project administration, Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. DB, MF, IR, and OS are partly funded by the Norwegian National Research Center for Minimally Invasive and Image-Guided Diagnostics and Therapy. PD and FB were supported by an unrestricted grant from the Stichting Hanarth fonds ("Machine learning for better neurosurgical decisions in patients with glioblastoma"); a grant for public-private partnerships (Amsterdam UMC PPP-grant) sponsored by the Dutch government (Ministry of Economic Affairs) through the Rijksdienst voor Ondernemend Nederland (RVO) and Topsector Life Sciences and Health (LSH) ("Picturing predictions for patients with brain tumors"); a grant from the Innovative Medical Devices Initiative program, project number 10-10400-96-14003; The Netherlands Organisation for Scientific Research (NWO), 2020.027; a grant from the Dutch Cancer Society, VU2014-7113; and the Anita Veldman foundation, CCA2018-2-17. AJ received grants from the Swedish state under the agreement between the Swedish government and the county councils concerning economic support of research and education of doctors (ALF-agreement ALFGBG-1006089).

Acknowledgments

Data were processed in digital labs at HUNT Cloud, Norwegian University of Science and Technology, Trondheim, Norway.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fneur.2025.1707481/full#supplementary-material

References

1. Louis DN, Perry A, Wesseling P, Brat DJ, Cree IA, Figarella-Branger D, et al. The 2021 WHO classification of tumors of the central nervous system: a summary. Neuro Oncol. (2021) 23:1231–51. doi: 10.1093/neuonc/noab106

2. Day J, Gillespie DC, Rooney AG, Bulbeck HJ, Zienius K, Boele F, et al. Neurocognitive deficits and neurocognitive rehabilitation in adult brain tumors. Curr Treat Options Neurol. (2016) 18:1–16. doi: 10.1007/s11940-016-0406-5

3. Chen R, Smith-Cohn M, Cohen AL, Colman H. Glioma subclassifications and their clinical significance. Neurotherapeutics. (2017) 14:284–97. doi: 10.1007/s13311-017-0519-x

4. Jaeckle K, Decker P, Ballman K, Flynn P, Giannini C, Scheithauer B, et al. Transformation of low grade glioma and correlation with outcome: an NCCTG database analysis. J Neurooncol. (2011) 104:253–9. doi: 10.1007/s11060-010-0476-2

5. Kickingereder P, Burth S, Wick A, Götz M, Eidel O, Schlemmer HP, et al. Radiomic profiling of glioblastoma: identifying an imaging predictor of patient survival with improved performance over established clinical and radiologic risk models. Radiology. (2016) 280:880–9. doi: 10.1148/radiol.2016160845

6. Mathiesen T, Peredo I, Lönn S. Two-year survival of low-grade and high-grade glioma patients using data from the Swedish Cancer Registry. Acta Neurochirur. (2011) 153:467–71. doi: 10.1007/s00701-010-0894-0

7. Sawaya R, Hammoud M, Schoppa D, Hess KR, Wu SZ, Shi WM, et al. Neurosurgical outcomes in a modern series of 400 craniotomies for treatment of parenchymal tumors. Neurosurgery. (1998) 42:1044–55. doi: 10.1097/00006123-199805000-00054

8. Zinn PO, Colen RR, Kasper EM, Burkhardt JK. Extent of resection and radiotherapy in GBM: a 1973 to 2007 surveillance, epidemiology and end results analysis of 21,783 patients. Int J Oncol. (2013) 42:929–34. doi: 10.3892/ijo.2013.1770

9. Binaghi E, Pedoia V, Balbi S. Collection and fuzzy estimation of truth labels in glial tumour segmentation studies. Comput Methods Biomechan Biomed Eng. (2016) 4:214–28. doi: 10.1080/21681163.2014.947006

10. Berntsen EM, Stensjøen AL, Langlo MS, Simonsen SQ, Christensen P, Moholdt VA, et al. Volumetric segmentation of glioblastoma progression compared to bidimensional products and clinical radiological reports. Acta Neurochirur. (2020) 162:379–87. doi: 10.1007/s00701-019-04110-0

11. de Verdier MC, Saluja R, Gagnon L, LaBella D, Baid U, Tahon NH, et al. The 2024 Brain Tumor Segmentation (BraTS) challenge: glioma segmentation on post-treatment MRI. arXiv preprint arXiv:2405.18368. (2024).

12. Isensee F, Jaeger PF, Kohl SA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. (2021) 18:203–11. doi: 10.1038/s41592-020-01008-z

13. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin unetr: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop. Springer (2021). p. 272–284. doi: 10.1007/978-3-031-08999-2_22

14. Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging. (2004) 23:903–21. doi: 10.1109/TMI.2004.828354

15. Ferreira A, Solak N, Li J, Dammann P, Kleesiek J, Alves V, et al. How we won BraTS 2023 adult glioma challenge? Just faking it! Enhanced synthetic data augmentation and model ensemble for brain tumour segmentation. arXiv preprint arXiv:2402.17317. (2024).

16. Helland RH, Ferles A, Pedersen A, Kommers I, Ardon H, Barkhof F, et al. Segmentation of glioblastomas in early post-operative multi-modal MRI with deep neural networks. Sci Rep. (2023) 13:18897. doi: 10.1038/s41598-023-45456-x

17. Luque L, Skogen K, MacIntosh BJ, Emblem KE, Larsson C, Bouget D, et al. Standardized evaluation of the extent of resection in glioblastoma with automated early post-operative segmentation. Front Radiol. (2024) 4:1357341. doi: 10.3389/fradi.2024.1357341

18. Bianconi A, Rossi LF, Bonada M, Zeppa P, Nico E, De Marco R, et al. Deep learning-based algorithm for postoperative glioblastoma MRI segmentation: a promising new tool for tumor burden assessment. Brain Inform. (2023) 10:26. doi: 10.1186/s40708-023-00207-6

19. Cepeda S, Romero R, Luque L, García-Pérez D, Blasco G, Luppino LT, et al. Deep learning-based postoperative glioblastoma segmentation and extent of resection evaluation: Development, external validation, and model comparison. Neuro-Oncol Adv. (2024) 6:vdae199. doi: 10.1093/noajnl/vdae199

20. Ghaffari M, Samarasinghe G, Jameson M, Aly F, Holloway L, Chlap P, et al. Automated post-operative brain tumour segmentation: a deep learning model based on transfer learning from pre-operative images. Magn Reson Imaging. (2022) 86:28–36. doi: 10.1016/j.mri.2021.10.012

21. Ermiş E, Jungo A, Poel R, Blatti-Moreno M, Meier R, Knecht U, et al. Fully automated brain resection cavity delineation for radiation target volume definition in glioblastoma patients using deep learning. Radiat Oncol. (2020) 15:1–10. doi: 10.1186/s13014-020-01553-z

22. Canalini L, Klein J, de Barros NP, Sima DM, Miller D, Hahn H. Comparison of different automatic solutions for resection cavity segmentation in postoperative MRI volumes including longitudinal acquisitions. In: Medical Imaging 2021: Image-Guided Procedures, Robotic Interventions, and Modeling. SPIE (2021). p. 558–564. doi: 10.1117/12.2580889

23. Pérez-García F, Dorent R, Rizzi M, Cardinale F, Frazzini V, Navarro V, et al. A self-supervised learning strategy for postoperative brain cavity segmentation simulating resections. Int J Comput Assist Radiol Surg. (2021) 16:1653–1661. doi: 10.1007/s11548-021-02420-2

24. Weinberg BD, Gore A, Shu HKG, Olson JJ, Duszak R, Voloschin AD, et al. Management-based structured reporting of posttreatment glioma response with the brain tumor reporting and data system. J Am Coll Radiol. (2018) 15:767–71. doi: 10.1016/j.jacr.2018.01.022

25. Bouget D, Alsinan D, Gaitan V, Helland RH, Pedersen A, Solheim O, et al. Raidionics: an open software for pre-and postoperative central nervous system tumor segmentation and standardized reporting. Sci Rep. (2023) 13:15570. doi: 10.1038/s41598-023-42048-7

26. Parillo M, Quattrocchi CC. Brain tumor reporting and data system (BT-RADS) for the surveillance of adult-type diffuse gliomas after surgery. Surgeries. (2024) 5:764–73. doi: 10.3390/surgeries5030061

27. Bouget D, Pedersen A, Jakola AS, Kavouridis V, Emblem KE, Eijgelaar RS, et al. Preoperative brain tumor imaging: models and software for segmentation and standardized reporting. Front Neurol. (2022) 13:932219. doi: 10.3389/fneur.2022.932219

28. Karargyris A, Umeton R, Sheller MJ, Aristizabal A, George J, Wuest A, et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nat Mach Intell. (2023) 5:799–810. doi: 10.1038/s42256-023-00652-2

29. Moawad AW, Janas A, Baid U, Ramakrishnan D, Saluja R, Ashraf N, et al. The Brain Tumor Segmentation-Metastases (BraTS-METS) challenge 2023: brain metastasis segmentation on pre-treatment MRI. arXiv preprint arXiv:2306.00838. (2024).

30. LaBella D, Schumacher K, Mix M, Leu K, McBurney-Lin S, Nedelec P, et al. Brain Tumor Segmentation (BraTS) challenge 2024: meningioma radiotherapy planning automated segmentation. arXiv preprint arXiv:2405.18383. (2024).

31. Wen PY, van den Bent M, Youssef G, Cloughesy TF, Ellingson BM, Weller M, et al. RANO 2.0: update to the response assessment in neuro-oncology criteria for high-and low-grade gliomas in adults. J Clin Oncol. (2023) 41:5187–99. doi: 10.1200/JCO.23.01059

32. Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, et al. Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. (2018).

33. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017). p. 4700–4708. doi: 10.1109/CVPR.2017.243

34. Faanes MG, Bouget D, Jakola AS, Smith TR, Kavouridis VK, Latini F. A unified FLAIR hyperintensity segmentation model for various CNS tumor types and acquisition time points. arXiv preprint arXiv:2512.17566. (2025).

35. Latini F, Fahlström M, Berntsson SG, Larsson EM, Smits A, Ryttlefors M. A novel radiological classification system for cerebral gliomas: the Brain-Grid. PLoS ONE. (2019) 14:e0211243. doi: 10.1371/journal.pone.0211243

36. Killeen PR. An alternative to null-hypothesis significance tests. Psychol Sci. (2005) 16:345–53. doi: 10.1111/j.0956-7976.2005.01538.x

37. Cardoso MJ, Li W, Brown R, Ma N, Kerfoot E, Wang Y, et al. MONAI: an open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701. (2022).

38. Eijgelaar RS, Visser M, Müller DM, Barkhof F, Vrenken H, van Herk M, et al. Robust deep learning-based segmentation of glioblastoma on routine clinical MRI scans using sparsified training. Radiology. (2020) 2:e190103. doi: 10.1148/ryai.2020190103

39. Majewska P, Helland RH, Ferles A, Pedersen A, Kommers I, Ardon H, et al. Prognostic value of manual versus automatic methods for assessing extents of resection and residual tumor volume in glioblastoma. J Neurosurg. (2025) 142:1298–306. doi: 10.3171/2024.8.JNS24415

Keywords: 3D segmentation, attention U-net, CNS tumor, RADS, reporting

Citation: Bouget D, Faanes MG, Jakola AS, Barkhof F, Ardon H, Bello L, Berger MS, Hervey-Jumper SL, Furtner J, Idema AJS, Kiesel B, Widhalm G, Tewarie RN, Mandonnet E, Robe PA, Wagemakers M, Smith TR, De Witt Hamer PC, Solheim O and Reinertsen I (2026) Automatic and standardized reporting of perioperative MRIs in patients with central nervous system tumors. Front. Neurol. 16:1707481. doi: 10.3389/fneur.2025.1707481

Received: 17 September 2025; Revised: 01 December 2025;
Accepted: 10 December 2025; Published: 06 February 2026.

Edited by:

Zhenyu Gong, Sun Yat-sen University, China

Reviewed by:

Amirmasoud Ahmadi, Max Planck Institute for Biological Intelligence, Germany
Pietro Fiaschi, University of Genoa, Italy
S. Shailesh Nayak, Manipal Academy of Higher Education, India
Adnan Dagcinar, Marmara University, Türkiye
Sathis Kumar, Dhanalakshmi Srinivasan University, India
Vaidehi Satushe, COEP Technological University, India

Copyright © 2026 Bouget, Faanes, Jakola, Barkhof, Ardon, Bello, Berger, Hervey-Jumper, Furtner, Idema, Kiesel, Widhalm, Tewarie, Mandonnet, Robe, Wagemakers, Smith, De Witt Hamer, Solheim and Reinertsen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Bouget, david.bouget@sintef.no
