Preoperative Brain Tumor Imaging: Models and Software for Segmentation and Standardized Reporting

For patients suffering from a brain tumor, prognosis estimation and treatment decisions are made by a multidisciplinary team based on a set of preoperative MR scans. Currently, the lack of standardized and automatic methods for tumor detection and generation of clinical reports, incorporating a wide range of tumor characteristics, represents a major hurdle. In this study, we investigate the most frequently occurring brain tumor types: glioblastomas, lower grade gliomas, meningiomas, and metastases, through four cohorts of up to 4,000 patients. Tumor segmentation models were trained using the AGU-Net architecture with different preprocessing steps and protocols. Segmentation performances were assessed in depth using a wide range of voxel and patient-wise metrics covering volume, distance, and probabilistic aspects. Finally, two software solutions have been developed, enabling an easy use of the trained models and standardized generation of clinical reports: Raidionics and Raidionics-Slicer. Segmentation performances were quite homogeneous across the four different brain tumor types, with an average true positive Dice ranging between 80 and 90%, patient-wise recall between 88 and 98%, and patient-wise precision around 95%. In conjunction with Dice, the other metrics identified as most relevant were the relative absolute volume difference, the variation of information, and the Hausdorff, Mahalanobis, and object average symmetric surface distances. With our Raidionics software, running on a desktop computer with CPU support, tumor segmentation can be performed in 16–54 s depending on the dimensions of the MRI volume. For the generation of a standardized clinical report, including the tumor segmentation and features computation, 5–15 min are necessary. All trained models have been made open-access together with the source code for both software solutions and validation metrics computation.
In the future, a method to convert results from a set of metrics into a final single score would be highly desirable for easier ranking across trained models. In addition, an automatic classification of the brain tumor type would be necessary to replace manual user input. Finally, the inclusion of post-operative segmentation in both software solutions will be key for generating complete post-operative standardized clinical reports.


INTRODUCTION
Prognosis in patients with brain tumors is heterogeneous, with survival rates varying from weeks to several years depending on the tumor grade and type, and most patients will experience progressive neurological and cognitive deficits (1). Brain tumors can be classified as either primary or secondary. In the former, tumors originate from the brain itself or its supporting tissues whereas in the latter cancer cells have spread from tumors located elsewhere in the body to reach the brain (i.e., brain metastasis). According to the World Health Organization classification of tumors (2), primary brain tumors are graded by histopathological and genetic analyses and can be grouped into 100 different subtypes with frequent to relatively rare occurrences. Among the most frequent subtypes, tumors arising from the brain's supportive cell population (i.e., glial tissue) are referred to as gliomas. The more aggressive entities are labeled as high-grade gliomas (HGGs) and are graded between 3 and 4, while the less aggressive entities are referred to as diffuse lower grade gliomas (LGGs) and are graded between 2 and 3. Tumors arising from the meninges, which form the external membranous covering of the brain, are referred to as meningiomas. Aside from the aforementioned large categories, other and less frequent tumor types exist (e.g., in the pituitary, sellar, or pineal regions). Each tumor category has distinct biology, prognosis, and treatment (3,4). The most common primary malignant brain tumor type in adults is high-grade glioma, which remains among the most difficult cancers to treat with a limited 5-year overall survival (5).
For patients affected by brain tumors, prognosis estimation and treatment decisions are made by a multidisciplinary team (including neurosurgeons, oncologists, and radiologists), and based on a set of preoperative MR scans. High accuracy in the preoperative diagnostics phase is of utmost importance for patient outcomes. Judgments concerning the complexity or radicality of surgery, or the risks of postoperative complications, hinge on data gleaned from MR scans. Additionally, tumor-specific characteristics such as volume and location, or the cortical structures profile, can to a large degree be collected (6). Retrospectively, such measurements can be gathered from the analysis of surgical cohorts, multicenter trials, or registries in order to devise patient outcome prediction models (7–9). Reliable measurements and reporting of tumor characteristics are, therefore, instrumental in patient care. Standard reporting and data systems (RADSs) have been established for several solid tumors such as prostate cancer (10) and lung cancer (11). Very few attempts have been made for brain cancer in general (12) or high-grade gliomas (13). The main goal of RADSs is to provide rules for imaging techniques, terminology of reports, definitions of tumor features, and treatment response to reduce practice variation and obtain reproducible tumor classification. A broad implementation can facilitate collaborations and stimulate evaluation for the development and improvement of RADSs.
Currently, the lack of standardized and automatic methods for tumor detection in brain MR scans represents a major hurdle toward the generation of clinical reports incorporating a wide range of tumor characteristics. Manual tumor delineation or assessment by radiologists is time-consuming and subject to intra- and inter-rater variations that are difficult to characterize (14) and is, therefore, rarely done in clinical practice. As a result, informative tumor features (e.g., location or volume) are often estimated from the images solely based on crude measuring techniques (e.g., eyeballing) (15).

Related Study
From the fast-growing development in the field of deep learning, convolutional neural networks have demonstrated impressive performance in various segmentation tasks and benchmark challenges, with the added value of being fully automatic and deterministic (16). Regarding brain tumor segmentation, performances have specifically been assessed on the Brain Tumor Segmentation Challenge (BraTS) dataset (17,18). Occurring every year since 2012, the challenge focuses on gliomas (i.e., HGGs and LGGs) and has reached a notable cohort size with a total of 2,040 patients included in the 2021 edition, and multiple MR sequences included for each patient (i.e., T1c, T1w, T2, FLAIR). Segmentation performance has been assessed using the Dice similarity coefficient and the 95th percentile Hausdorff distance (HD95) as metrics (19). The current state-of-the-art is an extension of the nnU-Net architecture (20) with an asymmetrical number of filters between the encoding and decoding paths, the substitution of all batch normalization layers by group normalization, and the addition of axial attention (21). An average Dice score of 85% together with a 17.70 mm HD95 were obtained for the enhancing tumor segmentation task in high-grade gliomas. The segmentation of other brain tumor types has been sparsely investigated in the literature in comparison, possibly due to a lack of open-access annotated data, as illustrated by recent reviews or studies investigating brain tumor segmentation in general (22,23). Grovik et al. used a multicentric and multi-sequence dataset of 165 metastatic patients to train a segmentation model with the DeepLabV3 architecture (24,25). The best segmentation results were around 79% Dice score with 3.6 false positive detections per patient on average. Other prior studies have focused on using variations of the DeepMedic architecture (26), using contrast-enhanced T1-weighted MRI volumes as input, to train their segmentation models (27,28).
Datasets were of a similar magnitude with around 200 patients. However, in both cases the test sets were limited to up to 20 patients, making it difficult to assess the generalization ability of the trained models in the absence of cross-validation studies. The average Dice scores obtained over the contrast-enhancing tumor were close to 75%, with almost 8 false positive detections per patient. From a recent review on the use of machine learning applied to different meningioma-related tasks using MRI scans (29), more than 30 previous studies have investigated automatic diagnosis or grading but only a handful focused on the segmentation task. In addition, the datasets used for segmentation purposes have been consistently smaller than for the other tasks, with barely up to 126 patients in the reported studies. Laukamp et al. reported the best Dice scores using well-known 3D neural network architectures such as DeepMedic and BioMedIA, though at the expense of heavy preprocessing techniques such as atlas registration (30,31). In a previous study, we achieved equally promising performance using an attention-based U-Net architecture, reaching an average Dice score of up to 88% on contrast-enhanced T1-weighted MRI volumes (32). In addition, the cross-validation studies performed over up to 600 patients with a wide range of tumor sizes, coming from the hospital and the outpatient clinic, showed that the trained models generalize properly.
To summarize, with the exception of the BraTS challenge, there is a dearth of high-quality MRI datasets for brain tumor segmentation. Furthermore, open-access pretrained models and inference codes are scarce and can be cumbersome to operate, hence hindering the generation of private datasets for brain tumor segmentation tasks. On the other hand, open-source tools are being developed to assist in image labeling and the generation of AI models for clinical evaluation, such as MONAI Label (33) or Biomedisa (34). Yet, they do not integrate nor provide access to the latest and highest performing brain tumor segmentation models from the literature, or provide only semiautomatic methods hence requiring manual inputs from the user. From a validation standpoint, the focus has been on reporting Dice scores and often Hausdorff distances, while many other meaningful and possibly more relevant metrics exist and could be investigated to better highlight the strengths and weaknesses of the different segmentation methods (35,36).
The literature on RADSs for brain tumors is equally scarce, with only a few attempts for preoperative glioblastoma surgery (13) or post-treatment investigation (37). In the former, automatic segmentation and computation of relevant tumor features were provided, and an excellent agreement has been shown between characteristics computed over the manual and automatic segmentation. In the latter, the interpretation of the post-treatment MR scans was provided using a structured set of rules but without any automatic tumor segmentation or image analysis support.

Contributions
While research is well advanced for glioma segmentation under the aegis of the BraTS challenge community, the segmentation of meningiomas and metastases is trailing behind. In addition, validation studies in the literature have too often been dominated by Dice score reporting and a broader inspection is essential to ensure clinical relevance. Finally, the outcome of this research is often not readily available, especially for the intended end-users who are clinicians without programming experience. As such, the contributions of our study are: (i) the training of robust segmentation models for glioblastomas, lower grade gliomas, meningiomas, and metastases assessed using a panel of more than 20 different metrics to better highlight performance, (ii) the development of two software solutions enabling easy and fully automatic use of the trained models and tumor features computation: Raidionics and Raidionics-Slicer, and (iii) open-access models and source code for the software and validation metrics computation.

DATA
For this study, four different datasets have been assembled, one for each main tumor type considered: glioblastoma, lower grade glioma, meningioma, and metastasis. The tumor type was assessed at the time of surgery, when applicable, following the currently applicable guidelines (i.e., either (38) or (39)). Tumors were manually segmented in 3D by trained raters using as support either a region growing algorithm (40) or a grow cut algorithm (41,42), and subsequent manual editing. Trained raters were supervised by neuroradiologists and neurosurgeons. On contrast-enhanced T1-weighted scans, the tumor was defined as gadolinium-enhancing tissue including non-enhancing enclosed necrosis or cysts. On FLAIR scans, the tumor was defined as the hyperintense region. The four datasets are introduced in-depth in the subsequent sections. An overall summary of the data available is reported in Table 1, and some visual examples are provided in Figure 1.

Glioblastomas
The glioblastoma dataset is made of a total of 2,134 Gd-enhanced T1-weighted MRI volumes originating from 14 different hospitals, and one public challenge.

Meningiomas
The meningioma dataset is made of 719 Gd-enhanced T1-weighted MRI volumes, mostly built around a dataset previously introduced (43).

Metastases
The metastasis dataset is made of a total of 396

METHODS
First, the process for automatic brain tumor segmentation including data preprocessing, neural network architecture, and training design is introduced in Section 3.1. Second, the tumor characteristics extraction process, using the generated tumor segmentation as input, is summarized in Section 3.2. Finally, a description of the two developed software solutions for performing segmentation and standardized reporting is given in Section 3.3.

Tumor Segmentation
The architecture selected to train segmentation models for each brain tumor type is AGU-Net, which has been shown to perform well on glioblastoma and meningioma segmentation (32,44). In the following, the different training blocks are presented with some inner variations specified by roman numerals inside brackets. A global overview summarizing the variants used is provided in Table 2.
Architecture: Single-stage approach leveraging multi-scale input and deep supervision to preserve details, coupled with a single attention module. The loss function used was the class-averaged Dice loss, excluding the background. The final architecture was as described in the original article with 5 levels and [16, 32, 128, 256, 256] as convolution blocks.
Preprocessing: The following preprocessing steps were used:
1. resampling to an isotropic spacing of 1 mm³ using spline interpolation of order 1 from NiBabel¹.
2. (i) tight clipping around the patient's head, excluding the void background, or (ii) skull-stripping using a custom brain segmentation model.
3. volume resizing to 128 × 128 × 144 voxels using spline interpolation of order 1.
4. intensity normalization to the range [0, 1].
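As an illustration of steps 2(i) and 4 only, the following minimal sketch crops a volume to the bounding box of its non-background voxels and rescales intensities to [0, 1]. It is not the authors' implementation: the function names are hypothetical, and treating every voxel above a fixed `background` value as belonging to the head is an assumption. Resampling and resizing (steps 1 and 3) are left out.

```python
import numpy as np


def crop_to_foreground(volume, background=0.0):
    """Tight clipping: keep the bounding box of voxels above `background` (assumed criterion)."""
    mask = volume > background
    if not mask.any():
        return volume  # nothing to crop against
    coords = np.argwhere(mask)
    lower = coords.min(axis=0)
    upper = coords.max(axis=0) + 1  # exclusive upper bound
    slices = tuple(slice(lo, hi) for lo, hi in zip(lower, upper))
    return volume[slices]


def normalize_intensity(volume):
    """Min-max rescaling of the intensities to the range [0, 1]."""
    volume = volume.astype(np.float32)
    v_min, v_max = volume.min(), volume.max()
    if v_max == v_min:
        return np.zeros_like(volume)  # constant volume: avoid division by zero
    return (volume - v_min) / (v_max - v_min)


# Toy volume: a bright 4 x 5 x 6 block inside an empty 10 x 10 x 10 background.
toy = np.zeros((10, 10, 10))
toy[2:6, 1:6, 3:9] = np.arange(4 * 5 * 6).reshape(4, 5, 6) + 1
cropped = crop_to_foreground(toy)
normalized = normalize_intensity(cropped)
print(cropped.shape, normalized.min(), normalized.max())
```

On real MR scans the background is rarely exactly zero, so a noise-tolerant threshold would likely be needed in place of the fixed value used here.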
Training strategy: Models were trained using the Adam optimizer with a batch size of 32 samples obtained through gradient accumulation (actual batch size of 2), and training was stopped after 30 consecutive epochs without validation loss improvement, following either: (i) training from scratch with an initial learning rate of 1e-3, or (ii) transfer learning with an initial learning rate of 1e-4, fine-tuning from the best glioblastoma model.
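The two numerical ingredients of this strategy, gradient accumulation and early stopping, can be sketched without any deep learning framework. The snippet below is a hedged illustration: the `EarlyStopping` class and the list of validation losses are hypothetical, and only the values 32, 2, and 30 come from the text.

```python
TARGET_BATCH_SIZE = 32   # effective batch size stated in the text
ACTUAL_BATCH_SIZE = 2    # samples processed per forward/backward pass
PATIENCE = 30            # consecutive epochs without validation loss improvement

# Number of mini-batches whose gradients are accumulated before one optimizer step.
accumulation_steps = TARGET_BATCH_SIZE // ACTUAL_BATCH_SIZE


class EarlyStopping:
    """Hypothetical helper tracking the best validation loss seen so far."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses: improvement for 5 epochs, then a plateau.
stopper = EarlyStopping(PATIENCE)
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 40
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.update(loss):
        stopped_at = epoch
        break
print(accumulation_steps, stopped_at)
```

With these made-up losses the best value is reached at epoch 5, so the criterion triggers 30 epochs later.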
For the data augmentation strategy, the following transforms were applied to each input sample with a probability of 50%: horizontal and vertical flipping, random rotation in the range [−20°, 20°], and translation up to 10% of the axis dimension.
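A rough NumPy-only sketch of the flipping and translation transforms is given below. It is an illustration under assumptions rather than the augmentation pipeline used in the study: which array axes count as horizontal and vertical, the zero-filling of voxels exposed by a translation, and the function names are all hypothetical, and the random rotation is omitted for brevity.

```python
import numpy as np


def random_flip(volume, rng, p=0.5):
    """Flip along the first and second axes, each with probability p."""
    for axis in (0, 1):  # assumed to be the horizontal and vertical axes
        if rng.random() < p:
            volume = np.flip(volume, axis=axis)
    return volume


def random_translation(volume, rng, max_fraction=0.1, p=0.5):
    """With probability p, shift each axis by up to max_fraction of its size."""
    if rng.random() >= p:
        return volume
    out = volume
    for axis, size in enumerate(volume.shape):
        limit = int(max_fraction * size)
        shift = int(rng.integers(-limit, limit + 1))
        if shift == 0:
            continue
        out = np.roll(out, shift, axis=axis)  # returns a copy
        # np.roll wraps around, so blank the voxels that wrapped.
        index = [slice(None)] * out.ndim
        index[axis] = slice(0, shift) if shift > 0 else slice(shift, None)
        out[tuple(index)] = 0
    return out


rng = np.random.default_rng(seed=0)
sample = np.ones((128, 128, 144), dtype=np.float32)
augmented = random_translation(random_flip(sample, rng), rng)
print(augmented.shape)
```

The same transform would have to be applied to the ground truth mask so that image and annotation stay aligned.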
Training protocol: Given the magnitude difference within our four datasets, two different protocols were considered: (i) a three-way split at the hospital level whereby MRI volumes from one hospital constituted the validation fold; MRI volumes from a second hospital constituted the test fold; and the remaining MRI volumes constituted the training fold. As such, each hospital was used in turn as the test set in order to properly assess the ability of the different models to generalize. Or (ii) a 5-fold cross-validation with a random two-way split over MRI volumes whereby four folds are used in turn as a training set and the remaining one as a validation set, without the existence of a proper separate test set.
¹ https://github.com/nipy/nibabel
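Both protocols amount to enumerating index splits, which the following sketch makes explicit. The function names are hypothetical, and the rule for picking the validation hospital (the next hospital in cyclic order) is an assumption, since the text does not state how it was chosen.

```python
def leave_one_hospital_out(hospitals):
    """Protocol (i): yield (train, validation, test) lists, one split per test hospital."""
    n = len(hospitals)
    if n < 3:
        raise ValueError("a three-way split needs at least three hospitals")
    for i, test in enumerate(hospitals):
        # Assumption: the validation hospital is the next one in cyclic order.
        validation = hospitals[(i + 1) % n]
        train = [h for h in hospitals if h not in (test, validation)]
        yield train, [validation], [test]


def k_fold(sample_ids, k=5):
    """Protocol (ii): yield (train, validation) id lists; shuffle the ids beforehand."""
    folds = [sample_ids[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


hospitals = ["A", "B", "C", "D"]
splits = list(leave_one_hospital_out(hospitals))
for train, validation, test in splits:
    print("train:", train, "validation:", validation, "test:", test)

folds = list(k_fold(list(range(10)), k=5))
print(len(folds), folds[0])
```

Splitting at the hospital level keeps all scans from one site in the same fold, which is what allows the test fold to probe generalization to an unseen acquisition site.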

Preoperative Clinical Reporting
For the generation of standardized preoperative clinical reports in a reproducible fashion, the computation of tumor characteristics was performed after alignment to a standard reference space. As described in depth in our previous study (13), the reference space was constituted by the symmetric Montreal Neurological Institute ICBM2009a atlas (MNI) (45). Since the atlas space does not provide a brain average for the FLAIR sequence, the T1 atlas file was used for all tumor types.
For each tumor type, the collection of features includes volume, laterality, multifocality, cortical structure location profile, and subcortical structure location profile. Resectability features are specifically tailored for glioblastomas and are, therefore, not available for the other brain tumor types.
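To make two of these features concrete, the sketch below computes a tumor volume in milliliters and a simple laterality split from a binary mask. It is a simplified illustration and not the feature extraction code of the software: the function names, the assumption that the first array axis runs left to right, and the use of a single midline index are all hypothetical.

```python
import numpy as np


def tumor_volume_ml(mask, spacing_mm):
    """Volume in milliliters: voxel count times voxel volume (1 ml = 1000 mm^3)."""
    voxel_volume_mm3 = float(np.prod(spacing_mm))
    return np.count_nonzero(mask) * voxel_volume_mm3 / 1000.0


def laterality(mask, midline_index):
    """Percentage of tumor voxels on each side of a sagittal midline plane.

    Assumption: the first array axis runs left-right and `midline_index`
    is the first index belonging to the second half.
    """
    total = np.count_nonzero(mask)
    if total == 0:
        return 0.0, 0.0
    first_half = np.count_nonzero(mask[:midline_index])
    return 100.0 * first_half / total, 100.0 * (total - first_half) / total


# Toy mask: 100 tumor voxels, one quarter of them before the midline.
mask = np.zeros((20, 20, 20), dtype=np.uint8)
mask[9:13, 5:10, 5:10] = 1
volume_ml = tumor_volume_ml(mask, spacing_mm=(1.0, 1.0, 1.0))
side_a, side_b = laterality(mask, midline_index=10)
print(volume_ml, side_a, side_b)
```

Computing such features after registration to the atlas space is what makes them comparable across patients.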

Proposed Software
In order to make our models and tumor features easily available to the community, we have developed two software solutions. The first one is a stand-alone software called Raidionics, and the second one is a plugin to 3D Slicer given its predominant and widespread use in the field (46). Both solutions provide access to a similar back-end including inference and processing code. However, the GUI and intended user interactions differ. The trained models are stored in a separate online location and are downloaded to the user's computer at runtime. Models can be improved over time and a change will be automatically detected, resulting in the replacement of outdated models on the user's machine. Raidionics can be seen as an improved solution to our initial GSI-RADS software, covering not only glioblastomas but all major brain tumor types, offering the option to compute a similar standardized report, and providing a refined graphical user interface enabling the user to visually assess the results.

Stand-Alone Solution: Raidionics
The software proposes two modes: (i) single-use where only one patient is to be processed and results can be visually assessed in the 2D viewer, and (ii) batch-mode whereby a collection of patients can be processed sequentially without any viewing possibility. In each mode, the option is left to the user to solely perform tumor segmentation or to compute the whole set of tumor characteristics and generate the standardized report. For each patient, the software expects an MRI scan as input (i.e., Gd-enhanced T1-weighted or FLAIR sequence) and the tumor type must be manually selected. Additionally, a preexisting tumor segmentation mask can be provided to bypass the automatic segmentation, if collecting the tumor characteristics is the main interest and manual annotations have been performed beforehand. The total set of processed files saved on disk includes the standardized reports, brain and tumor segmentation masks in both patient and MNI space, cortical and subcortical structures masks in both patient and MNI space, and the registration files to navigate from patient to MNI space. To complement the reporting and give the possibility for follow-up statistical studies, the complete set of computed features is also provided in comma separated value format (i.e., .csv).
The software has been developed in Python 3.6.9, using PySide2 v5.15.2 for the graphical user interface, and only uses the Central Processing Unit (CPU) for the various computations. The software has been tested and is compatible with Windows (≥ 10), macOS (≥ Catalina 10.15), and Ubuntu Linux (≥ 18.04). An illustration of the software is provided in Figure 2. Cross-platform installers and source code are freely available at https://github.com/dbouget/Raidionics.

3D Slicer Plugin: Raidionics-Slicer
The 3D Slicer plugin has been developed using the DeepInfer plugin as baseline (47) and is mostly intended for tumor segmentation purposes. Through a slider, the possibility is provided to manually alter the probability threshold cutoff in order to refine the proposed binary mask. Further manual editing can be performed thereafter using the existing 3D Slicer functionalities. The back-end processing code has been bundled into a Docker image for convenience, and therefore, administrator rights are required for the end-user to perform the installation locally. The same inputs, behavior, and outputs can be expected as for the stand-alone software.
The GitHub repository for the 3D Slicer plugin can be found at https://github.com/dbouget/Raidionics-Slicer, and an illustration is provided in Figure 3.

VALIDATION STUDIES
In the validation studies, only the automatic segmentation performances are assessed. The clinical validity and relevance of the extracted tumor features have been addressed thoroughly in a previous study (13). To better grasp the different aspects of the segmentation performance, a wider set of metrics is studied as described in Section 4.1. For the voxel-wise segmentation task, only two classes are considered as the whole tumor extent (including contrast-enhancing regions, cysts, and necrosis) is the target: non-tumor tissue or tumor tissue. In that sense, a positive voxel is a voxel exhibiting tumor tissue, whereas a negative voxel is a voxel exhibiting background or normal tissue.

Metrics
Following a review of metrics for evaluating 3D medical image segmentation (36), a broad spectrum of 25 metrics was selected, computed either voxel-wise or instance-wise, and grouped according to the following categories: overlap-based, volume-based, information theory-based, probabilistic, and spatial distance-based.

FIGURE 2 | Illustration of the Raidionics software after generating the standardized report for a patient suffering from glioblastoma. The left side presents the tumor characteristics belonging to the report, whereas the right side offers a simplistic view.

Voxel-Wise:
For quantifying semantic segmentation performance, we have selected the following metrics computed directly and indiscriminately over all voxels of a given patient MRI volume:
1. Overlap-based: (i) True Positive Rate (TPR), also called recall or sensitivity, is the probability that an actual positive voxel will test positive; (ii) True Negative Rate (TNR), also called specificity, is the probability that an actual negative voxel will test negative; (iii) False Positive Rate (FPR), is the probability that a false alarm will be raised (i.e., a negative voxel will test positive); (iv) False Negative Rate (FNR), also called miss rate, is the probability that a true positive voxel will test negative; (v) Positive Predictive Value (PPV), also referred to as precision, is the ratio of truly positive voxels over all voxels which tested positive; (vi) Dice score (Dice), also called the overlap index and gauging the similarity of two samples, is the most commonly used metric in validating medical volume segmentation ( [...]
[...] (v) Probabilistic Distance (PBD), as a measure of the distance between fuzzy segmentations (57).
5. Spatial-distance-based: (i) 95th percentile Hausdorff distance (HD95), measuring the boundary delineation quality (i.e., contours). The 95% version is used to make measurements more robust to small outliers (58); (ii) the Mahalanobis distance (MHD), measuring the correlation of all points and calculated according to the variant described for the validation of image segmentation (59); (iii) Average Symmetric Surface Distance (ASSD), as the average symmetric surface distance between the binary objects in two images.
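The overlap-based metrics listed under item 1 all derive from the four voxel counts of a binary confusion matrix. The sketch below shows one way to compute them with NumPy. The function name is hypothetical, and returning NaN for an undefined ratio is an assumption, not necessarily the convention used in the study.

```python
import numpy as np


def overlap_metrics(ground_truth, prediction):
    """Compute TPR, TNR, FPR, FNR, PPV, and Dice from two binary masks."""
    gt = np.asarray(ground_truth, dtype=bool)
    pred = np.asarray(prediction, dtype=bool)
    tp = np.count_nonzero(gt & pred)    # tumor voxels correctly detected
    tn = np.count_nonzero(~gt & ~pred)  # non-tumor voxels correctly rejected
    fp = np.count_nonzero(~gt & pred)   # false alarms
    fn = np.count_nonzero(gt & ~pred)   # missed tumor voxels

    def ratio(numerator, denominator):
        # Assumption: undefined ratios are reported as NaN.
        return numerator / denominator if denominator > 0 else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, tn + fp),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "PPV": ratio(tp, tp + fp),
        "Dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy example with 8 voxels: TP = 2, FN = 1, FP = 1, TN = 4.
gt = np.array([1, 1, 1, 0, 0, 0, 0, 0])
pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])
metrics = overlap_metrics(gt, pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because tumors occupy a small fraction of the volume, TN dominates the counts, which is why TNR and FPR tend to saturate while TPR, PPV, and Dice remain informative.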

Instance-Wise:
For quantifying instance detection performance, we chose the following metrics, reported in a patient-wise fashion (PW) or an object-wise fashion (OW). In the latter, and in case of multifocal tumors, each focus is considered as a separate tumor.
The detection threshold has been set to 0.1% Dice to determine whether an automatic segmentation is eligible to be considered as a true detection or a false positive.

Measurements
Pooled estimates, computed from each fold's results, are reported for each measurement (62). Overall, measurements are reported as mean and SD (indicated by ±) in the tables. Voxel-wise: For semantic segmentation performance, the Dice score is computed between the ground truth volume and a binary representation of the probability map generated by a trained model. The binary representation is computed for ten different equally-spaced probability thresholds (PT) in the range [0, 1].
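The thresholding step can be sketched as follows. The exact threshold values are an assumption (0.1, 0.2, ..., 1.0), since the text only states that ten equally-spaced values in [0, 1] were used. The comparison with `>=` and the handling of two empty masks are also choices made for this illustration.

```python
import numpy as np


def dice_score(gt, pred):
    """Dice between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denominator = gt.sum() + pred.sum()
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * np.logical_and(gt, pred).sum() / denominator


# Assumed thresholds: 0.1, 0.2, ..., 1.0.
thresholds = np.linspace(0.1, 1.0, 10)


def dice_per_threshold(gt, probability_map):
    """Binarize the probability map at each threshold and compute Dice."""
    return {
        round(float(t), 1): dice_score(gt, probability_map >= t)
        for t in thresholds
    }


# Toy example with 5 voxels, the first three being tumor.
gt = np.array([1, 1, 1, 0, 0])
probability_map = np.array([0.95, 0.65, 0.35, 0.25, 0.05])
scores = dice_per_threshold(gt, probability_map)
for threshold, score in scores.items():
    print(f"PT={threshold:.1f}: Dice={score:.3f}")
```

Sweeping the threshold in this way makes it possible to pick, for each model, the operating point with the best trade-off between missed and spurious voxels.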
Instance-wise: For instance detection performance, a connected components approach coupled with a pairing strategy was employed to associate ground truth and detected tumor parts. A minimum size threshold of 50 voxels was set and objects below that limit were discarded. A detection was deemed true positive for any Dice score strictly higher than 0%.
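A simplified version of this procedure is sketched below using `scipy.ndimage.label` for the connected components. The pairing rule is a stand-in: an object counts as detected as soon as it overlaps any object of the other mask (Dice strictly above 0), whereas the actual pairing strategy is not detailed in the text. Function names are hypothetical.

```python
import numpy as np
from scipy import ndimage

MIN_OBJECT_SIZE = 50  # voxels, as stated in the text


def extract_objects(mask, min_size=MIN_OBJECT_SIZE):
    """Label connected components and discard those below min_size voxels."""
    labels, count = ndimage.label(mask)
    objects = []
    for idx in range(1, count + 1):
        component = labels == idx
        if component.sum() >= min_size:
            objects.append(component)
    return objects


def dice(a, b):
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


def object_wise_detection(gt_mask, pred_mask):
    """Return (recall, precision), an object being found if its Dice is > 0."""
    gt_objects = extract_objects(gt_mask)
    pred_objects = extract_objects(pred_mask)
    found = sum(any(dice(g, p) > 0 for p in pred_objects) for g in gt_objects)
    correct = sum(any(dice(g, p) > 0 for g in gt_objects) for p in pred_objects)
    recall = found / len(gt_objects) if gt_objects else float("nan")
    precision = correct / len(pred_objects) if pred_objects else float("nan")
    return recall, precision


# Toy example: two ground truth foci, one of them missed, plus one false positive.
gt_mask = np.zeros((30, 30, 30), dtype=np.uint8)
gt_mask[2:8, 2:8, 2:8] = 1           # main tumor, 216 voxels
gt_mask[20:25, 20:25, 20:25] = 1     # satellite, 125 voxels (missed below)
pred_mask = np.zeros_like(gt_mask)
pred_mask[3:9, 3:9, 3:9] = 1         # overlaps the main tumor
pred_mask[12:16, 12:16, 12:16] = 1   # false positive, 64 voxels
pred_mask[0:2, 28:30, 28:30] = 1     # 8 voxels, below the size limit
recall, precision = object_wise_detection(gt_mask, pred_mask)
print(recall, precision)
```

Patient-wise figures follow the same logic with the whole mask of a patient treated as a single object instead of one object per focus.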

Experiments
To validate the trained models, the following set of experiments was conducted:
1. Overall performance study: k-fold cross-validation studies for the different tumor types for assessing segmentation performance. For easy interpretation, only Dice scores together with patient-wise and object-wise recall, precision, and F1-score values are reported.
2. Metrics analysis: in-depth performance comparison using the additional metrics, and confusion matrix computation between the metrics to identify redundancy in their use.
3. Representative models selection: identification of one final segmentation model for each tumor type, which will be made available for use in our software solutions.
4. Speed study: computation of the pure inference speed and the total elapsed time required to generate predictions for a new patient, obtained with CPU support and reported in seconds. The operations required to prepare the data to be sent through the network, initialize the environment, load the trained model, and reconstruct the probability map in the referential space of the original volume are accounted for. The experiment was repeated ten consecutive times over the same MRI volume for each model, using a representative sample of each dataset in terms of dimension and spacing.
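The timing protocol of the speed study can be sketched with the standard library alone. The `placeholder_pipeline` function below is a hypothetical stand-in for the real chain of preprocessing, model loading, inference, and reconstruction; only the ten repetitions come from the text.

```python
import statistics
import time


def time_repeated(run_once, repetitions=10):
    """Return (mean, standard deviation) of wall-clock seconds over the repetitions."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def placeholder_pipeline():
    """Hypothetical stand-in for preprocessing, model loading, inference, and reconstruction."""
    sum(i * i for i in range(100_000))


mean_s, std_s = time_repeated(placeholder_pipeline)
print(f"{mean_s:.4f} ± {std_s:.4f} s")
```

Timing the whole call rather than the forward pass alone matches the stated intent of reporting the total elapsed time experienced by a user.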

Implementation Details
Results were obtained using a computer with the following specifications: Intel Core Processor (Broadwell, no TSX, IBRS) CPU with 16 cores, 64GB of RAM, Tesla V100S (32GB) dedicated GPU and a regular hard drive. Training and inference processes were implemented in Python 3.6 using TensorFlow v1.13.1, and the data augmentation was performed using the Imgaug Python library (63). The metrics were for the most part computed manually using the equations described in the Supplementary Material, or alternatively using the sklearn v0.24.2 (64) and medpy v0.4.0 (65) Python libraries. The source code used for computing the metrics and performing the validation studies is made publicly available at https://github.com/dbouget/validation_metrics_computation.

Overall Performance Study
An overall summary of brain tumor segmentation performance for all four tumor subtypes is presented in Table 3. Meningiomas and lower grade gliomas appear more difficult to segment given average Dice scores of 75%, compared to average Dice scores of 85% for glioblastomas and metastases. A similar trend, yet with a slightly smaller gap, can be noted for the Dice-TP scores ranging between 81 and 90% with a standard deviation of approximately 15%, indicating the quality and relative stability of the trained models. From a patient-wise perspective, those results demonstrate the difficulty of achieving good recall while keeping the precision steadily above 95%. Even though a direct comparison to the literature is impossible since different datasets have been used, the obtained performance is on par with, if not better than, previously reported performances where Dice scores have been ranging from 75 to 85%. Regarding the lower grade glioma tumor subtype, the diffuse nature of the tumors and less pronounced gradients over image intensities are possible explanations for the lower segmentation performance. For the meningioma category, the reason for the lower Dice score and recall values can be attributed to the larger number of small tumors (< 2 ml) compared to other subtypes. In addition, outliers have been identified in this dataset whereby a small extent of the tumors were either partly enhanced [...] Figure 4 for 10 equally-sized bins. For meningiomas, four bins are necessary to group tumors with a volume of up to 4 ml while only one bin is necessary for the glioblastomas, indicating a volume distribution imbalance between the two types. The diamond-shaped points outside the boxes represent cases where the segmentation model did not perform well (cf. Supplementary Figures S1-S4).
While tumor volumes and outlier MR scans are reasons for the discrepancy in Dice and recall values across the board, precision is rather unaffected and more stable. The nature of the convolutional neural network architecture and training strategy used can explain those results. By leveraging volumes covering the full brain, global relationships can be learned by the trained model hence reducing the confusion between tumor regions and other contrast-enhancing structures such as blood vessels. Given GPU memory limitation, the preprocessed MR scans have undergone significant downsampling, and as such small tumors are reduced to very few voxels, impacting mainly recall performance.
Finally, an average decrease of ∼ 10% can be noticed between patient-wise and object-wise detection metrics, whereby satellite tumors are on average an order of magnitude smaller than the main tumor, and are hence more prone to be omitted or poorly segmented by our models. Segmentation performance is illustrated in Figure 5. Each row corresponds to one tumor type and each column depicts a different patient.

Metrics Analysis
Side-by-side voxel-wise performances regarding the overlap-based metrics are reported in Tables 4 and 5. Unsurprisingly, given the good precision performance and the absence of patients without a tumor, both TNR and its opposite FPR scores are almost perfect for all tumor types. Similarly, the TPR and its opposite FNR metrics are scoring similarly to Dice. Within each tumor category, the overlap-based metrics are following the same trend whereby a higher average Dice score would correlate with a higher score for any other metrics and vice versa (e.g., IoU). An exception can be made regarding the behavior of the GCE metric, scoring on average higher for glioblastomas than for meningiomas and as such not following the same pattern as Dice. Upon careful visual inspection, the GCE metric seems to be extremely sensitive to outliers, either coming from the image quality or manual ground truth correctness (cf. top row in Supplementary Figures S1-S4). Given the non-normalized state of the GCE metric, and its absence of any upper bound, an extremely poor agreement between manual ground truth and automatic segmentation will result in a score orders of magnitude higher than its average expression over a given dataset. Regarding the two volume-based metrics, featured rightmost in the second table, an antagonistic pattern toward Dice can be observed. The VS metric has the same cross-type trend as Dice with similar yet slightly greater scores. On the other hand, while the RAVD metric scores best over the metastasis group similar to Dice, its worst average value is obtained for the glioblastoma group, hence potentially exhibiting the same frailty toward outliers as for the GCE metric.
Next, voxel-wise performances for information theory-based and probabilistic metrics are grouped in Table 6. The MI and VOI metrics, both based on information theory, exhibit inverse behaviors, in line with the observations about the relationship between the Dice and GCE metrics. The normalized mutual information metric ranges from 0.668 to 0.829 for Dice scores between 75 and 87%, showcasing stability but also correlation. On the contrary, the VOI metric behaves like GCE, whereby the worst performance is obtained for the lower grade glioma and then glioblastoma categories, while it performs best over metastases, where Dice also scores the highest. Like the aforementioned metric groups exhibiting inner discrepancies, three of the five probabilistic metrics follow a similar trend, scoring high alongside Dice, with an average gap of 0.1 corresponding to a ∼10% Dice score difference. Meanwhile, the PBD metric has a behavior of its own, scoring an order of magnitude worse for the meningioma category than for the three other subtypes. The metric is not normalized, and an extremely poor agreement between the manual ground truth and automatic segmentation would result in an extremely large score, similar to the GCE metric; hence, additionally reporting the median score might be of interest (cf. second row in Supplementary Figures S1-S4).
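The inverse behavior of MI and VOI follows from their definitions, as the minimal sketch below illustrates for two flattened binary masks. It assumes the standard entropy-based definitions, with VOI computed as H(T) + H(P) - 2·MI, and one possible normalization of MI (twice MI divided by the sum of the marginal entropies). The normalization actually used in this study is not restated here, so that choice, like the function names, is an assumption.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def _entropy(counts: Iterable[int], total: int) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Mutual information and variation of information between two label maps."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("masks must be non-empty and of equal length")
    n = len(truth)

    h_truth = _entropy(Counter(truth).values(), n)
    h_pred = _entropy(Counter(pred).values(), n)
    h_joint = _entropy(Counter(zip(truth, pred)).values(), n)

    mi = h_truth + h_pred - h_joint   # shared information: higher is better
    voi = h_truth + h_pred - 2 * mi   # unshared information: lower is better

    # Assumed normalization; two constant masks are treated as full agreement.
    marginal_sum = h_truth + h_pred
    nmi = 2 * mi / marginal_sum if marginal_sum > 0 else 1.0
    return {"mi": mi, "nmi": nmi, "voi": voi}
```

Since VOI subtracts twice the mutual information from the marginal entropies, any gain in MI for fixed marginals lowers VOI, which is why the two metrics rank the tumor types in opposite directions.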
Finally, the voxel-wise distance-based metrics are reported in Table 7. Similar cross-type trends can also be noted, whereby the best HD95 of 4.97 mm is obtained for the glioblastoma category and the worst HD95 of 10 mm for meningiomas, heavily correlated to Dice performance. Our average HD95 results appear lower than previously reported results in the literature; however, a strong statement can hardly be made, as the tumors featured can vary highly in terms of volume and number of satellites, which might reflect greatly on the metrics' average scores. The other two spatial distance-based metrics display similar behavior to HD95, whereby tumor types can be ranked as follows from best to worst performance: glioblastoma, metastasis, lower grade glioma, and meningioma.
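As an illustration of why the 95th percentile variant of the Hausdorff distance is preferred over the maximum, the sketch below computes both from two sets of surface points by brute force. Several details are assumptions made for the example: surfaces are given as lists of voxel indices, the symmetric distance is taken as the larger of the two directed percentiles (some implementations pool both directions instead), and the percentile uses the nearest-rank rule without interpolation. The quadratic cost makes it suitable for small examples only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _to_mm(points: Sequence[Sequence[float]], spacing: Sequence[float]) -> List[Point]:
    """Scale voxel indices by the voxel spacing to obtain millimetre coordinates."""
    return [tuple(c * s for c, s in zip(p, spacing)) for p in points]


def _directed_distances(src: List[Point], dst: List[Point]) -> List[float]:
    """Distance from every point in src to its nearest neighbour in dst."""
    return [min(math.dist(a, b) for b in dst) for a in src]


def _percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hausdorff(
    surface_a: Sequence[Sequence[float]],
    surface_b: Sequence[Sequence[float]],
    spacing: Sequence[float] = (1.0, 1.0, 1.0),
    q: float = 95.0,
) -> float:
    """Symmetric q-th percentile Hausdorff distance; q=100 gives the classic maximum."""
    if not surface_a or not surface_b:
        raise ValueError("both surfaces must contain at least one point")
    a, b = _to_mm(surface_a, spacing), _to_mm(surface_b, spacing)
    return max(
        _percentile(_directed_distances(a, b), q),
        _percentile(_directed_distances(b, a), q),
    )
```

With two otherwise identical surfaces where one contains a single far-away point, the maximum distance is dominated by that point whereas the 95th percentile ignores it. The same mechanism also means that small satellites can be hidden by HD95, which is one reason to report instance-wise metrics alongside it.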
Regarding instance-wise metrics, grouped in Table 8, the close OASSD average values between glioblastomas and meningiomas represent the most surprising outcome given the 5% difference in F1-score. Unsurprisingly, the lower grade glioma category achieves the highest average OASSD with 2.6 mm together with the lowest F1-score. As one might expect, the amount of FPPP correlates greatly with the average precision values obtained. Ultimately, the ARI metric generates scores extremely similar to voxel-wise Dice and correlates highly with the F1-score, whereby the glioblastoma and metastasis categories obtain almost 0.1 more than the meningioma and lower grade glioma subtypes.
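The link between FPPP and precision can be seen directly from how both are derived from the same object counts. The sketch below is a minimal illustration assuming that each predicted object has already been matched (or not) to a ground truth object; the matching rule itself is not modeled, and the `PatientCounts` container and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class PatientCounts:
    """Object-level detection counts for one patient (hypothetical container)."""
    true_positives: int
    false_positives: int
    false_negatives: int


def instance_wise_summary(patients: Sequence[PatientCounts]) -> Dict[str, float]:
    """Pooled instance-wise recall, precision, F1-score, and false positives per patient."""
    if not patients:
        raise ValueError("at least one patient is required")
    tp = sum(p.true_positives for p in patients)
    fp = sum(p.false_positives for p in patients)
    fn = sum(p.false_negatives for p in patients)
    if tp == 0:
        raise ValueError("no detected object: recall and precision are undefined or zero")

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "fppp": fp / len(patients),  # false positives per patient
    }
```

Both precision and FPPP depend on the same false positive count, normalized by the number of detections in one case and by the number of patients in the other, so a strong association between the two is expected.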
For completeness, the correlation between the different metrics computed in this study has been assessed, and the results over the glioblastoma category are shown in Table 9 (cf. other correlation matrices in Supplementary Tables S2, S4, S6, S8). Some metrics have been excluded given the inherent correlation from their computation, such as FPR and FNR being the complements of TNR and TPR. Similarly, metrics computed in a voxel-wise, patient-wise, or instance-wise fashion were not considered in the matrix (i.e., recall, precision, and F1-score). Overall, the conclusions identified by analyzing the raw average results are further confirmed, whereby a majority of voxel-wise metrics correlate with one another and thus do not bring any additional information to Dice. However, relevant insight can be obtained from the RAVD and GCE/VOI metrics given their low correlation to Dice and their higher sensitivity toward outliers, making it possible to quantify the model's ability to generalize or potentially the quality of the data and manual ground truth (cf. third row in Supplementary Figures S1-S4). The correlation between HD95 and MHD also appears quite low for spatial distance-based metrics, indicating potential usefulness. Finally, in the instance-wise category, the OASSD is a stand-alone metric allowing model performance over the detection of satellite tumors to be properly assessed. To conclude, the final pool of metrics to consider for benchmarking purposes, capturing all aspects of the segmentation performances, consists of Dice, RAVD, VOI, HD95, MHD, and OASSD. Given the task, reporting patient-wise and instance-wise recall, precision, and F1-score is always of interest because their meaning is intuitive and easy to interpret for clinicians or other end-users.

FIGURE 5 | Examples of segmentation performances. One row illustrates one tumor type: glioblastoma, lower grade glioma, meningioma, metastasis (from top to bottom), and each column depicts a different patient. The manual delineation is shown in red, the automatic segmentation in blue, and the patient-wise Dice score in white.
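A correlation analysis of this kind can be sketched as follows, taking per-patient scores for each metric and building a pairwise matrix. The sketch assumes Pearson correlation, since the coefficient used in this study is not restated here, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (norm_x * norm_y)


def correlation_matrix(scores: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations between metrics, each given as per-patient scores."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}
```

A metric whose row shows coefficients close to 1 or -1 against Dice adds little information, whereas a metric with coefficients near 0 is a candidate for the benchmarking pool. Complementary pairs such as TPR and FNR yield exactly -1, which is why they can be excluded beforehand.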

Representative Models Selection
Only one model can be provided in the software solutions for each tumor type, and the best model selection was done empirically according to the following criteria: the size of the validation or test set, the average Dice score, and the patient-wise F1-score. The selected models are the following: the model trained for fold 0 for the glioblastomas, the model trained for fold 3 for the lower grade gliomas, the model trained for fold 2 for the meningiomas, and the model trained for fold 2 for the metastases.

Speed Study
A comparison in processing speed regarding pure tumor segmentation and complete generation of standardized reports is provided in Table 10 when using the Raidionics software with CPU support. The high-end computer is the computer used for training the models, whereas the mid-end computer is a Windows laptop with an Intel Core Processor (i7@1.9GHz), and 16GB of RAM.
For the smallest MRI volumes on average, 17 s are needed to perform tumor segmentation whereas 4.5 min are required to generate the complete standardized report with the high-end computer. Unsurprisingly, the larger the MRI volume the more time required to perform the different processing operations (cf. Supplementary Section S3). For the largest MRI volumes overall, 54 s are needed to perform tumor segmentation whereas 15.1 min are required to generate the complete standardized report. When using the mid-end laptop, overall runtime is increased by 1.5 times for the different MRI volume sizes. On average, 9 min are necessary to generate the standardized report for MRI volumes of reasonable quality.

DISCUSSION
In this study, we have investigated the segmentation of a range of common main brain tumor types in 3D preoperative MR scans using a variant of the Attention U-Net architecture. We have conducted experiments to assess the performances of each trained model using close to 30 metrics and developed two software solutions for end-users to freely benefit from our segmentation models and standardized clinical reports. The main contributions are the high performances of the models, on par with performances reported in the literature for the glioblastomas, with illustrated robustness and ability to generalize due to the multiple and widespread data sources. In addition, the two proposed open-access and open-source software solutions include our best models, together with a RADS for computing tumor characteristics. This is the first open RADS solution that supports all major brain tumor types. The software is user-friendly, requiring only a few clicks and no programming to use, making it easily accessible for clinicians. The overall limitations are those already known for deep learning approaches, whereby a larger number of patients or data sources would improve the ability to generalize, boost segmentation performances, and increase the immunity toward rare tumor expressions. The employed architecture also struggles with smaller tumors given the large downsampling needed to feed the entire 3D MR scan into the network, hence the need for a better design combining local and global features either through multiple steps or ensembling. The architecture and training strategy used in this study were identical to our previously published work, considering that the intent was not to directly make advances on the segmentation task. Nevertheless, the stability and robustness to train efficient models had been documented, alongside performance comparison to another well-known architecture [e.g., nnU-Net (20)], thus not precluding its use to train models for other brain tumor types. Aside from evident outliers in the datasets, where either tumors with partial or missing contrast uptake or suboptimal manual annotations were identified, the major pitfall of using the AGU-Net architecture lies in its struggle to segment small tumor pieces, with a volume below 2 ml, equally satisfactorily. Overall, the glioblastoma model is expected to be the most robust and able to generalize since patient data from 15 different sources were used. For other models trained on data from far fewer hospitals, with an expected limited variability in MR scan quality, robustness is likely to be inferior. While larger datasets are often correlated with improved segmentation performance, the metastasis model is the best performing despite including the fewest patients. The relative easiness of the task, owing to a clear demarcation of the tumor from surrounding normal tissue in contrast-enhanced T1-weighted volumes, and the potentially low variance in tumor characteristics, with patient data coming from two hospitals only, can explain the results. Additionally, the metastasis model has been trained by transfer learning using as input the second best performing glioblastoma model, where the most data was used, which may have been the decisive factor. Lower-grade gliomas represent the most difficult type to manually segment since tumors are diffuse and infiltrating, with an average volume in FLAIR sequences much higher than in T1 sequences for the other tumor types, and as such overall worse performances were expected.

Correlation matrix legend: the color intensity of each cell represents the strength of the correlation, where blue denotes direct correlation and red denotes inverse correlation.
The in-depth assessment of a larger pool of metrics allowed us to identify redundancy and uniqueness and showed that the Dice score is overall quite robust and indicative of expected performance. However, the sole use of the Dice score cannot cover all aspects of model performance, and spatial distance-based metrics (e.g., HD95 and MHD) are suggested to be used in conjunction, as they provide values uncorrelated to Dice. In addition, some metrics were identified to be more sensitive to outliers and are as such powerful to either assess the ability of a model to generalize across data acquired on different scanners from multiple sources or quickly identify potential issues in a large body of data. Finally, depending on the nature of the patients included in one's study and the number of satellite tumors, specific object-wise metrics are imperative to use (e.g., OASSD). Only a combination of various metrics computed either voxel-wise, patient-wise, or instance-wise can give the full picture of a model's performance. Unfortunately, interpreting and comparing sets of metrics can prove challenging, and as such further investigations regarding their merging into a unique informative and coherent score are fundamental [e.g., Roza (66)]. Furthermore, an inadequacy lies in the nature of the different metrics, whereby some can be computed across all segmentations generated by a trained model, whereas others are exclusively eligible on true positive cases, i.e., when the model has correctly segmented the tumor to some extent. For models generating perfect patient-wise recall, all metrics will be eligible for every segmentation. However, in this field of research and as of today, no trained model can fulfill this requirement due to the substantially large inter-patient variability. Ideally, the identification of relevant metrics, bringing unique information for interpreting the results, should not be confined to the validation studies. More metrics should be considered as part of the loss function computation during the training of neural network architectures. Attempts have been made toward using the Hausdorff distance as a loss function, but a direct minimization is challenging from an optimization viewpoint. For example, approximations of the Hausdorff distance based on distance transforms, on morphological operations, or with circular and spherical kernels showed potential for medical image segmentation (67). In general, a careful mix between losses (e.g., Dice, cross-entropy, and HD95) is challenging to achieve, and adaptive strategies might be required to avoid reaching a local minimum where overall segmentation performance may suffer (68).
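The idea of mixing losses can be illustrated with a minimal sketch combining a soft Dice loss and a binary cross-entropy through a fixed weight. This is not the training loss used in this study: the function names, the fixed `dice_weight`, and the smoothing constants are assumptions, and a Hausdorff-based term is deliberately left out because it requires one of the approximations mentioned above.

```python
import math
from typing import Sequence


def soft_dice_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-6) -> float:
    """One minus the soft Dice overlap between predicted probabilities and labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean voxel-wise binary cross-entropy, with probabilities clipped for stability."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def combined_loss(probs: Sequence[float], targets: Sequence[int], dice_weight: float = 0.5) -> float:
    """Convex combination of soft Dice loss and binary cross-entropy."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and of equal length")
    if not 0.0 <= dice_weight <= 1.0:
        raise ValueError("dice_weight must lie in [0, 1]")
    return (dice_weight * soft_dice_loss(probs, targets)
            + (1.0 - dice_weight) * binary_cross_entropy(probs, targets))
```

The two terms live on different scales (the Dice loss is bounded by 1, the cross-entropy is not), which is one reason a fixed weight can be difficult to tune; an adaptive strategy would adjust `dice_weight` during training.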
As a current trend in the community, inference code and trained segmentation models are often at best available on GitHub repositories. As a consequence, only engineers, or people with some knowledge of machine learning and programming, can benefit from such research advances. Besides, the research focus is heavily angled toward gliomas, due to the influence of the BraTS challenge, whereby segmentation models for gliomas are expected to yield superior performance compared with those for meningiomas and metastases. By developing and giving free and unrestricted access to our two proposed software solutions, we hope to facilitate more research on all brain tumor types. Willing research institutes have the opportunity to generate private annotated datasets at a faster pace than through fully manual labor by exploiting our trained models. Having made all source code available on GitHub, as customarily done, we made the additional effort of building stand-alone solutions with easy-to-use GUIs. Hopefully, clinicians and other non-programming end-users will feel more comfortable manipulating such tools, which are available across the three major operating systems and require only a computer with average hardware specifications. For the generation of standardized clinical reports, the computation of tumor characteristics relies heavily on the quality of the automatic segmentation; occasional mishaps are expected, as models are not perfect and can omit the tumor. Therefore, manual inputs will be required sporadically to correct the tumor segmentation. Over time, new and better models will be generated and made available seamlessly in the two software solutions through regular updates. For the time being, support is provided for AGU-Net models only, due to its lighter codebase compared to nnU-Net, for similar overall performances. From a software bundling and deployment perspective, integrating a heavier inference framework and mixing backend engines (i.e., TensorFlow and Torch) would make it more challenging to create stable executables for Raidionics on Mac, Windows, and Ubuntu. Support for other architectures will be considered if new models clearly outperform the current models.
In the future, an approach incorporating a set of metrics and converting them into one final score would be highly desirable (e.g., Roza). Not only would it help to automatically select the best model from a k-fold validation study based on one unique score, but it would also enable a proper assessment and ranking across multiple methods. With all preoperative brain tumor types available for segmentation and reporting in our software, a key missing component is automatic tumor type classification to supplement manual user input. Concurrently, the variety and number of tumor characteristics to compute should be extended, considering more type-specific features similar to the resection index for glioblastomas. Alternatively, bringing a similar focus to post-operative segmentation of residual tumors is of great interest, both to assess the quality of the surgery and to refine the estimated patient outcome. The generation of a complete post-operative standardized clinical report would also become possible, with new features such as the extent of resection. Otherwise, intensifying the gathering of patient data from more widespread hospital centers and a larger array of MRI scanners is always of importance. The inclusion of more than one MR sequence per patient as segmentation input has the potential to boost overall performance, but at the same time might reduce the models' applicability, as multiple sequences are not routinely available across all centers worldwide.

CONCLUSION
Efficient and robust segmentation models have been trained on preoperative MR scans for the four main brain tumor types: glioblastoma, lower grade glioma, meningioma, and metastasis. In-depth performance assessment made it possible to identify the most relevant metrics from a large panel, computed either voxel-wise, patient-wise, or instance-wise. Trained models and standardized reporting have been made publicly available and packaged into stand-alone software and a 3D Slicer plugin to enable effortless widespread use.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because access to them is restricted under the General Data Protection Regulation (GDPR).
Requests to access the datasets should be directed to DB, david.bouget@sintef.no.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

IR, OS, PD, KE, and AJ: funding acquisition. AJ, KE, VK, IK, DB, HA, FB, LB, MB, MC, JF, SH-J, AI, BK, AK, EM, DM, PR, MR, TS, WV [...] were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.