External validation of a convolutional neural network for the automatic segmentation of intraprostatic tumor lesions on 68Ga-PSMA PET images

Introduction: State-of-the-art artificial intelligence (AI) models have the potential to become a “one-stop shop” to improve diagnosis and prognosis in several oncological settings. The external validation of AI models on independent cohorts is essential to evaluate their generalization ability, and hence their potential utility in clinical practice. In this study, we tested a recently proposed state-of-the-art convolutional neural network for the automatic segmentation of intraprostatic cancer lesions on PSMA PET images on a large, separate cohort.
Methods: Eighty-five patients with biopsy-proven prostate cancer who underwent 68Ga-PSMA PET for staging purposes were enrolled in this study. Images were acquired with either fully hybrid PET/MRI (N = 46) or PET/CT (N = 39); all participants showed at least one intraprostatic pathological finding on PET images, which was independently segmented by two Nuclear Medicine physicians. The trained model was available at https://gitlab.com/dejankostyszyn/prostate-gtv-segmentation and data processing was performed in agreement with the reference work.
Results: When compared to the manual contouring, the AI model yielded a median Dice score of 0.74, therefore showing a moderately good performance. Results were robust to the modality used to acquire images (PET/CT or PET/MRI) and to the ground truth labels (no significant difference between the model’s performance when compared to reader 1 or reader 2 manual contouring).
Discussion: In conclusion, this AI model could be used to automatically segment intraprostatic cancer lesions for research purposes, for instance to define the volume of interest for radiomics or deep learning analysis. However, more robust performance is needed for the generation of AI-based decision support technologies to be proposed in clinical practice.

Introduction
Prostate cancer (PCa) is the second most common cancer in men, with 1,414,259 new cases in 2020, accounting for 15.1% of all cancer diagnoses within the male population (1). Although histopathological examination of prostate biopsy cores is required for the diagnosis of PCa, imaging is pivotal to characterize the disease (2). Multiparametric (mp)-MRI has been used for years in clinical practice to guide biopsy and to drive the clinical management of PCa patients (2).
PSMA PET has been recently added to the EAU-ESTRO-SIOG guidelines for staging high-risk PCa (2) in view of its higher sensitivity compared to mp-MRI (3,4). Therefore, a possible next step will be to use PSMA PET to diagnose clinically significant PCa (5-8) and to perform quantitative analysis that might allow for a better and more objective characterization of the disease (9-11).
Accurate contouring of intraprostatic gross tumor volume (GTV) is mandatory for an accurate assessment of PCa in several clinical settings, including both biopsy guidance and radiomic feature extraction. However, this procedure is time-consuming and largely affected by the experience of the contouring physicians, often resulting in non-reproducible segmentations (12).
Recently, there has been a surge in the development of artificial intelligence (AI) models in the medical field, with the first tools being already available for use (13,14). Convolutional neural networks (CNN) have been shown to accurately segment medical images (15-17) and hold the potential to improve intraprostatic tumor delineation (18-21). The use of CNN in this setting could improve GTV definition by reducing the inter-reader variability while saving time by automating this task.
Kostyszyn and colleagues were the first to develop a CNN for the automatic segmentation of intraprostatic cancer lesions on PSMA (using both 68Ga- and 18F-PSMA) PET images (18). They used 152 patients examined at two centers (Germany and China) to train their model and a cohort composed of 57 patients to test it. However, only 20 patients in the testing cohort were studied at an external institution (center 3, Germany) not used for training, making it difficult to draw conclusions regarding the model's generalizability.
External validation of AI models on independent cohorts is necessary to assess with certainty their robustness and reproducibility, hence their possible application in clinical practice (22). Therefore, this study aims to evaluate the performance of the CNN for the automatic segmentation of intraprostatic cancer lesions on 68Ga-PSMA PET images that was previously presented in (18) and that is publicly available at https://gitlab.com/dejankostyszyn/prostate-gtv-segmentation.

Patients
All patients with biopsy-proven PCa who underwent 68Ga-PSMA PET at IRCCS San Raffaele Scientific Institute from June 2020 to January 2022 for staging purposes were considered for inclusion. A total of 124 patients were identified. Eligibility criteria were: (1) age greater than 18 years at the time of the PET examination (0 patients excluded), (2) presence of at least one intraprostatic pathological finding at 68Ga-PSMA PET (30 patients excluded), (3) absence of neoadjuvant treatments prior to imaging (9 patients excluded). Eighty-five patients met the inclusion criteria and were included for analysis. See Figure 1 for a flowchart showing the patients' selection process. Prostate specific antigen (PSA) level and the International Society of Urological Pathology (ISUP) grade were collected. This retrospective study was approved by the Institutional Ethics Committee of IRCCS San Raffaele Scientific Institute, and informed consent was waived due to the retrospective nature of the study.
Patients were asked to fast on the day of the 68Ga-PSMA PET/MRI or PET/CT scan.
PET scans were acquired from the skull base to mid-thigh (5-6 FOVs, 4 min/FOV), starting approximately 60 min (mean ± SD, 63 ± 6 min) after injection of 111-273 MBq (mean ± SD, 168 ± 33 MBq) of 68Ga-PSMA. PET images, acquired with either a PET/MRI or a PET/CT scanner, were reconstructed using a fully 3D ordered subset expectation-maximization (OSEM) algorithm with time-of-flight (TOF) and point-spread-function (PSF) modeling. 68Ga-PSMA PET image read-out was performed by two Nuclear Medicine physicians on an Advantage Workstation (AW, General Electric Healthcare, Waukesha, WI, USA), and the presence of increased intraprostatic 68Ga-PSMA uptake was considered positive for malignancy.

Image segmentation
Two Nuclear Medicine physicians manually contoured the GTV on every slice of the 68Ga-PSMA PET images using 3D Slicer (version 4.11.2), with access to all available clinical and imaging information for each patient. The first reader (Exp 1) delineated the GTV using an inverted gray scale for display, windowed with SUVmin-max: 0-5, as previously described in Kostyszyn et al. (18). To ensure that the segmentation approach used in the reference work was not introducing any bias, a second reader (Exp 2) instead contoured images independently without any fixed thresholding of voxel values, blind to any instruction on how images were evaluated in the reference work of Kostyszyn et al. Additionally, two radiologists manually contoured the prostatic gland on CT and MRI scans using 3D Slicer (version 4.11.2). Since it is not always feasible to discriminate between prostatic tissue and bladder signal in 68Ga-PSMA PET images, only contours within the delineated prostatic gland were used for analyses, as described in Kostyszyn et al.
Figure 1. Flowchart illustrating the patients' selection process.

Resampling
To ensure that the CNN's performance in this study was not affected by discrepancies in the methods used as compared to the reference work, resampling and preprocessing of the images were performed exactly as described by Kostyszyn et al. (18).
Specifically, all PET images (nearly raw raster data format, nrrd) were resampled to a standard voxel spacing of 2.0 mm × 2.0 mm × 2.0 mm using SimpleITK (version 1.2.4), since the PET images collected with the PET/MRI scanner had an original voxel size of 3.125 mm × 3.125 mm × 2.780 mm, while the original voxel size of images acquired with the PET/CT scanner was 2.734 mm × 2.734 mm × 3.270 mm. Prostate and GTV segmentations were also resampled to a voxel size of 2.0 mm × 2.0 mm × 2.0 mm. PET volumes were resampled using both tri-linear interpolation and B-spline interpolation, whereas nearest-neighbour interpolation was used to resample segmentation contours. All data were cropped to a size of 64 × 64 × 64 voxels, using the manual contouring of the prostate gland as guidance, and then normalized as $(x_i - \bar{x})/\sigma$, where $x_i$ is the PET data for patient $i$, and $\bar{x}$ and $\sigma$ are the arithmetic mean and the standard deviation calculated over the entire cropped PET training dataset.
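The resampling arithmetic and the normalization described above can be illustrated with a minimal sketch in plain Python. It is not the code used in the study or in the reference repository: the function names `resampled_size` and `zscore` are hypothetical, and the 128 × 128 × 89 matrix size in the usage note is an assumed example, since the original matrix sizes are not reported here. The actual resampling was performed with SimpleITK.

```python
def resampled_size(size, spacing, new_spacing=(2.0, 2.0, 2.0)):
    """Voxels per axis needed to cover the same physical extent
    after changing the voxel spacing (all spacings in mm)."""
    return tuple(
        int(round(n * s / t)) for n, s, t in zip(size, spacing, new_spacing)
    )


def zscore(values, mean, std):
    """Normalize values as (x - mean) / std.

    `mean` and `std` are passed in rather than computed from `values`,
    because the text states they are calculated over the whole cropped
    training dataset, not per patient.
    """
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return [(v - mean) / std for v in values]
```

For instance, under the assumed 128 × 128 × 89 matrix with the PET/MRI spacing of 3.125 mm × 3.125 mm × 2.780 mm, `resampled_size((128, 128, 89), (3.125, 3.125, 2.78))` gives a 200 × 200 × 124 grid at 2.0 mm isotropic spacing, which is then cropped to 64 × 64 × 64 around the prostate.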

Convolutional neural network
The model consists of 3 downsampling steps performed by 2 × 2 × 2 max-pooling along the contracting path, and 3 upsampling steps performed by 2 × 2 × 2 transpose convolutions with padding of 1 and stride of 2 along the expanding path. Skip connections from the contracting path are concatenated with their corresponding up-sampled feature maps. There are 14 3 × 3 × 3 convolutional layers in total, having stride and padding of 1. Each convolution is followed by batch normalization and ReLU activation function. The last layer in the model performs a 1 × 1 × 1 convolution with no padding, followed by batch normalization and sigmoid activation function. The whole script of the trained CNN can be freely downloaded at https://gitlab.com/dejankostyszyn/prostate-gtv-segmentation.
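As a rough illustration of how the spatial size of the feature maps evolves along the contracting path, the sketch below applies the standard output-size formula for convolution and pooling layers, floor((n + 2·padding − kernel) / stride) + 1, along one axis. It is not taken from the reference implementation: the function names are hypothetical, the max-pooling stride of 2 is an assumption (the text gives only the 2 × 2 × 2 window), and only one convolution per level is applied because the 3 × 3 × 3 convolutions with stride and padding of 1 preserve size regardless of how many are stacked. The expanding path is deliberately left out.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def contracting_path_sizes(n=64, steps=3):
    """Edge length of the feature maps after each downsampling step."""
    sizes = [n]
    for _ in range(steps):
        # 3x3x3 convolution, stride 1, padding 1: size is preserved
        n = conv_out(n, kernel=3, stride=1, padding=1)
        # 2x2x2 max-pooling, assumed stride 2: size is halved
        n = conv_out(n, kernel=2, stride=2)
        sizes.append(n)
    return sizes
```

Starting from the 64-voxel edge of the cropped input, `contracting_path_sizes()` returns `[64, 32, 16, 8]`, and `conv_out(64, kernel=1)` confirms that the final 1 × 1 × 1 convolution with no padding leaves the size unchanged.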

Statistical analysis
Statistical analyses were performed with R statistical software (23). Dice score coefficient (DSC) was computed to estimate the performance of the trained CNN (GTV-CNN) presented in Kostyszyn et al. (18). Moreover, DSC was also used to quantitatively assess the agreement between the GTVs manually segmented by the different experts (GTV-Exp 1, GTV-Exp 2). As PET volumes in the dataset have been acquired using two different modalities, PET/MRI and PET/CT, Student's t-test was carried out to determine whether the image modality of acquisition possibly affected the model performance. Student's t-test was also employed to determine whether there was a statistically significant difference in CNN performance across the different GTV-Exp segmentations and to study whether the volume predicted by the CNN was different in size as compared to those manually delineated by experts. Ground truth PCa lesion volumes (GTV-Exp) were correlated with DSC scores using Pearson correlation. Finally, to investigate the impact of different interpolation algorithms, analyses were first conducted on PET images resampled using trilinear interpolation and then on PET volumes resampled with B-spline interpolation. The obtained DSC were compared by means of Student's t-test. P values lower than 0.05 were considered statistically significant.
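For readers unfamiliar with the metric, the DSC between two binary masks A and B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). The following is a minimal plain-Python sketch, not the R code used for the analyses; the function names are hypothetical, masks are assumed to be flattened sequences of 0/1 values on the same 2.0 mm grid, and returning 1.0 when both masks are empty is a convention chosen here for illustration.

```python
def dice(mask_a, mask_b):
    """Dice score between two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * overlap / total


def volume_ml(mask, voxel_mm=(2.0, 2.0, 2.0)):
    """Lesion volume in mL from a binary mask and the voxel size in mm."""
    voxel_mm3 = voxel_mm[0] * voxel_mm[1] * voxel_mm[2]
    return sum(1 for v in mask if v) * voxel_mm3 / 1000.0
```

As a toy example, `dice([1, 1, 0, 0], [1, 0, 1, 0])` is 0.5 (one shared voxel, two voxels in each mask), and a mask of 125 voxels at 2.0 mm isotropic spacing corresponds to `volume_ml([1] * 125)` = 1.0 mL.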

Patients
Eighty-five patients with biopsy-proven PCa were enrolled in this study. The median age was 68 years (range: 45-85 years), whereas the median PSA level was 7.82 ng/ml. Patients' characteristics are reported in Table 1.

CNN performance
Analyses were performed on PET volumes resampled with tri-linear interpolation and then repeated on images resampled using B-spline interpolation. The results based on tri-linear interpolation are reported here, while Supplementary Table 1 contains the results using B-spline interpolation for voxel resampling. The trained CNN, when validated on the lesion volumes manually contoured by the first reader (GTV-Exp 1), reached a median DSC = 0.74 (range: 0.07-0.93). When the ground truth label was drawn without fixed thresholding of voxel values by the second reader (GTV-Exp 2), the CNN obtained a median DSC = 0.69 (range: 0.07-0.96). However, this difference was not statistically significant (P value > 0.05). Using tri-linear or B-spline interpolation did not affect the model's performance (P value > 0.05). See Table 2 for a detailed description of CNN model performance, and Figure 3 for a representative image. To better show the performance of the CNN, additional segmentation results for sequential 68Ga-PSMA PET slices are shown in Figure 4. Moreover, no statistically significant differences were identified between the volumes of the intraprostatic tumor lesions defined by the expert Nuclear Medicine physicians and those predicted by the CNN (P value > 0.05, Table 3). The DSC obtained by comparing the PCa lesion contouring manually defined by the two expert Nuclear Medicine physicians was 0.73 (range: 0.25-0.92).
No statistically significant differences in CNN performance were observed between PET/MRI and PET/CT images, regardless of the method used to visualize and contour PET images (P value > 0.05 for both GTV-Exp 1 and GTV-Exp 2). Conversely, a positive correlation was found between DSC and GTV-Exp volume (r = 0.43, P value < 0.001 and r = 0.44, P value < 0.001 for GTV-Exp 1 and GTV-Exp 2, respectively), meaning that the CNN produced more accurate segmentations for larger lesions.

Discussion
In the present work, an external validation of a CNN for the automatic segmentation of intraprostatic cancer lesions on 68Ga-PSMA PET images previously presented by Kostyszyn and colleagues (18) has been performed. In our cohort, the trained CNN model reached a median DSC = 0.74 and its performance was independent of the imaging technique, PET/MRI or PET/CT, used to acquire PET images. 68Ga-PSMA PET is widely used for the characterization of PCa in different settings and has been recently included in the EAU-ESTRO-SIOG guidelines for high-risk PCa staging (2). Several studies have shown the potential utility of quantitative features extracted from 68Ga-PSMA PET images for the characterization of the disease (9-11). Considering the role of PSMA PET, a possible forthcoming application might be its use in the diagnosis of clinically significant PCa, including biopsy guidance in patients with equivocal mp-MRI findings (6,24). Accurate contouring of intraprostatic GTV is required as the starting point both for biopsy guidance and for radiomic analysis. However, this procedure is extremely time-consuming and affected by inter-reader heterogeneity, often resulting in non-replicable segmentations (12). Several CNNs have already been proposed for GTV segmentation in other oncological settings (19-21), bearing the potential to become a "one-stop shop" for improving the diagnostics and prognostics of various tumors, including PCa (25).
Kostyszyn and colleagues were the first to generate a CNN for the automatic segmentation of intraprostatic cancer lesions on PSMA PET images (18). This study was a joint effort of 3 different Institutions, 2 in Germany and 1 in China. The generated model was trained on 152 patients, employing images acquired with different tomographs in different centers (1 in Germany and 1 in China). However, only 20 patients in the testing cohort were studied at an external institution (center 3, Germany) not used for training, limiting conclusions regarding the model's generalizability.
Validation of AI models in external, independent cohorts is crucial to assess their robustness and, consequently, their potential utility. In our study, we tested the model generated by Kostyszyn and colleagues on a cohort of 85 patients examined with 68Ga-PSMA PET at our Institution. Considering that image pre-processing can affect the model performance, as previously described in Kostyszyn et al. (18), all pre-processing steps were performed in agreement with the reference work. However, in the present study, images were independently reviewed by two Nuclear Medicine physicians. The first one (Exp 1) followed the instructions given in Kostyszyn et al. (18), while the second (Exp 2) was not informed of how images were viewed in the reference work, thus avoiding the introduction of any bias relative to the adopted segmentation method. The trained CNN model achieved a moderately good performance on our cohort, reaching at best a median DSC = 0.74. Interestingly, results were independent of the modality used to acquire the images, despite the model being originally trained only on PET/CT images, as well as of the windowing of voxel values used when defining the ground truth labels. These results suggest that using images acquired with several different PET/CT scanners for training contributed to increasing model robustness. Moreover, it has been shown that the thresholding of voxel values SUVmin-max: 0-5 yields relatively stable contouring, as also reported in a previous work of the same group (12). However, the CNN performance was affected by the volume of the ground truth labels (GTV-Exp 1 and GTV-Exp 2), resulting in more accurate segmentations for larger lesions.
The main limitation of this study is its monocentric nature, as PET images were acquired in a single Institution. However, as our center was not included in the reference work of Kostyszyn et al., our population represents a large, independent and external testing cohort. Moreover, we included patients examined with either PET/CT (N = 39) or PET/MRI (N = 46). This could have potentially affected the results, but it also allowed the comparison of model performance on images acquired with different modalities. Post hoc analyses showed no statistically significant difference in CNN performance between images acquired with PET/MRI and those acquired with PET/CT. Nineteen patients studied with 18F-PSMA were included in the paper presented by Kostyszyn et al. All patients considered in this work underwent 68Ga-PSMA PET; therefore, future studies are needed to assess the model's generalizability to 18F-PSMA PET findings.
In conclusion, the trained and publicly available CNN model presented by Kostyszyn et al. (18) yields fairly accurate contouring of intraprostatic cancer lesions on 68Ga-PSMA PET images that could be used as a starting point for quantitative analysis using radiomics or deep learning approaches. Nonetheless, more robust performance is needed for the generation of AI-based decision support technologies that can be used and exploited in daily clinical practice.

Data availability statement
Data supporting the conclusions of this article will be made available by the corresponding author upon reasonable request.

Ethics statement
The studies involving human participants were reviewed and approved by the Ethics Committee of IRCCS San Raffaele Scientific Institute. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.