Clinical evaluation on automatic segmentation results of convolutional neural networks in rectal cancer radiotherapy

Purpose: Image segmentation is essential in conformal radiotherapy techniques, but it can be time-consuming and lacks consistency between different oncologists. We aimed to evaluate automatic delineation results generated by convolutional neural networks (CNNs) from geometry and dosimetry perspectives and explore the reliability of these segmentation tools in rectal cancer.
Methods: Forty-seven rectal cancer cases treated from February 2018 to April 2019 were randomly collected retrospectively in our cancer center. The oncologists delineated regions of interest (ROIs) on planning CT images as the ground truth, including clinical target volume (CTV), bladder, small intestine, and femoral heads. The corresponding automatic segmentation results were generated by DeepLabv3+ and ResUNet, and we also used Atlas-Based Autosegmentation (ABAS) software for comparison. The geometry evaluation was carried out using the volumetric Dice similarity coefficient (DSC) and surface DSC, and critical dose parameters were assessed based on replanning optimized by clinically approved or automatically generated CTVs and organs at risk (OARs), i.e., Plan_ref and Plan_test. The Pearson test was used to explore the correlation between geometric metrics and dose parameters.
Results: In geometric evaluation, DeepLabv3+ performed better in DSC metrics for the CTV (volumetric DSC, mean = 0.96, P < 0.01; surface DSC, mean = 0.78, P < 0.01) and small intestine (volumetric DSC, mean = 0.91, P < 0.01; surface DSC, mean = 0.62, P < 0.01), while ResUNet had advantages in volumetric DSC of the bladder (mean = 0.97, P < 0.05). For critical dose parameters analysis between Plan_ref and Plan_test, there was a significant difference for target volumes (P < 0.01), and no significant difference was found for the ResUNet-generated small intestine (P > 0.05). For the correlation test, a negative correlation was found between DSC metrics (volumetric, surface DSC) and dosimetric parameters (δD95, δD98, HI, CI) for target volumes (P < 0.05), and no significant correlation was found for most tests of OARs (P > 0.05).
Conclusions: CNNs show remarkable repeatability and time-saving in automatic segmentation, and their accuracy also has a certain potential in clinical practice. Meanwhile, clinical aspects, such as dose distribution, may need to be considered when comparing the performance of auto-segmentation methods.


Introduction
Preoperative radiotherapy is currently considered to be the standard treatment for locally advanced rectal cancer and has been proven to reduce local recurrence (1-3). With the development of radiotherapy technology, such as intensity modulated radiotherapy (IMRT) and volumetric modulated arc therapy (VMAT), the target volume can receive a highly conformal dose distribution (4). In addition, it has been proven that IMRT and VMAT are dosimetrically superior to other conformal techniques in protecting organs at risk (OARs) in rectal cancer (5). Thus, the accurate delineation of the clinical target volume (CTV) and OARs is crucial for treatment planning in rectal cancer.
Interobserver differences occur during manual delineation and depend on the oncologists' clinical experience, resulting in significant changes in dose distributions (6). Multiple studies have applied deep learning methods to automatic segmentation to address the time consumption and lack of consistency of manual contouring (7-9). Based on the planning computed tomography (pCT) images, the oncologists delineate regions of interest (ROIs) to form a training set. These structures are imported into deep learning models together with the CT images, and their corresponding features are extracted to train the models according to the framework characteristics of each model.
The accuracy of automatic segmentation requires clinical evaluation. Objective evaluation metrics such as the volumetric Dice similarity coefficient (volumetric DSC) and Hausdorff distance (HD) are widely used, and some studies have carried out dosimetry assessments (10-12). However, clinical evaluation of the quality of deep learning delineation has limitations (13). Considering the different accuracy requirements of CTVs and OARs in clinical practice, it is necessary to combine their clinical importance and tolerable errors and carry out a comprehensive evaluation from the perspectives of geometry and dosimetry.
We carried out a retrospective study of radiotherapy patients with rectal cancer. The CTV and OARs were segmented manually as the ground truth (GT); two convolutional neural networks that we had trained, DeepLabv3+ and ResUNet, were used for automatic delineation (13); and a common method, Atlas-Based Autosegmentation (ABAS), was used as a comparison. Our research aimed to explore the clinical impact of auto-segmentation results from a dosimetric perspective.

Patient data
The retrospective study was approved by the ethics committee of West China Hospital in 2019, with no extra health risks and no need for patient consent. Rectal cancer patients treated from February 2018 to April 2019 at West China Hospital were chosen randomly, and metastases in advanced patients were ignored. Each patient was immobilized in a supine position with arms over the head using a radiotherapy thermoplastic mold, and this position was applied during simulation and treatment. The contrast-enhanced CT images (tube voltage, 120 kVp; matrix size, 512 × 512; voxel resolution, 0.9 × 0.9 × 3.0 mm in left-right, antero-posterior and cranio-caudal directions) were acquired as patient pCT on the same CT scanner (SOMATOM Definition AS, Siemens Healthcare).
Based on the pCT, a radiation oncologist manually segmented the CTV and OARs of rectal cancer by referring to Radiation Therapy Oncology Group consensus guidelines. Then, the structures were modified and approved by a senior expert physician and labelled as ground truth, including CTV_GT, bladder_GT, small intestine_GT, left femoral head_GT and right femoral head_GT.

Automatic segmentation
DeepLabv3+ and ResUNet, two typical CNNs, were used for automatic delineation. DeepLabv3+ employs an atrous spatial pyramid pooling module and concatenation-based aggregation for the extraction and integration of high-level features, and ResUNet has shortcut connections for each level of features (13). The pCT images of all patients were imported into the models to obtain the mask of each structure on every CT slice. The two-dimensional masks were then converted to a three-dimensional structure in DICOM format, imported to the treatment planning system (TPS), and labelled as ROI_DeepLabv3+ and ROI_ResUNet, respectively. The network models were uploaded to GitHub (https://github.com/hujunjiescu/DeepRadiology_rectum), and the architecture diagram is shown in Figure 1.
The CNN models had the same training settings. The contouring tasks were implemented in Python using the PyTorch deep learning framework. The experiments were performed on a Linux workstation with an Intel Xeon E5-2620 v3 CPU @ 2.4 GHz, an NVIDIA Tesla K40 Xp GPU, and 64 GB RAM. The loss function for the optimization was the weighted cross-entropy, in which N, C, w, y, and a denote the batch size, number of classes, weight factor, ground truth sets, and prediction sets, respectively. The batch size N was set at 10, the weight factor w at 2, and the total number of training epochs T at 100. The stochastic gradient descent method was used to optimize the network with the initial learning rate set to 0.01, which was multiplied by $(1 - t/T)^{0.9}$ at epoch t. The segmentation results were rewritten into DICOM RT structure (RS) files based on their original spatial resolutions.
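The training settings above can be illustrated with a minimal sketch. The polynomial decay factor follows the text directly; the loss, however, is only an assumed standard form of per-class weighted cross-entropy, since the exact expression used by the authors is not reproduced in this section. The function names are hypothetical and are not taken from the authors' repository.

```python
import math


def poly_learning_rate(initial_lr, epoch, total_epochs, power=0.9):
    """Learning rate at a given epoch: initial_lr * (1 - t/T) ** power."""
    return initial_lr * (1.0 - epoch / total_epochs) ** power


def weighted_cross_entropy(targets, probs, class_weights):
    """Batch-averaged weighted cross-entropy (assumed form, not the paper's exact loss).

    targets:       N one-hot rows of length C (ground truth y)
    probs:         N rows of C predicted probabilities (prediction a)
    class_weights: C weights (weight factor w, assumed to act per class)
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y_row, a_row in zip(targets, probs):
        for w, y, a in zip(class_weights, y_row, a_row):
            total -= w * y * math.log(a + eps)
    return total / len(targets)


if __name__ == "__main__":
    # Settings quoted in the text: initial rate 0.01, T = 100 epochs.
    for t in (0, 50, 99):
        print(t, poly_learning_rate(0.01, t, 100))
```

With the quoted settings, the learning rate starts at 0.01 and decays smoothly toward zero as t approaches T.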
The ABAS worked on CT datasets using a multi-patient atlas. We randomly selected 5 atlas patients from the CNN training set, and their pCT images and manually contoured structures were imported into the ABAS software (version 2.01.00, Elekta CMS, Inc.). The Simultaneous Truth And Performance Level Evaluation (STAPLE) algorithm was used to fuse the multiple single-subject atlas auto-segmentation sets into one multi-subject auto-segmentation set (6).

Treatment plans
To evaluate the clinical dosimetry value of the automatic delineation methods, a two-round optimization protocol was performed using the TPS (RayStation, version 4.7.5, RaySearch Laboratories, Stockholm, Sweden): 1) The corresponding PTV was obtained by expanding the CTV with a three-dimensional margin of 5 mm; 2) The dose prescription was set to 50.4 Gy in 28 fractions to the PTV; 3) Two full arcs, one from 181° to 180° clockwise and the other from 180° to 181° counterclockwise, were designed using the VMAT technique and 6 MV photons; 4) The initial optimization parameters applied to the first-round VMAT planning are shown in Table 1, and in the second round, the weight of Parameter 4 was set to 100, and the weight of the Max EUD objectives was set to 0.01.
Taking the difficulty of CTV delineation into consideration, we divided the results of the three auto-segmentation methods into CTV and OAR groups and introduced them as optimization objectives separately to obtain Plan_test: 1) the plan optimized using CTV_DeepLabv3+ and OAR_GT, labelled Plan1; 2) the plan optimized using CTV_ResUNet and OAR_GT, labelled Plan2; 3) the plan optimized using CTV_ABAS and OAR_GT, labelled Plan3; 4) the plan optimized using CTV_GT and OAR_DeepLabv3+, labelled Plan4; 5) the plan optimized using CTV_GT and OAR_ResUNet, labelled Plan5; and 6) the plan optimized using CTV_GT and OAR_ABAS, labelled Plan6. In addition, we obtained a Plan_ref optimized using ROI_GT, calculated characteristic dose parameters of ROI_GT in all plans, and compared the parameters extracted from each Plan_test with those from Plan_ref.

FIGURE 1 Convolutional neural network architecture diagram: (A) architecture of DeepLabv3+ for segmentation, and (B) architecture of ResUNet for segmentation. The two networks had been implemented using data sets from 199 patients (training set with 98 cases, validation set with 38 cases, and test set with 63 cases) in a previous study (13).

Evaluation metrics and statistical analysis
In terms of geometry, two DSC metrics were used for quantitative evaluation, both calculated from the overlap of the ROI structures. The ROIs were converted from RS files to thresholded masks, and the masks were divided into slices corresponding to the CT images. The mathematical operations were carried out on all mask slices of a given structure, and the average was taken as the DSC value.
The volume similarity was evaluated by the volumetric DSC, calculated using:

$$\mathrm{DSC}_{\mathrm{vol}} = \frac{2\,|V_1 \cap V_2|}{|V_1| + |V_2|}$$

where $V_1$ denotes the ROIs of the ground truth set and $V_2$ the corresponding auto-segmentation structures. The volumetric DSC varies between 0 (no overlap) and 1 (complete overlap) and indicates the degree of overlap between ROI_GT and the auto-segmentation results.
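As an illustration of the volumetric DSC, the following minimal sketch computes it for two binary masks stored as nested lists (slices × rows × columns). It evaluates the overlap over the whole volume at once, whereas the study averaged over mask slices; the helper names and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
def mask_to_voxels(mask):
    """Collect the (slice, row, col) indices of all foreground voxels."""
    return {
        (s, r, c)
        for s, plane in enumerate(mask)
        for r, row in enumerate(plane)
        for c, value in enumerate(row)
        if value
    }


def volumetric_dsc(mask_ref, mask_test):
    """Dice similarity coefficient: 2|V1 ∩ V2| / (|V1| + |V2|)."""
    v1 = mask_to_voxels(mask_ref)
    v2 = mask_to_voxels(mask_test)
    denominator = len(v1) + len(v2)
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(v1 & v2) / denominator


if __name__ == "__main__":
    ground_truth = [[[1, 1, 0], [1, 1, 0]]]  # one slice, 4 foreground voxels
    prediction = [[[0, 1, 0], [0, 1, 1]]]    # one slice, 3 foreground voxels
    print(volumetric_dsc(ground_truth, prediction))  # 2 * 2 / (4 + 3)
```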
To characterize the proportion of the contour edges that need to be redrawn, the surface DSC was applied to assess the agreement between just the surfaces of two contours (14). The surface DSC represents the proportion of surface units within an acceptable distance, and the acceptable tolerance t was defined as the 95th percentile of the distance difference between the contours of two oncologists for each structure. As shown in Figure 2, the masks of the two ROIs to be compared were used to extract the contour surfaces, labelled S, and each surface border was then expanded by t both inward and outward to obtain B(t). Accordingly, the parts of one structure's S that did not fall within the other structure's B(t) were regarded as exceeding the tolerance.
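The surface DSC can be sketched in two dimensions as follows. This is a simplified, pixel-based illustration of the commonly used definition, (|S1 ∩ B2(t)| + |S2 ∩ B1(t)|) / (|S1| + |S2|), and is assumed rather than taken from the authors' implementation: boundary pixels stand in for the surface S, and a boundary pixel counts as inside the other border region B(t) when it lies within distance t of any boundary pixel of the other contour. The function names and the `spacing` parameter are hypothetical.

```python
import math


def boundary_pixels(mask):
    """Foreground pixels with at least one 4-neighbour that is background or off-grid."""
    rows, cols = len(mask), len(mask[0])
    edge = set()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not mask[rr][cc]:
                    edge.add((r, c))
                    break
    return edge


def count_within(points, others, tolerance, spacing):
    """Number of points lying within `tolerance` of any point in `others`."""
    count = 0
    for r, c in points:
        for r2, c2 in others:
            if math.hypot((r - r2) * spacing[0], (c - c2) * spacing[1]) <= tolerance:
                count += 1
                break
    return count


def surface_dsc(mask_a, mask_b, tolerance, spacing=(1.0, 1.0)):
    """Fraction of both contours that agree within the tolerance."""
    s_a, s_b = boundary_pixels(mask_a), boundary_pixels(mask_b)
    if not s_a and not s_b:
        return 1.0  # assumed convention for two empty contours
    agree = count_within(s_a, s_b, tolerance, spacing)
    agree += count_within(s_b, s_a, tolerance, spacing)
    return agree / (len(s_a) + len(s_b))
```

The brute-force distance search is quadratic in the number of boundary pixels, which is acceptable for a sketch; a distance transform would be the usual choice for full-resolution CT masks.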
In addition, the clinical practicability of the automatically delineated contours was evaluated by the accuracy of the dose distribution in plan design. The characteristic dosimetry parameters of ROI_GT in every plan were extracted for comparison. D2 (Dn representing the absolute dose of n% volume) and D98 were extracted to signify hot spots and cold spots for all structures, respectively (4). V50.4 for the CTV indicated whether the CTV received enough dose, while the conformal index ($\mathrm{CI} = \frac{TV_{RI}}{TV} \times \frac{TV_{RI}}{V_{RI}}$, where $TV$ is the target volume, $TV_{RI}$ is the target volume covered by the 95% prescription dose, and $V_{RI}$ is the volume of the 95% prescription dose) and homogeneity index ($\mathrm{HI} = \frac{D_2 - D_{98}}{D_{50}}$) were used to evaluate the PTV (15, 16). Moreover, relative dose parameters Vn (volume percentage receiving radiation ≥ n Gy) related to acute or late toxicity of OARs were obtained for all plans, including V30, V40, V50 of the bladder, V15, V45, V50 of the small intestine, and V40, V45 of the femoral head (17-20).
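The dose parameters defined above can be computed from a list of voxel doses, as in the minimal sketch below. It assumes equal-volume voxels and reads Dn as the minimum dose received by the hottest n% of the structure; a treatment planning system would derive these values from its own dose-volume histogram, so the function names and the rounding convention are illustrative assumptions.

```python
import math


def dose_at_volume(voxel_doses, n):
    """D_n: minimum dose received by the hottest n% of equal-volume voxels."""
    ordered = sorted(voxel_doses, reverse=True)
    k = math.ceil(n * len(ordered) / 100)   # number of voxels in the hottest n%
    k = max(1, min(len(ordered), k))        # keep the index inside the list
    return ordered[k - 1]


def volume_at_dose(voxel_doses, threshold_gy):
    """V_n: percentage of voxels receiving at least threshold_gy."""
    hits = sum(1 for d in voxel_doses if d >= threshold_gy)
    return 100.0 * hits / len(voxel_doses)


def homogeneity_index(voxel_doses):
    """HI = (D2 - D98) / D50."""
    d2 = dose_at_volume(voxel_doses, 2)
    d98 = dose_at_volume(voxel_doses, 98)
    d50 = dose_at_volume(voxel_doses, 50)
    return (d2 - d98) / d50


def conformity_index(tv, tv_ri, v_ri):
    """CI = (TV_RI / TV) * (TV_RI / V_RI), all volumes in the same unit."""
    return (tv_ri / tv) * (tv_ri / v_ri)


if __name__ == "__main__":
    doses = [float(d) for d in range(1, 101)]  # toy data, not patient doses
    print(dose_at_volume(doses, 2), dose_at_volume(doses, 98))
    print(homogeneity_index(doses), conformity_index(100.0, 95.0, 110.0))
```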
The collected data were analyzed using SPSS Statistics software (version 26.0, SPSS Inc., Chicago, IL, United States). Normality tests were performed on all datasets of geometric and dosimetric parameters. Paired samples t tests or Wilcoxon signed rank tests were chosen for group comparison, with statistical significance set at P < 0.05 (2-tailed). To make a more intuitive comparison, we calculated the absolute difference between the dose parameters extracted from Plan_ref and Plan_test, denoted as D_Abs, and carried out a statistical description. In particular, Vn, HI, and CI were relative values and were directly subtracted, while Dn were absolute values and were converted to a normalized dose difference ($dD_n = \frac{|D_{n,\,\text{test plan}} - D_{n,\,\text{reference plan}}|}{\text{prescription dose}} \times 100\%$) (21, 22). In addition, we used the Pearson test to check whether the volumetric and surface DSC were correlated, to explore whether the geometric metrics of a structure were correlated with its corresponding dose parameters, and to measure the degree of linear correlation.
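The normalized dose difference and the Pearson correlation coefficient can be sketched as follows. The study performed its statistics in SPSS, so this is only an illustration of the two formulas; the function names are hypothetical, the default prescription of 50.4 Gy is taken from the treatment plan description, and no P value is computed here.

```python
import math


def normalized_dose_difference(d_test, d_ref, prescription_gy=50.4):
    """dDn = |Dn(test plan) - Dn(reference plan)| / prescription dose * 100 (%)."""
    return abs(d_test - d_ref) / prescription_gy * 100.0


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    # Toy values only: higher DSC paired with smaller dose difference.
    dsc = [0.90, 0.93, 0.96, 0.98]
    dd95 = [2.1, 1.6, 0.9, 0.4]
    print(pearson_r(dsc, dd95))  # negative for this toy example
```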

Results
Forty-seven rectal cancer patients were included in the study. The median age was 54 years, with an interquartile range (IQR) of 13.97, and other characteristics are shown in Table 2. For patients diagnosed with stage IV disease, the study ignored metastases in the training and evaluation. In 5 cases, the structures were not successfully generated by the ABAS software.
The statistical analysis results of volumetric and surface DSC are shown in Figure 3. In general, the volumetric and surface DSC of the three automatic segmentation results were significantly different, except for the surface DSC of the bladder structure delineated by DeepLabv3+ and ResUNet (P = 0.78). For CTV, DeepLabv3+ showed the best performance on volumetric DSC (mean = 0.96, P < 0.01) and surface DSC (mean = 0.96, P < 0.01). The DSCs of the bladder contoured automatically by the CNNs were significantly higher than those of ABAS (P < 0.01). Although there was no significant difference in surface DSC between the two CNNs, the mean value of ResUNet was slightly higher than that of DeepLabv3+ (Bladder_DeepLabv3+ vs. Bladder_ResUNet, 0.82 vs. 0.85, P = 0.78). In the delineation of the small intestine, DeepLabv3+ showed significant advantages: its mean DSCs (volumetric DSC, 0.91; surface DSC, 0.62) were greater than those of the other two groups (P < 0.01), and it also had a lower standard deviation (volumetric DSC, 0.05; surface DSC, 0.10). For the segmentation of the right and left femoral heads, ABAS achieved the best performance, followed by ResUNet, and DeepLabv3+ ranked last based on the volumetric DSC (mean for the right femoral head, ABAS vs. ResUNet vs. DeepLabv3+, 0.94 vs. 0.85 vs. 0.84, P < 0.01) and surface DSC (mean for the right femoral head, ABAS vs. ResUNet vs. DeepLabv3+, 0.84 vs. 0.70 vs. 0.67, P < 0.01), despite more outliers in the volumetric DSC of ABAS. The ground truth and the automatic delineation of a random case are shown in Figure 4.
The D_Abs of the dose parameters between Plan_test and Plan_ref are shown in Table 3, which contains descriptive statistics and the results of statistical tests. We observed a statistical difference in the dose distribution of the real CTV between Plan_ref and Plan1-3 (P < 0.01), which used the automatically delineated CTV as an inverse optimization parameter. However, some dose parameters of the real OARs in Plan1 were not significantly different from those in Plan_ref (P > 0.05), which showed a similar trend to the outperformance of DeepLabv3+ in the geometric evaluation. When we introduced the auto-segmentation OAR groups into the inverse plan, we found that only the critical dose parameters of the small intestine showed no statistical difference between Plan5 and Plan_ref (P > 0.05), even though the small intestine generated by ResUNet was not optimal in the geometric assessment (volumetric DSC, mean = 0.84; surface DSC, mean = 0.52).
Although the volumetric and surface DSCs of all structures were numerically different, there was a correlation between them (P < 0.01). The correlation analysis results of geometric metrics and dose parameters are shown in Table 4. The volumetric and surface DSCs of the CTV generated by ResUNet were correlated with all dose parameters of the target volume (P < 0.05) in Plan2; on the other hand, the volumetric DSCs of the three CTV groups were correlated with dD95, dD98, HI, and CI (P < 0.05), respectively, in Plan1-3. For the bladder, the volumetric DSC of the ResUNet results was correlated with all dose parameters of the bladder in Plan5, and dD2 in Plan4-6 was correlated with both DSC metrics. There were few correlations for the small intestine, only volumetric DSC vs. V15 and surface DSC vs. V45/V50. For the bilateral femoral heads, both DSC metrics of the two CNNs were correlated with the corresponding dD2 in Plan4-5.

FIGURE 2 Calculation method of surface DSC. (A) Acceptable tolerance t value, (B) surface DSC formula, (C) the calculation process taking CTV as an example, in which the red lines in $S_1 \cap B_2(t)$ and $S_2 \cap B_1(t)$ were the parts that exceed the tolerance.

Discussion
In our retrospective study, geometric and dosimetric evaluations of the CTV and OARs for rectal cancer were carried out using manually segmented structures as the ground truth, while structures generated by the commercial software ABAS served as a reference. The results showed that the CNN models had a remarkable performance in accuracy and repeatability of automatic segmentation, but their performance in geometric metrics and dose parameters was not completely consistent.
The effect of automatic segmentation requires objective metrics for evaluation. Volumetric DSC is the most commonly used metric and describes the degree of overlap between two structures; however, it weights all misplaced segmentations equally and cannot characterize the distance between ROI surfaces. For example, a structure that needs slight modification over a large proportion of its contour, which takes a long time, may obtain a high volumetric DSC, while a structure that requires a large but local modification, which takes little time, may have a low volumetric DSC. For the description of surface distance, a commonly used metric is the HD, which represents the maximum of the shortest distances from any delineated point to the other contour (23). Its limitation is that it describes the maximum surface distance rather than the degree of surface difference over the entire structure. Therefore, our study includes the surface DSC. This metric combines subjective and objective assessment: it incorporates tolerable interobserver subjective errors, describes the degree of overlap from the perspective of the entire structure, and to a certain extent characterizes the potential cost of modification. Some studies have measured the time spent on manually correcting auto-contouring results (24). In our previous study, we calculated the computing time of the CNNs, in which the average time for generating a case was about 28 s for DeepLabv3+ and 35 s for ResUNet; and we recorded the manual correction time of the results, in which the average cost for CTV_DeepLabv3+, CTV_ResUNet, and all OARs was about 11 min, 7 min, and below 5 min, respectively (13). Even with large inter-observer variability, it can still be determined that the manual contouring time is much greater than the total time of automatic contouring and manual modification.
Besides geometric accuracy, the effect of automatically generated ROIs on treatment planning should also be considered. Since the dose distribution is affected by mechanical and physical factors and cannot fully fit the edge of the structure, parts of the automatic contouring that are not perfectly consistent with the clinical ground truth may still be covered by the isodose lines, which can perhaps be considered as the "robustness" of the structure. Therefore, the evaluation of automatic delineation results should be combined with dosimetry results rather than rely on geometric evaluation alone. There are many methods for dose evaluation; the simplest is to transplant a reference plan onto different ROI sets and calculate the dose parameters for comparison (15). However, these parameters do not correspond to a real plan, because a dose distribution replanned using different optimization objectives is rarely consistent with the reference plan. Another method is to generate a plan for each group of ROIs and compare their dose characteristics without any reference. We drew on the evaluation method for interobserver variation, which assesses plans generated from different ROI sets by calculating the dose parameters of the reference structures (25). Based on this method, we added the definition of "ground truth", designed plans using the CTV and OAR groups separately, and carried out dosimetry evaluation on ROI_GT. The results indicate that some crucial dose parameters in the actual plan have no significant difference from the reference plan even if the automatically segmented structure cannot completely overlap with ROI_GT, especially for the automatically delineated OARs. Interestingly, it can be seen from Plan1 that DeepLabv3+ has a smaller effect on OAR dose distribution, following a similar trend to its dominance in DSC metrics. However, the ResUNet-generated small intestine did not show an advantage in the geometric assessment, but its dose parameters in Plan5 did not differ from those in Plan_ref. Therefore, we cannot pursue only the improvement of geometric metrics; the performance assessment of auto-segmentation methods should perhaps be combined with dose evaluation, as the latter is more relevant to clinical outcome (26).
At present, our institution has established an integrated platform for automatic delineation (including head, chest, abdomen, pelvis, etc.), which is connected to the TPS and CT workstations through the hospital's internal network. The platform supports CT image transmission, conversion between RS files and masks, and continuous input of abnormal cases (such as recognizing the skull as the femoral head) to improve network performance. This is also a key step in the real application of artificial intelligence to the clinic. The auto-segmentation assessment should integrate subjective and objective methods, but subjective assessment introduces inter-observer variability, so it needs multi-center external validation (27). This is a limitation of this study, and we plan to extend the integrated platform to other hospitals and collect data for external validation. In addition, the dose evaluation method also needs to be further improved. In this study, all structures were divided into CTV and OAR groups, and the results may be more targeted and reliable if each structure is taken as a separate variable of the optimization objective. In terms of analysis methods, paired tests of dose parameters were used in this study; a possible alternative is to directly compare RT Dose files and assess dose differences from both global and local perspectives.

TABLE 3 footnote b. D_Abs were the absolute differences between specific dose parameters extracted from the test plans and the reference plan, described by the median (IQR). In particular, the difference of Dn was normalized according to the dDn formula. P values with no statistical significance are shown in bold.

Conclusion
In this study, we evaluated the automatic segmentation results from the perspectives of geometry and dosimetry. The results showed the advantages of speed and repeatability of deep learning in ROI delineation, which is of great help to the routine workflow of radiotherapy. The auto-segmentation function of CNNs is a stable tool for VMAT and IMRT treatment plan design, and it may have further potential in adaptive radiotherapy, which requires repeat CT scans and CTV delineation before each treatment fraction (28). Moreover, with the advancement of MR-Linac, automatic segmentation based on magnetic resonance images has been applied to adaptive radiotherapy, which poses a great challenge to the speed and accuracy of the networks and to their effect on the dose distribution (29, 30).
The characteristics of convolutional neural networks are different, and the segmentation effect on the ROIs of rectal cancer also differs.We can integrate the two networks or classify them according to the advantageous structure of each network; however, whether further exploration can bring better results requires a comprehensive clinical evaluation.

FIGURE 3 Statistical analysis results of volumetric and surface DSC. The paired-sample tests were performed between the three auto-segmentation results at a significance level of 0.05 (2-tailed), and the missing data in the ABAS dataset (n = 42) were replaced with the mean.

FIGURE 4 A case of structures comparison. The red line represents ground truth, the yellow line represents automatic segmentation of DeepLabv3+, and the blue line represents automatic segmentation of ResUNet.

TABLE 1
Initial planning objective set for plan optimization.

TABLE 2
Characteristics of 47 patients.

TABLE 3
Statistical analysis of dose parameter difference between test and reference plans.
a. P values (2-tailed) marked with * indicated results of the paired samples Wilcoxon signed rank test for original dose parameters, while the unmarked P values were the results of paired samples t tests.