Geometric and Dosimetric Evaluation of Deep Learning-Based Automatic Delineation on CBCT-Synthesized CT and Planning CT for Breast Cancer Adaptive Radiotherapy: A Multi-Institutional Study

Purpose We developed a deep learning model for automatic multitarget delineation on planning CT (pCT) and on synthetic CT (sCT) images generated from cone-beam CT (CBCT) images, and evaluated its geometric and dosimetric impact for breast cancer adaptive radiation therapy. Methods We retrospectively analyzed 1,127 patients from two medical institutions who were treated with radiotherapy after breast-conserving surgery. CBCT images acquired for patient setup under breath-hold guided by an optical surface monitoring system were used to generate sCT with a generative adversarial network. Organs at risk (OARs), clinical target volume (CTV), and tumor bed (TB) were delineated automatically with a 3D U-Net model on pCT and sCT images. The geometric accuracy of the model was evaluated with the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95). Dosimetric evaluation was performed by quick dose recalculation on sCT images, using gamma analysis and dose-volume histogram (DVH) parameters. The relationship between ΔD95, ΔV95, and DSC-CTV was assessed to quantify the clinical impact of geometric changes in the CTV. Results The ranges of DSC and HD95 were 0.73–0.97 and 2.22–9.36 mm for pCT and 0.63–0.95 and 2.30–19.57 mm for sCT from institution A, and 0.70–0.97 and 2.10–11.43 mm for pCT from institution B. The quality of sCT was excellent, with a mean absolute error (MAE) of 71.58 ± 8.78 HU. The mean gamma pass rate (3%/3 mm criterion) was 91.46 ± 4.63%. When DSC-CTV fell to 0.65, V95 varied by more than 6% and D95 by more than 3 Gy; when DSC-CTV reached 0.80, V95 varied by less than 4% and D95 by less than 2 Gy. The mean ΔD90/ΔD95 was less than 2 Gy/4 Gy for the CTV and less than 4 Gy/5 Gy for the TB for all patients. The cardiac dose difference in left breast cancer cases was larger than that in right breast cancer cases. Conclusions Accurate multitarget delineation is achievable on pCT and sCT via deep learning. The results show that dose distribution needs to be considered when evaluating the clinical impact of geometric variations during breast cancer radiotherapy.


INTRODUCTION
Intensity-modulated radiotherapy (IMRT) after breast-conserving surgery significantly improves the survival of breast cancer patients (1). However, patient setup errors and anatomical changes occur between treatment fractions (2). The mean central lung distance varies over a range of 0.59–2.94 cm (3). The mean 3D displacement of patient setup is 7.3 and 7.6 mm with laser and port film setup, respectively (4). Such deviations can lead to inconsistencies between the delivered dose and the planned dose (5). These large interfraction variations motivate the need for adaptive radiotherapy, which can adjust the plan according to changes in the target volume (6,7). Because the patient is lying on the couch waiting for treatment, plan evaluation and adaptation need to be completed as quickly as possible. Online adaptation, which requires real-time delineation of the target volumes and organs at risk (OARs) for re-planning, is a promising technique (8). Online adaptation has been studied mainly for prostate cancer and for head and neck cancer (9–13). Cone-beam CT (CBCT) is a common tool for position verification in radiotherapy and can be used for plan adaptation (14,15). However, imaging artifacts caused by respiratory motion make CBCT-based adaptive radiotherapy for breast cancer difficult. In addition, CBCT images cannot be used directly for dose calculation because of inaccurate HU values and need to be converted into synthetic CT for dosimetric evaluation (16–19).
The delineation of target volumes and OARs is a prerequisite for adaptive radiotherapy. However, manual delineation is time-consuming and labor-intensive and cannot meet the requirements of real-time adaptive radiotherapy (20), so an automatic delineation model is needed (21,22). Some researchers have used atlas-based segmentation software to delineate target volumes on computed tomography (CT) images for radiotherapy, but the Dice scores achieved with such commercial software are not high enough (23–26). CBCT-based delineation can be achieved either by deformable image registration or by direct delineation on CBCT images. Deformable image registration transfers the contours from planning CT images to CBCT images (27). However, its performance depends on the image quality and the algorithm, and it does not perform well for patients with large anatomical variations, which leads to uncertainty in the propagated contours (28). Direct CBCT-based delineation can reduce the uncertainty from registration errors. Schreier et al. (29) investigated segmentation of the male pelvis using CBCT and CT images. Peroni et al. (11) developed an automatic strategy to generate online virtual CT and automatically segmented structures using CBCT and virtual CT images for head and neck cancer adaptive radiation therapy. Inter-observer variability is high in the delineation of target volumes and OARs on CT and CBCT scans of the chest. At present, most automatic segmentation methods for chest images do not perform well, and a better-performing model is needed to delineate all the target volumes and OARs accurately in a single pass. Additionally, geometric metrics do not fully indicate clinical quality, so the performance of an automatic delineation model must also be evaluated in terms of clinical applicability (30). The geometric and dosimetric changes between planning CT (pCT) and synthetic CT (sCT) need to be evaluated because of their high clinical significance for adaptive therapy.
In this study, we investigated the feasibility of automatically delineating multiple contours with deep learning for breast cancer radiation therapy. First, synthetic CT images were generated from CBCT images with a cycle generative adversarial network (cycleGAN). Second, we developed an automatic delineation model using a 3D U-Net, trained on the pCT images and radiotherapy structures of breast cancer patients, to delineate the target volumes and OARs on planning CT and synthetic CT images. Third, the treatment plan was transferred from the planning CT image to the synthetic CT image and verified quantitatively by quick dose recalculation for dosimetric evaluation. The flowchart of the proposed method is shown in Supplementary Figure 1 in the Supplementary Material. The clinical impact of geometric variations in target volumes and OARs was evaluated to assess the feasibility of breast cancer adaptive radiotherapy.

Patient Datasets
Datasets obtained from two medical institutions in China between January 2014 and December 2020 were analyzed retrospectively. A total of 1,127 patients (institution A: 1,074; institution B: 53) who received radiotherapy after breast-conserving surgery were included. The data of 75 patients from institution A, comprising pCT images and CBCT images acquired under breath-hold, were split into 52 samples for training and 23 samples for testing the CBCT-to-CT synthesis. The pCT images and structures of the remaining 1,052 patients were randomly divided into a training set (700 patients), a validation set (100 patients), and a test set (252 patients) for training and evaluating the automatic delineation model. In the test set, 199 patients were from institution A and 53 patients were from institution B. The patient characteristics are summarized in Table 1.
Patients were immobilized on the breast bracket before obtaining a CT scan with 5 mm slices (SOMATOM Sensation Open, Siemens). The dimensions of the images were 512 × 512 voxels for each slice. CBCT is widely used for target position verification and setup error correction during breast patient radiation therapy. In this study, CBCT images were captured by a Varian Edge treatment machine (Varian Medical Systems, Palo Alto) in half-fan mode utilizing breath-hold (BH) guided by AlignRT (Vision RT Ltd, London, UK). The images with 512 × 512 voxels for each slice were reconstructed with a 1.99 mm slice thickness. The quality of CBCT with the BH technique is much higher than that with conventional scan. CBCT could be used to generate accurate synthetic CT for dose recalculation.
Clinical target volumes (CTV), tumor bed (TB), and organs at risk (OARs) were delineated by two radiation oncologists according to the ESTRO consensus guideline. The director with 20 years' experience of the radiation oncology department of the corresponding institution was consulted in cases of disagreement. Manually delineated contours were used as ground truth for training and testing. The radiotherapy plans were designed with the Pinnacle3 treatment planning system (Philips Radiation Oncology Systems, Philips Healthcare) by combining 3D conformal radiation therapy (CRT) and intensity modulated radiation therapy (IMRT). The prescribed dose was 52.2 Gy given to CTV and 63.8 Gy given to TB in 29 fractions. CRT achieved 80% of the total dose, and IMRT achieved 20% of the total dose. One hundred percent of the prescribed dose covered 95% of the volume of the target area. The physicist first designed the plan, and then the radiation oncologist and the physicist jointly evaluated the plan before implementation to ensure the quality of the plan.

Automatic Delineation of Target Volumes and Organs at Risk
Our automatic delineation model operates on the full 3D volume. With patch-based training, the model cannot distinguish between the left and right lungs and target volumes. Therefore, through a series of preprocessing steps, as shown in Figure 1A, we feed as much of the body volume as possible into the network so that it has a larger receptive field. First, Hough transform line detection, thresholding, and morphological operations were used to remove the couch and obtain a body mask. Then, the minimum bounding box of the body mask was cropped as the region of interest (ROI) of each subject. Finally, all ROIs were resampled to a spatial resolution of 2 × 2 × 5 mm³, and the intensities were normalized by z-score for training.
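The cropping and normalization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the body mask is assumed to be available already, couch removal and resampling are omitted, and the z-score is assumed to be computed over the whole cropped ROI.

```python
from statistics import mean, pstdev


def bounding_box(mask):
    """Tightest half-open box (z0, z1, y0, y1, x0, x1) around nonzero mask voxels."""
    coords = [(z, y, x)
              for z, plane in enumerate(mask)
              for y, row in enumerate(plane)
              for x, value in enumerate(row) if value]
    if not coords:
        raise ValueError("empty body mask")
    zs, ys, xs = zip(*coords)
    return (min(zs), max(zs) + 1, min(ys), max(ys) + 1, min(xs), max(xs) + 1)


def crop(volume, box):
    """Crop a nested-list volume indexed as volume[z][y][x] to the given box."""
    z0, z1, y0, y1, x0, x1 = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


def zscore(volume):
    """Normalize intensities to zero mean and unit (population) standard deviation."""
    flat = [v for plane in volume for row in plane for v in row]
    mu, sigma = mean(flat), pstdev(flat)
    if sigma == 0:
        raise ValueError("constant volume cannot be z-score normalized")
    return [[[(v - mu) / sigma for v in row] for row in plane] for plane in volume]
```

Nested lists keep the sketch dependency-free; a practical implementation would more likely use array libraries and a dedicated resampling routine.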
We use Res-SE-U-Net (31), a modified 3D U-Net (32), as the automatic delineation network. Res-SE-U-Net comprises a down-sampling path, an up-sampling path, and skip-connection layers, which together extract multiscale image features. In addition, the Res-blocks and SE-blocks give it stronger feature extraction ability than the original U-Net. The workflow of the automatic delineation model is shown in Figure 1. Training the network took about 48 h, whereas all the target volumes and OARs for one patient were predicted in 5 s.

CBCT-to-CT Synthesis
We use a 2D cycleGAN (33), an unsupervised image-to-image translation deep learning framework, to generate sCT. The cycleGAN contains two generators ($G_{\text{CBCT-CT}}$ and $G_{\text{CT-CBCT}}$) and two discriminators ($D_{\text{CT}}$ and $D_{\text{CBCT}}$). The generator $G_{\text{CBCT-CT}}$ takes CBCT as input and generates the sCT; conversely, $G_{\text{CT-CBCT}}$ takes CT as input and generates the synthetic CBCT. The discriminators $D_{\text{CT}}$ and $D_{\text{CBCT}}$ determine whether the CT and CBCT images, respectively, are real or synthesized. The trained model generates the sCT of one patient in 3–4 s. The schematic flow is shown in Supplementary Figure 2 in the Supplementary Material. The optimization of the cycleGAN includes two objective functions: adversarial loss and cycle consistency loss. The adversarial loss reflects the antagonistic objectives of the generators and discriminators. We denote the data distributions as $I_{\text{CT}} \sim p_{\text{data}}(I_{\text{CT}})$ and $I_{\text{CBCT}} \sim p_{\text{data}}(I_{\text{CBCT}})$. In the adversarial loss, $I_{\text{CBCT}}$ is the real CBCT and $G_{\text{CBCT-CT}}(I_{\text{CBCT}})$ is the synthetic CT generated by $G_{\text{CBCT-CT}}$.
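For reference, the two objectives named above can be written in the form used in the original cycleGAN formulation (33). This is an assumption about the form: the exact variant used in this study (for example, a least-squares adversarial loss or particular loss weights) is not stated in the text, and the symbols $\mathcal{L}_{\text{adv}}$ and $\mathcal{L}_{\text{cyc}}$ are introduced here for illustration only.

```latex
\mathcal{L}_{\text{adv}}(G_{\text{CBCT-CT}}, D_{\text{CT}})
  = \mathbb{E}_{I_{\text{CT}} \sim p_{\text{data}}(I_{\text{CT}})}
      \big[ \log D_{\text{CT}}(I_{\text{CT}}) \big]
  + \mathbb{E}_{I_{\text{CBCT}} \sim p_{\text{data}}(I_{\text{CBCT}})}
      \big[ \log \big( 1 - D_{\text{CT}}(G_{\text{CBCT-CT}}(I_{\text{CBCT}})) \big) \big]

\mathcal{L}_{\text{cyc}}(G_{\text{CBCT-CT}}, G_{\text{CT-CBCT}})
  = \mathbb{E}_{I_{\text{CBCT}} \sim p_{\text{data}}(I_{\text{CBCT}})}
      \big[ \lVert G_{\text{CT-CBCT}}(G_{\text{CBCT-CT}}(I_{\text{CBCT}})) - I_{\text{CBCT}} \rVert_1 \big]
  + \mathbb{E}_{I_{\text{CT}} \sim p_{\text{data}}(I_{\text{CT}})}
      \big[ \lVert G_{\text{CBCT-CT}}(G_{\text{CT-CBCT}}(I_{\text{CT}})) - I_{\text{CT}} \rVert_1 \big]
```

The adversarial term for the opposite direction, $\mathcal{L}_{\text{adv}}(G_{\text{CT-CBCT}}, D_{\text{CBCT}})$, follows by exchanging the roles of CT and CBCT.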

Automatic Delineation Performance Evaluation
The performance of the automatic delineation model was evaluated on pCT and sCT separately. To quantitatively assess the delineation accuracy, we used two metrics: the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95). DSC describes the spatial overlap between the automated delineation and the ground truth, and HD95 was used to evaluate the difference in shape. The equations are given as Supplementary Equations (1, 2) in the Supplementary Material.
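As an illustration of the two metrics, the sketch below computes DSC from voxel index sets and HD95 from contour point sets. The supplementary equations are not reproduced here, so the details are assumptions: HD95 is taken as the larger of the two directed 95th-percentile nearest-neighbor distances, the percentile uses linear interpolation, and the function names are hypothetical.

```python
import math


def dice(a, b):
    """Dice similarity coefficient of two collections of voxel index tuples."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty structures are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def _percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def _directed_distances(src, dst, spacing):
    """Distance in mm from every point in src to its nearest point in dst."""
    def to_mm(point):
        return tuple(c * s for c, s in zip(point, spacing))
    dst_mm = [to_mm(q) for q in dst]
    return [min(math.dist(to_mm(p), q) for q in dst_mm) for p in src]


def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric 95% Hausdorff distance between two point sets (ideally surface voxels)."""
    if not a or not b:
        raise ValueError("HD95 is undefined for an empty structure")
    return max(_percentile(_directed_distances(a, b, spacing), 95),
               _percentile(_directed_distances(b, a, spacing), 95))
```

The brute-force nearest-neighbor search is quadratic in the number of points, so it is suitable only for small examples; `spacing` gives the voxel size in mm along each index axis.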

CBCT-Synthesized sCT Quality Evaluation
Mean error (ME), mean absolute error (MAE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and spatial non-uniformity (SNU) were used to evaluate the image quality of sCT and CBCT. The formulas for these metrics are given as Supplementary Equations (4–8) in the Supplementary Material. We selected five regions of interest (ROIs) to calculate the SNU, as shown in Supplementary Figure 3 of the Supplementary Material. ME and MAE quantify the HU difference between the pCT and the sCT: the closer ME is to zero and the lower MAE is, the better the image quality. High PSNR and SSIM indicate high image quality. In this study, deformable registration was performed on the sCT to align it with the pCT, and the metrics were calculated within the body mask of the sCT.
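A minimal sketch of the intensity-based metrics is given below. Because the supplementary equations are not reproduced here, the conventions are assumptions: ME is taken as sCT minus pCT, PSNR uses a caller-supplied HU dynamic range (`data_range`), and the images are passed as flattened lists together with a body mask. The function name is hypothetical.

```python
import math


def image_errors(sct, pct, mask, data_range):
    """Return (ME, MAE, PSNR) between sCT and pCT over voxels where mask is truthy."""
    diffs = [s - p for s, p, m in zip(sct, pct, mask) if m]
    if not diffs:
        raise ValueError("mask selects no voxels")
    n = len(diffs)
    me = sum(diffs) / n                      # signed bias in HU
    mae = sum(abs(d) for d in diffs) / n     # average error magnitude in HU
    mse = sum(d * d for d in diffs) / n
    psnr = math.inf if mse == 0 else 10.0 * math.log10(data_range ** 2 / mse)
    return me, mae, psnr
```

Reporting both ME and MAE is useful because positive and negative errors cancel in ME but not in MAE.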

Dosimetric Evaluation
Dosimetric accuracy was evaluated on the sCT images using clinical breast cancer treatment plans. A quick dose recalculation on the sCT images was performed to verify the treatment plan. The treatment plan transferred from the pCT to the sCT kept the same parameters as the original pCT-based plan. The difference in dose distribution between pCT and sCT was evaluated with gamma analysis and dose-volume histogram (DVH) parameters. The differences in the DVH metrics of target volumes and OARs between pCT and sCT were also assessed for quantitative dosimetric evaluation. The DVH metrics of the target volumes, including D90, D95, and V95, were analyzed. Target coverage was defined as the dose received by 90% and 95% of the target volume (D90, D95) and the percent volume receiving 95% of the prescribed dose (V95) for the TB and CTV. If the dose difference in the target volumes and OARs on the sCT exceeds the threshold, the pCT needs to be rescanned for re-planning. ΔD90, ΔD95, and ΔV95 are defined as the absolute differences of the corresponding DVH metrics between the pCT-based and sCT-based plans. We investigated the relationship between the DSC and the dose difference to evaluate the effect of anatomical changes on dose during radiotherapy. sCT images were rigidly registered to pCT with reference to bony landmarks. DSC-CTV, computed between the automatically delineated CTV on sCT images and the manually delineated CTV on pCT images, was correlated with the dosimetric metrics.
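The DVH metrics and their differences between the two plans can be illustrated with a short sketch. It assumes that each structure is given as a list of per-voxel doses in Gy with equal voxel volumes, and that the differences are absolute differences, as stated in the statistical analysis; the function names and dictionary keys are hypothetical.

```python
import math


def dose_at_volume(doses, volume_pct):
    """D_x: the minimum dose received by the hottest x% of the structure's voxels."""
    if not doses:
        raise ValueError("structure has no voxels")
    ranked = sorted(doses, reverse=True)
    # Small tolerance guards against floating-point round-up in the voxel count.
    k = max(1, math.ceil(len(ranked) * volume_pct / 100.0 - 1e-9))
    return ranked[k - 1]


def volume_at_dose(doses, threshold_gy):
    """Percent of the structure's voxels receiving at least threshold_gy."""
    if not doses:
        raise ValueError("structure has no voxels")
    return 100.0 * sum(1 for d in doses if d >= threshold_gy) / len(doses)


def dvh_deltas(pct_doses, sct_doses, prescription_gy):
    """Absolute differences in D90, D95, and V95 between the pCT and sCT plans."""
    threshold = 0.95 * prescription_gy
    return {
        "dD90": abs(dose_at_volume(sct_doses, 90) - dose_at_volume(pct_doses, 90)),
        "dD95": abs(dose_at_volume(sct_doses, 95) - dose_at_volume(pct_doses, 95)),
        "dV95": abs(volume_at_dose(sct_doses, threshold)
                    - volume_at_dose(pct_doses, threshold)),
    }
```

For example, with a 52.2 Gy prescription, a toy 20-voxel target in which three voxels drop to 48 Gy on the sCT loses 15 percentage points of V95, because 48 Gy is below the 49.59 Gy threshold.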

Clinical Evaluation
The reproducibility and robustness of the automatic delineation model were evaluated using DSC and HD95 in a multi-institutional test. The training set was from institution A. Of the 252 patients in the test set, 199 were from institution A and 53 were from institution B.
Clinical evaluation of the automatic delineation model was performed on an independent test set of 199 pCT scans and 23 sCT scans from institution A. The automatically delineated contours were reviewed by three groups (a, b, and c) at institution A on the basis of their clinical experience. Each group consisted of two radiation oncologists and was blinded to the ground-truth contours. Each contour was rated as acceptable with no corrections, acceptable with minor corrections, or unacceptable. Finally, the acceptable ratio of all the targets and OARs was analyzed.

Statistical Analysis
MATLAB (version 2018b, The MathWorks Inc, Natick, MA, USA) was used for statistical analysis. A t-test was used to test the statistical significance of the absolute difference of the dosimetric metrics between the two plans, and Spearman's rank correlation test was used to assess the correlation between DSC and ΔV95 or ΔD95 of the CTV. Two-sided p-values are reported, and p-values <0.05 were considered significant.
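The study used MATLAB for these tests; purely as an illustration of the rank correlation, the Python sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties. It does not compute a p-value, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A negative coefficient between DSC and a dose difference means that lower overlap tends to accompany a larger dose difference, without implying a linear or inversely proportional relationship.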

Geometric Evaluation
The ME, MAE, PSNR, and SSIM comparisons between the CBCT, synthetic CT (sCT), and planning CT (pCT) images are shown in Table 2. The average ME and MAE between CBCT and pCT images within the body were −37.71 ± 15.49 and 86.42 ± 10.12 HU, whereas those between sCT and pCT images were 8.46 ± 11.88 and 71.58 ± 8.78 HU. The mean SNUs for CBCT, sCT, and pCT were 9.22 ± 3.89%, 4.95 ± 4.13%, and 2.12 ± 0.85%, respectively. The HU values of the sCT images were much closer to those of the pCT images than the HU values of the CBCT images were. Compared with CBCT, sCT was more similar to pCT, with an ME closer to zero, a lower MAE, and higher PSNR and SSIM. The detailed comparison between pCT and sCT is shown as Supplementary Figure 3 in the Supplementary Material. Figure 2 shows examples of the ground truth and the automatically delineated contours on pCT and sCT images. On pCT, the automatic delineation of CTV and OARs is consistent with the manual delineation by human experts. On 2D sagittal sCT images, the concordance decreases at the upper and lower bounds of the CTV. The automatically drawn tumor bed (TB) on sCT is visibly larger than the manually drawn TB.
The test results of automatic delineation for the two institutions are shown in Table 3. The pCT results were calculated for 199 patients from institution A and 53 patients from institution B, and the sCT results for the 23 patients from institution A in the CBCT-to-CT synthesis test cohort. Good DSC and HD95 scores were found for most contours on pCT (DSC: 0.73–0.97, HD95: 2.22–9.36 mm). The performance was lower for the contours on sCT from institution A (DSC: 0.63–0.95, HD95: 2.30–19.57 mm). The mean DSC of CTV was 0.88 ± 0.03 for pCT and 0.83 ± 0.03 for sCT. The segmentation model was also effective for pCT from institution B (DSC: 0.70–0.97, HD95: 2.10–11.43 mm), where the mean DSC of CTV was 0.80 ± 0.06. The accuracy of automatic delineation was lower for the dataset from institution B than for that from institution A.
The clinical evaluation showed that the acceptable ratios of OARs, CTV, and TB were 76. 38 Table 4. Overall, the automatic delineation of CTV and OARs was judged clinically acceptable after minor corrections by medical groups a, b, and c.

Dosimetric Evaluation
The dose distribution and dose-volume histogram (DVH) of the plan on synthetic CT (sCT) and planning CT (pCT) are shown in   Within the body, the mean ± standard deviation gamma pass rates for the sCT images were 85.09 ± 6.28% and 91.46 ± 4.63% with the 2%/2 mm and 3%/3 mm criteria, respectively. DSC was negatively correlated with ΔV95 and ΔD95 (r = −0.52, p = 2.4075 × 10⁻¹¹ and r = −0.51, p = 4.5815 × 10⁻¹¹, respectively). When the DSC of CTV fell to 0.65, the V95 of CTV varied by more than 6% and the D95 by more than 3 Gy. When the DSC of CTV reached 0.80, the V95 of CTV varied by less than 4% and the D95 by less than 2 Gy, as shown in Figure 4.
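To make the 2%/2 mm and 3%/3 mm criteria concrete, the sketch below computes a gamma pass rate for a 1D dose profile. It is a simplified illustration, not the analysis used in this study: it assumes global normalization to the maximum reference dose, an exhaustive search over all evaluated points, no low-dose cutoff, and no interpolation between samples. The function name is hypothetical.

```python
def gamma_pass_rate(reference, evaluated, spacing_mm, dose_pct=3.0, dta_mm=3.0):
    """Percent of reference points whose gamma index is <= 1 (1D, global criterion)."""
    if not reference or len(reference) != len(evaluated):
        raise ValueError("profiles must be non-empty and of equal length")
    dose_tol = dose_pct / 100.0 * max(reference)  # global dose tolerance
    if dose_tol <= 0:
        raise ValueError("reference profile must contain a positive dose")
    passed = 0
    for i, d_ref in enumerate(reference):
        # Squared gamma: smallest combined distance and dose disagreement
        # between this reference point and any evaluated point.
        gamma_sq = min(
            ((j - i) * spacing_mm / dta_mm) ** 2 + ((d_eval - d_ref) / dose_tol) ** 2
            for j, d_eval in enumerate(evaluated)
        )
        if gamma_sq <= 1.0:
            passed += 1
    return 100.0 * passed / len(reference)
```

A point passes if some nearby evaluated point agrees in dose, so the same dose error can pass on a fine grid and fail on a coarse one; this is why the distance-to-agreement criterion is quoted together with the dose criterion.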
The dosimetric evaluation of the plans on sCT versus the original plans for 29 fractions is summarized in Table 5. The mean ΔV95 of CTV was less than 6%. The mean ΔD90/ΔD95 was less than 2 Gy/4 Gy for the CTV and less than 4 Gy/5 Gy for the tumor bed for all patients. We observed an absolute difference of more than 8% in the V95 of the TB; the poor delineation accuracy of the TB leads to large dosimetric errors. The mean ΔD95 of CTV and ΔV10 of the heart were 4.20 ± 1.45 Gy and 3.92 ± 3.29% in the left-sided patients and 2.84 ± 0.84 Gy and 1.60 ± 1.96% in the right-sided patients, respectively. The dosimetric difference of the target volume and heart was greater in left-sided patients than in right-sided patients. P-values of the dosimetric difference of the TB and CTV were below 0.05, and p-values of the dosimetric difference of the OARs were over 0.05.

DISCUSSION
Adaptive radiotherapy based on the CBCT acquired for patient setup is a promising approach for improving treatment accuracy (34). Liu et al. (35) developed a deep learning approach to generate CBCT-based synthetic CT images and validated the dose calculation accuracy for clinical use in CBCT-guided pancreatic adaptive radiotherapy. However, their work did not address sCT-based segmentation, which is also a key factor in adaptive radiotherapy. In our study, the synthetic CT image quality analysis of the thorax yielded an ME/MAE of 8.46/71.58 HU, better than the 29.6/94.2 HU reported by Eckl et al. (36). The SNU in synthetic CT was close to the SNU in planning CT. This indicates that the synthetic CT had sufficient quality for contour delineation and dose calculation. The automatic delineation model, built on a 3D Res-SE-U-Net and trained on planning CT images and RT structures, made full use of the 3D correlation between image slices. The model achieved a mean DSC of 0.88 for CTV on planning CT. Its performance was lower on synthetic CT because of the inaccurate HU values and the image artifacts caused by respiratory motion in synthetic CT. A clinically meaningful evaluation of the model should include not only geometric differences but also dosimetric assessment (37,38). In this study, the geometric and dosimetric differences between the contours on planning CT and the contours on synthetic CT were analyzed to assess the clinical impact of the changes in target volumes and OARs during radiotherapy. The interobserver variability is large for breast cancer contours, which makes exact delineation difficult (39). We improved the robustness of the model by enlarging the dataset to over 1,000 patients. The automatic delineation of CTV and OARs was more consistent with the manual delineation because of their regular shapes and locations.
In contrast, the tumor bed differed substantially between automatic and manual contours because its position varied greatly from patient to patient. The results of the multi-institutional test showed that the model is robust and accurate, although its performance on the dataset from institution B was lower than on that from institution A. Our approach was effective for dosimetric verification based on synthetic CT generated from CBCT, and ΔV95 and ΔD95 of CTV could be used as dosimetric metrics for deciding whether to rescan the pCT. DSC was correlated with ΔD95 and ΔV95 for CTV; however, the relationship was not inversely proportional. Poor DSC scores do not necessarily lead to inferior CTV dosimetry. When a low DSC resulted from an automatically delineated CTV that was too small, the dose variation between the automatically delineated CTV on sCT images and the manually delineated CTV on planning CT images was not large. We evaluated the absolute difference of the dosimetric metrics between the two plans. Geometric changes on synthetic CT have a greater impact on the cardiac dose difference in left breast cancer, so special attention needs to be paid to the cardiac dose for left breast cancer. ΔV95, ΔD95, and ΔD90 could be used as indicators for deciding whether to re-plan. The limitations of this study include the following: (1) Deformable registration from planning CT to CBCT was performed because of the different slice thicknesses and scanned areas of the two images. Even with deformable registration, it was difficult to align the anatomical structures in CBCT with the same structures in planning CT. The image quality of synthetic CT could be degraded by the registration errors, which affect the delineation accuracy of the model to a certain extent. (2) DSC-CTV was computed after rigid registration between synthetic CT and planning CT, and the limited registration accuracy could cause dosimetric uncertainties in the CTV. (3) The synthetic CT was generated from the CBCT of a single, arbitrary treatment day rather than from the entire treatment course, so the anatomical changes and dosimetric differences over the entire course of treatment delivery were not evaluated.
The clinically acceptable ratio for CTV delineation ranged from 70% to 83% among the different groups, indicating that there is no common objective standard for evaluating the delineation. Variability exists between observer groups, demonstrating that the difference between automatic and manual delineation depends not only on contouring routines and guidelines but also on personal preference. In the future, we hope to develop a universal model that can not only meet the quality requirements of multiple clinical institutions but also adapt to the personal preferences of each observer. The automatically generated contours could be carefully reviewed by the radiation oncologist and then used for treatment planning.

CONCLUSION
This study demonstrated that the developed approaches are capable of reliably generating target and OAR contours on pCT images and on daily sCT images generated from CBCT images, which could greatly accelerate the re-planning process and help meet the requirements of online plan adaptation. The automatic delineation model performed sufficiently well for most patients. The geometric and dosimetric differences between pCT and sCT images in fractionated radiotherapy need to be evaluated because of their high clinical significance for breast cancer adaptive radiotherapy.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the medical ethics committee of the Second Affiliated Hospital of Guangzhou University of Chinese Medicine. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.