CT-Based Pelvic T1-Weighted MR Image Synthesis Using UNet, UNet++ and Cycle-Consistent Generative Adversarial Network (Cycle-GAN)

Background: Computed tomography (CT) and magnetic resonance imaging (MRI) are the mainstay imaging modalities in radiotherapy planning. In MR-Linac treatment, manual annotation of organs-at-risk (OARs) and clinical volumes requires significant clinician interaction and is a major challenge. Currently, there is a lack of available pre-annotated MRI data for training supervised segmentation algorithms. This study aimed to develop a deep learning (DL)-based framework to synthesize pelvic T1-weighted MRI from a pre-existing repository of clinical planning CTs.
Methods: MRI synthesis was performed using UNet++ and a cycle-consistent generative adversarial network (Cycle-GAN), and the predictions were compared qualitatively and quantitatively against a baseline UNet model using pixel-wise and perceptual loss functions. Additionally, the Cycle-GAN predictions were evaluated through qualitative expert testing (4 radiologists), and a pelvic bone segmentation routine based on a UNet architecture was trained on synthetic MRI using CT-propagated contours and subsequently tested on real pelvic T1-weighted MRI scans.
Results: In our experiments, Cycle-GAN generated sharp images for all pelvic slices whilst UNet and UNet++ predictions suffered from poorer spatial resolution within deformable soft tissues (e.g. bladder, bowel). Qualitative radiologist assessment showed inter-expert variability in the test scores; the four radiologists correctly identified images as acquired or synthetic with 67%, 100%, 86% and 94% accuracy, respectively. Unsupervised segmentation of pelvic bone on T1-weighted images was successful in a number of test cases.
Conclusion: Pelvic MRI synthesis is a challenging task due to the absence of soft-tissue contrast on CT. Our study showed the potential of deep learning models for synthesizing realistic MR images from CT and for transferring cross-domain knowledge, which may help to expand training datasets for the development of MR-only segmentation models.


INTRODUCTION
Computed tomography (CT) is conventionally used for the delineation of the gross tumor volume (GTV) and subsequent clinical/planning target volumes (CTV/PTV), along with organs-at-risk (OARs), in radiotherapy (RT) treatment planning. The resultant contours allow optimization of treatment plans by delivering the required dose to PTVs whilst minimizing radiation exposure of the OARs by ensuring that spatial dose constraints are not exceeded. Magnetic resonance imaging (MRI) offers excellent soft-tissue contrast and is generally used in conjunction with CT to improve visualization of the GTV and surrounding OARs during treatment planning. However, manual definition of these regions is repetitive, cumbersome and may be subject to inter- and/or intra-operator variability (1). The recent development of the combined MR-Linac system (2) provides the potential for accurate treatment adaptation through online MR imaging acquired prior to each RT fraction. However, re-definition of contours for each MR-Linac treatment fraction requires approximately 10 minutes of downtime whilst the patient remains on the scanner bed, placing additional capacity pressures on clinicians wishing to adopt this technology.
Deep learning (DL) is a sub-category of artificial intelligence (AI), inspired by the human cognition system. In contrast to traditional machine learning approaches that use hand-engineered image-processing routines, DL is able to learn complex information from large datasets. In recent years, DL-based approaches have shown great promise in medical imaging applications, including image synthesis (3, 4) and automatic segmentation (5-7). DL could drastically accelerate delineation of the GTV and OARs in MR-Linac studies, yet a major hurdle remains the lack of large, pre-contoured MRI datasets for training supervised segmentation networks. One potential solution is to transfer knowledge from pre-existing RT planning repositories on CT to MRI in order to facilitate domain-adaptive segmentation (8). Previous studies have reported successful implementation of GANs in generating realistic CT images from MRI (3, 9-11) as well as MRI synthesis from CT in the brain (12). To date, few studies have investigated MRI synthesis in the pelvis. Dong et al. (13) proposed a synthetic MRI-assisted framework for improved multi-organ segmentation on CT. However, although the authors suggested that synthetic MR images improved segmentation results, the quality of synthesis was not investigated in depth. MR image synthesis from CT is a challenging task due to large soft-tissue signal intensity variations. In particular, MRI synthesis in the pelvis poses considerable difficulty because of geometrical differences in patient anatomy as well as unpredictable discrepancies in bladder and bowel contents.
In this study, we compare and contrast paired and unpaired generative techniques for synthesizing T1-weighted (T1W) MR images from pelvic CT scans as a basis for training algorithms for OAR and tumor delineation on acquired MRI datasets. We include in our analysis the use of state-of-the-art UNet (14) and UNet++ (15) architectures for paired training, testing two different loss functions [L1 and VGG-19 perceptual loss (16)], and compare our results with a Cycle-Consistent Generative Adversarial Network (Cycle-GAN) (17) for unpaired MR image synthesis. Subsequently, we evaluate our results through blinded assessment of synthetic and acquired images by expert radiologists, and demonstrate our approach for pelvic bone segmentation on acquired T1W MRI from a framework trained solely on synthetic T1W MR images with CT-propagated contours.

Patient Population and Imaging Protocols
Our cohort consisted of 26 patients with lymphoma who underwent routine PET/CT scanning (Gemini, Philips, Cambridge, United States) and whole-body T1W MRI (1.5T, Avanto, Siemens Healthcare, Erlangen, Germany) before and after treatment (see Table 1 for imaging protocols). From this cohort, image series with large axial slice angle mismatch between CT and MR images, and those from patients with metal implants, were excluded, leaving 28 paired CT/MRI datasets from 17 patients. The studies involving human participants were reviewed and approved by the Committee for

Model Architectures
We investigated three DL architectures for MR image synthesis: (i) UNet, (ii) UNet++, and (iii) Cycle-GAN. UNet is one of the most popular DL architectures for image-to-image translation, with initial applications in image segmentation (14). In essence, UNet is an auto-encoder with the addition of skip connections between the encoding and decoding sections to maintain spatial resolution. In this study, a baseline UNet model was designed consisting of 10 consecutive convolutional blocks (5 encoding and 5 decoding blocks), each using batch normalization and ReLU activation for CT-to-MR image generation (Figure 1A). Additionally, a UNet++ model with interconnected skip connection pathways, as described in (15), was developed with the same number of encoder-decoder sections and kernel filters as the baseline UNet (Figure 1B). UNet++ was reported to enhance performance (15); we therefore deployed this architecture to assess its capabilities for paired image synthesis.
GANs are state-of-the-art approaches for generating photo-realistic images based on the principles of game theory (18). In image synthesis applications, GANs typically consist of two CNNs, a generator and a discriminator. During training, the generator produces a synthetic image in the target modality from an input image of a different modality; the discriminator then attempts to classify whether the synthetic image is genuine. Training is successful once the generator is able to synthesize images that the discriminator is unable to differentiate from real examples. Progressive co-training of the generator and discriminator leads to learning of the global conditional probability distribution from input to target domain. In this study, a Cycle-GAN model (17) was developed to facilitate unpaired CT-to-MR and MR-to-CT learning. The baseline UNet model was used as the network generator, and the discriminator was composed of 5 blocks containing 2D convolutional layers followed by instance normalization and leaky ReLU activation. This technique offers the advantage that it does not require spatial alignment between training T1W MR and CT images. The high-level schematic of the Cycle-GAN network is shown in Figure 2.
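The cycle-consistency idea behind this unpaired setup can be illustrated with a minimal sketch. The two generators below are hypothetical linear stand-ins (not CNNs, and not the networks used in this study), and a plain L1 round-trip term is assumed for simplicity. The point is only that each domain is compared with its own round-trip reconstruction, so the CT and MR samples never need to be spatially aligned or even the same size.

```python
def l1(a, b):
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def g_ct_to_mr(ct, gain=0.8, offset=0.1):
    """Hypothetical stand-in for the CT-to-MR generator (a linear map, not a CNN)."""
    return [gain * v + offset for v in ct]


def f_mr_to_ct(mr, gain=1.25, offset=-0.125):
    """Hypothetical stand-in for the MR-to-CT generator (exact inverse of g by default)."""
    return [gain * v + offset for v in mr]


def cycle_consistency_loss(ct, mr, g, f):
    """Sum of L1 round-trip errors: CT -> MR -> CT and MR -> CT -> MR."""
    ct_cycled = f(g(ct))
    mr_cycled = g(f(mr))
    return l1(ct, ct_cycled) + l1(mr, mr_cycled)


def imperfect_f(mr):
    """A deliberately wrong MR-to-CT mapping, to show a non-zero cycle loss."""
    return f_mr_to_ct(mr, gain=1.0, offset=0.0)


# Unpaired toy intensities: different lengths, no correspondence between them.
ct_slice = [0.0, 0.25, 0.5, 1.0]
mr_slice = [0.1, 0.3, 0.9]

consistent = cycle_consistency_loss(ct_slice, mr_slice, g_ct_to_mr, f_mr_to_ct)
inconsistent = cycle_consistency_loss(ct_slice, mr_slice, g_ct_to_mr, imperfect_f)
print(consistent, inconsistent)
```

When the two mappings invert each other the loss is numerically zero; replacing one of them with a mismatched mapping raises it, which is the signal that drives both generators during training.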
For segmentation, we propose a framework that first generates synthetic T1W MR images from CT, propagates ground-truth CT contours and outputs segmentation contours on acquired T1W MR images. To examine the capability of our fully-automated DL framework for knowledge transfer from CT to MRI, we generate ground-truth contours of the bones using a   (19) and transfer them to synthetic MR images as a basis for our segmentation training. A similar UNet model to the architecture presented in Figure 1, with 5 convolutional blocks (convolution-batchnorm-dropout(p=0.5)-ReLU) in the encoding and decoding sections was developed to perform binary bone segmentation from synthetic MR images. The schematic of our proposed synthesis/segmentation framework is illustrated in Figure 3.

Image Preprocessing
In preparation for paired training, the corresponding CT and T1W MR slices from the anatomical pelvic station in each patient were resampled using a 2D affine transformation followed by non-rigid registration using multi-resolution B-Spline free-form deformation (loss = Mattes mutual information, histogram bins = 50, gradient descent line search optimizer parameters: learning rate = 5.0, number of iterations = 50, convergence window size = 10) (20). The resulting co-registered images were visually checked based on the alignment of rigid pelvic landmarks. In CT images, signal intensities outside the range -1000 to 1000 HU were truncated to limit the dynamic range. The T1W MR images were corrected using N4 bias-field correction to reduce inter-patient intensity variations and inhomogeneities (21), and signal intensities above 1500 (corresponding to infrequent high-intensity fatty regions) were truncated. Subsequently, the training images were normalized to intensity ranges (0,1) and (-1,1) prior to paired (UNet, UNet++) and unpaired (Cycle-GAN) training, respectively.
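The intensity truncation and normalization steps can be sketched as follows. This is a minimal illustration, not the study's code: it assumes min-max scaling against the truncation limits (the text does not state how the normalization bounds were chosen) and assumes a lower MR bound of 0. Registration and N4 bias-field correction are omitted, and the function names are hypothetical.

```python
def clip(values, lower, upper):
    """Truncate every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def rescale(values, in_min, in_max, out_min, out_max):
    """Linearly map values from [in_min, in_max] to [out_min, out_max]."""
    scale = (out_max - out_min) / (in_max - in_min)
    return [out_min + (v - in_min) * scale for v in values]


def preprocess_ct(hu_values, paired=True):
    """Clip CT to [-1000, 1000] HU, then normalize for paired or unpaired training."""
    clipped = clip(hu_values, -1000.0, 1000.0)
    # (0, 1) for UNet/UNet++ training, (-1, 1) for Cycle-GAN training.
    out_min, out_max = (0.0, 1.0) if paired else (-1.0, 1.0)
    return rescale(clipped, -1000.0, 1000.0, out_min, out_max)


def preprocess_mr(signal_values, paired=True):
    """Truncate T1W signal above 1500, then normalize (lower bound of 0 is assumed)."""
    clipped = clip(signal_values, 0.0, 1500.0)
    out_min, out_max = (0.0, 1.0) if paired else (-1.0, 1.0)
    return rescale(clipped, 0.0, 1500.0, out_min, out_max)


ct_row = [-1200.0, -1000.0, 0.0, 500.0, 3000.0]   # toy HU values
print(preprocess_ct(ct_row, paired=True))    # scaled into [0, 1]
print(preprocess_ct(ct_row, paired=False))   # scaled into [-1, 1]
```

Scaling against fixed limits rather than per-image extremes keeps a given HU value at the same normalized intensity in every slice, which is one reason this convention is commonly chosen for CT.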

Objective Functions
Common loss functions in image synthesis are the mean absolute error (MAE or L1) and mean squared error (MSE or L2) between the target domain and the synthetic output. However, such loss functions ignore complex image features such as texture and shape. Therefore, for the UNet/UNet++ models, we compared L1 loss in the image space with L1 loss calculated on features extracted from a previously trained object classification network, known as the "perceptual loss". For this purpose, the VGG-19 classification network was used (16), which is composed of 5 convolutional blocks and 19 weight layers overall, with features extracted from the "block conv2d" layer. For Cycle-GAN training, the difference between L1 and the structural similarity index (SSIM) (defined as L1 - SSIM) was used as the loss to govern the cycle consistency, whilst L1 and L2 losses were used for the generator and the discriminator, respectively. For segmentation training, the Dice loss (Equations 1 and 2) was used to perform binary segmentation of bone on MR images,
where A and B denote the generated and ground-truth contours.
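The Dice equations themselves are not reproduced in this text. A commonly used formulation consistent with the definitions of A and B is DSC(A, B) = 2|A ∩ B| / (|A| + |B|), with the loss taken as 1 - DSC(A, B). The sketch below implements that standard form for binary masks stored as flat lists; the smoothing term `eps` is an assumption added to avoid division by zero, and the exact variant used in the study may differ.

```python
def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice similarity between two binary masks given as flat sequences of 0/1."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # eps (an assumption) keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred_mask, true_mask, eps=1e-7):
    """Loss that is 0 for perfect overlap and approaches 1 for no overlap."""
    return 1.0 - dice_coefficient(pred_mask, true_mask, eps)


predicted = [0, 1, 1, 1, 0, 0]   # toy generated contour mask (A)
reference = [0, 0, 1, 1, 1, 0]   # toy ground-truth contour mask (B)
print(round(dice_coefficient(predicted, reference), 3))  # 2*2 / (3+3) -> 0.667
print(round(dice_loss(predicted, reference), 3))         # -> 0.333
```

In network training the same expression is usually applied to soft (probabilistic) predictions rather than hard 0/1 masks, so that the loss remains differentiable.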

Model Training and Evaluation
The dataset was split into 981, 150 and 116 images from 11, 3 and 3 patients for training, validation and testing, respectively. All models were trained for 150 epochs using the Adam optimizer (learning rate = 1e-4; UNet and UNet++ models: batch size = 5, Cycle-GAN: batch size = 1) on an NVIDIA RTX6000 GPU (Santa Clara, California, United States) (Table 2). During paired UNet/UNet++ training, the peak signal-to-noise ratio (PSNR), SSIM, L1 and L2 quantitative metrics, as described in (22), were recorded at each epoch for the validation images. The trained weights with the lowest validation loss were used to generate synthetic T1W MR images from the test CT images. Optimal weights from the Cycle-GAN model were selected based on visual examination of the network predictions on the validation data following each epoch. Subsequently, synthetic images from all models were evaluated quantitatively against the ground-truth acquired MR images using the above-mentioned imaging metrics. An additional qualitative test was designed to obtain unbiased clinical examination of predictions from the Cycle-GAN model. This test consisted of two sections: (i) to blindly classify randomly selected test images as synthetic or acquired, and outline the reasoning for the answers (18 synthetic and 18 acquired test MR slices), and (ii) to describe key differences between synthetic and acquired test T1W MR images when the input CT and ground-truth acquired MR images were also provided (10 sets of images from 3 test patients). This test was completed by 4 radiologists (two with <5 years and two with >5 years of

RESULTS
Quantitative assessment of synthetic T1W MR images from the validation dataset during paired algorithm training suggested that the UNet and UNet++ models with L1 loss displayed higher PSNR and SSIM, and lower L1 and L2 values, compared with the generated images from the UNet model with the VGG-19 perceptual loss (Figure 4). Quantitative analysis of synthetic images from the test patients revealed a similar trend for UNet and UNet++ model predictions and showed that the Cycle-GAN quantitative values were the lowest in all metrics except the SSIM, for which they were higher only than the UNet (VGG) predictions (Table 3). Moreover, qualitative evaluation of predictions from all models revealed a noticeable difference in sharpness (spatial resolution) between the images generated from paired (UNet and UNet++) and unpaired (Cycle-GAN) training. Although the UNet and UNet++ models generated relatively realistic predictions for pelvic slices consisting of fixed and bony structures (e.g. femoral heads, hip bone, muscles), they yielded blurry and unrealistic patches for deformable and variable pelvic structures such as bowel, bladder and rectum. In contrast, the Cycle-GAN model generated sharp images for all pelvic slices, yet a disparity in contrast was observed for soft tissues with large variability across training patient MRI slices (e.g. bowel content, gas in rectum and bowel, bladder filling) (Figure 5).
Our expert radiologist qualitative testing on Cycle-GAN predicted images showed inter-expert variability in scores from section one of the test, highlighting the differences in subjective decisions amongst the experts in a number of test images. Experts 1 and 2 (<5 years of experience) correctly identified 67% and 100% of the 36 test images, whilst experts 3 and 4 (>5 years of experience) correctly identified 86% and 94%. Hence, no particular correlation was observed between the percentage scores and the participants' years of experience (Figure 6A). Radiologist comments on the synthetic images (following unblinding) are presented in Figure 6B.
The bone segmentation results using our fully-automated approach showed that our proposed framework successfully performed unsupervised segmentation of the bone from acquired T1W MR images, without the requirement of any manually annotated regions of interest (ROIs). The outcomes from various pelvic slices across 8 patients from our in-house cohort are presented in Figure 7. The segmentation results from cases 5 to 8 were from patients not used in the synthesis and segmentation components of our framework. Test case 8 demonstrates the predicted bone contours from a patient with a metal hip implant.

DISCUSSION AND CONCLUSION
One major limitation in adaptive RT on the MR-Linac system is the need for manual annotation of OARs and tumors on patient scans for each RT fraction, which requires significant clinician interaction. DL-based approaches are promising solutions to automate this task and reduce the burden on clinicians. However, the development of these algorithms is hindered by the paucity of pre-annotated MRI datasets for training and validation. In this study, we developed paired and unpaired training for T1W MR image synthesis from pelvic CT scans as a data generative tool for training of segmentation algorithms for MR-Linac RT treatment planning. Our results suggested that the Cycle-GAN network generated synthetic images with the greatest visual fidelity across all pelvic slices whilst the synthetic images from UNet and UNet++ appeared less sharp, which is likely due to soft-tissue misalignments during the registration process. The observed disparity in contrast in Cycle-GAN images for bladder, bone marrow and bowel loops may be due to large variability in our relatively small training dataset. Although the direct impact of these contrast discrepancies on MRI segmentation performance is yet to be evaluated, the Cycle-GAN predictions appeared more suitable for CT contour propagation to synthetic MRI than UNet and UNet++ images due to distinctive soft-tissue boundaries and high-resolution synthesis.
Quantitative analysis of all model predictions indicated that the imaging metrics did not fully conform with the visual fidelity and apparent sharpness of the output images. This finding was in line with previous studies comparing paired and unpaired MRI synthesis (12, 22). CT-to-MR synthesis in the pelvis poses the considerable challenge of generating soft-tissue contrasts absent on acquired CT scans. Although quantitative metrics such as the PSNR, SSIM, L1 and L2 differences are useful measures when comparing images, they may not directly correspond to a photorealistic network outcome. This was evident in the quantitative evaluation of the images generated from the UNet and UNet++ models trained with L1 loss in the image space against UNet with VGG-19 perceptual loss and Cycle-GAN predictions. Therefore, expert clinician qualitative assessments may provide a more reliable insight into the performance of medical image generative networks. In this study, our expert evaluation test based on Cycle-GAN predictions suggested that, despite a number of suboptimal soft-tissue contrast predictions (e.g. urinary bladder filling, bone marrow, nerves), radiologists differed in their accuracy at distinguishing synthetic from acquired MR images. The fact that 3/4 radiologists were unable to accurately identify synthetic images in all cases highlights the capability of our model to generate realistic medical images that may be indistinguishable from acquired MRI.
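One possible mechanism for this mismatch between metrics and perceived sharpness can be shown with a toy one-dimensional example. The signals below are invented for illustration and are not taken from the study: a blurred edge centred on the true boundary scores better on L1, L2 and PSNR than a perfectly sharp edge displaced by two samples, even though the latter preserves the edge. The `max_gradient` helper is a crude, assumed proxy for sharpness.

```python
import math


def mae(a, b):
    """Mean absolute error (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a, b):
    """Mean squared error (L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB for signals scaled to [0, max_value]."""
    error = mse(a, b)
    return math.inf if error == 0 else 10.0 * math.log10(max_value ** 2 / error)


def max_gradient(signal):
    """Largest step between neighbouring samples, a crude proxy for edge sharpness."""
    return max(abs(signal[i + 1] - signal[i]) for i in range(len(signal) - 1))


target        = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]          # true tissue boundary
sharp_shifted = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]          # sharp edge, displaced
blurred       = [0.0, 0.0, 0.125, 0.375, 0.625, 0.875, 1.0, 1.0]  # smooth edge, centred

for name, prediction in (("sharp_shifted", sharp_shifted), ("blurred", blurred)):
    print(name,
          "L1 =", mae(target, prediction),
          "L2 =", mse(target, prediction),
          "PSNR =", round(psnr(target, prediction), 2),
          "max gradient =", max_gradient(prediction))
```

Here the blurred prediction halves the L1 error (0.125 versus 0.25) while its steepest step is only a quarter of the sharp edge, which is consistent with pixel-wise metrics favouring smooth outputs whenever exact spatial correspondence with the reference cannot be guaranteed.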
Our segmentation results demonstrated the capability of our fully-automated framework in segmenting bones on acquired MR images with no manual MR contouring. Domain adaptation offers significant clinical value in transferring knowledge from OARs previously contoured by experts on CT to MR-only treatment planning procedures. Additionally, it potentially enables the expansion of medical datasets, which are essential for training supervised DL models. Such a technique is also highly valuable outside the context of radiotherapy, as body MRI has increasing utility for monitoring patients with secondary bone disease from primary prostate (23) and breast (24) cancers, and multiple myeloma (25). Quantitative assessment of the response of these diseases to systemic treatment using MRI is hindered by the lack of automated skeletal delineation algorithms to monitor changes in large-volume disease regions (26).
GANs are notoriously difficult to train due to their large degree of application-based hyper-parameter optimization and non-standardized training techniques. However, this study showed that even when trained on relatively small datasets, GANs may have the potential to generate realistic images to overcome the challenge of medical image data shortage. Future studies will therefore investigate the performance of the proposed framework on larger datasets and alternative pelvic OARs, as well as exploring novel techniques to enforce targeted organ contrast during GAN and segmentation training. Additionally, future research will examine how sensitive performance is to the number of manual MRI contours required for training cross-domain DL algorithms.

DATA AVAILABILITY STATEMENT
The data analyzed in this study is subject to the following licenses/restrictions: The datasets presented in this article are not readily available due to patient confidentiality concerns. Requests to access these datasets should be directed to matthew.Blackledge@icr.ac.uk.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Committee for Clinical Research at the Royal Marsden Hospital. The patients/participants provided written informed consent to participate in this study.