Deep-learning-based generation of synthetic 6-minute MRI from 2-minute MRI for use in head and neck cancer radiotherapy

Background: Quick magnetic resonance imaging (MRI) scans with a low contrast-to-noise ratio are typically acquired for daily MRI-guided radiotherapy setup. However, for patients with head and neck (HN) cancer, these images are often insufficient for discriminating target volumes and organs at risk (OARs). In this study, we investigated a deep learning (DL) approach to generate high-quality synthetic images from low-quality images.

Methods: We used 108 unique HN image sets of paired 2-minute T2-weighted scans (2mMRI) and 6-minute T2-weighted scans (6mMRI). Ninety image sets (~20,000 slices) were used to train a 2-dimensional generative adversarial DL model that used 2mMRI as input and 6mMRI as output. Eighteen image sets were used to test model performance. Similarity metrics, including the mean squared error (MSE), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR), were calculated between normalized synthetic 6mMRI and ground-truth 6mMRI for all test cases. In addition, a previously trained OAR DL auto-segmentation model was used to segment the right parotid gland, left parotid gland, and mandible on all test case images. Dice similarity coefficients (DSC) were calculated between 2mMRI and either ground-truth 6mMRI or synthetic 6mMRI for each OAR; two one-sided t-tests were applied between the ground-truth and synthetic 6mMRI to determine equivalence. Finally, a visual Turing test using paired ground-truth and synthetic 6mMRI was performed by three clinician observers; the percentage of images that were correctly identified was compared to random chance using proportion equivalence tests.

Results: The median similarity metrics across whole images were 0.19, 0.93, and 33.14 for MSE, SSIM, and PSNR, respectively. The median DSCs comparing ground-truth vs. synthetic 6mMRI auto-segmented OARs were 0.86 vs. 0.85, 0.84 vs. 0.84, and 0.82 vs. 0.85 for the right parotid gland, left parotid gland, and mandible, respectively (equivalence p<0.05 for all OARs). The percentage of images correctly identified was equivalent to chance (p<0.05 for all observers).

Conclusions: Using 2mMRI inputs, we demonstrate that DL-generated synthetic 6mMRI outputs have high similarity to ground-truth 6mMRI, although further improvements can be made. Our study facilitates the clinical incorporation of synthetic MRI in MRI-guided radiotherapy.
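The image similarity metrics above (MSE, PSNR, and SSIM on normalized images) can be sketched in NumPy. For brevity, a single-window (global) SSIM is shown; library implementations such as scikit-image compute SSIM over a sliding window, so values will differ slightly.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of identical shape."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, data_range):
    """Peak signal-to-noise ratio in dB for the given intensity range."""
    return float(10.0 * np.log10(data_range ** 2 / mse(a, b)))

def ssim_global(a, b, data_range):
    """Global (single-window) SSIM with the standard stabilizing constants."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float((2 * mu_a * mu_b + c1) * (2 * cov + c2)
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))
```

For identical inputs, MSE is 0 and SSIM is exactly 1, which makes the functions easy to sanity-check before applying them to paired synthetic and ground-truth volumes.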

Overview of CycleGAN DL architecture. The general architecture consisted of a generator network (top) and a PatchGAN discriminator network (bottom). Layers are represented as colored rectangles, where K = kernel size and S = stride size. Each layer is followed by instance normalization and a leaky rectified linear unit activation function.
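A PatchGAN discriminator scores overlapping image patches rather than the whole image, and the effective patch size follows directly from the kernel and stride of each convolutional layer. A minimal sketch of that calculation, assuming the common 70×70 PatchGAN layer configuration (the kernel/stride values actually used are those annotated in the figure):

```python
def receptive_field(layers):
    """Receptive field (patch size) of a stack of conv layers.

    layers: list of (kernel, stride) pairs, ordered from input to output.
    Walking the layers in reverse, one output pixel covers
    rf*stride + (kernel - stride) input pixels of the previous layer.
    """
    rf = 1
    for k, s in reversed(layers):
        rf = rf * s + (k - s)
    return rf

# Assumed configuration of a standard 70x70 PatchGAN discriminator
# (three stride-2 layers followed by two stride-1 layers, all kernel 4).
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```

With this configuration, `receptive_field(patchgan_layers)` evaluates to 70, i.e., each discriminator output judges the realism of a 70×70 pixel patch of the input slice.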

Appendix C. 6-minute MRI vs. 2-minute MRI Initial Survey
As initial motivation towards developing a synthetic MRI deep learning model, we created a survey to gauge physician preferences for ground-truth 2-minute vs. ground-truth 6-minute scans. Neuroimaging Informatics Technology Initiative (NIfTI) formatted ground-truth 2-minute scans and 6-minute scans of the 18 test cases were randomly relabeled as either "Image A" or "Image B". These blinded images were provided to four physician observers to be visualized in 3D Slicer (9). Observers were free to alter the window width and level at their discretion.
Observers were instructed to document their preference ("Image A" or "Image B") in a spreadsheet for a set of regions of interest (ROIs): right parotid gland, left parotid gland, left submandibular gland, right submandibular gland, spinal cord, brainstem, mandible, primary tumor, and metastatic lymph node(s). Not all images had all ROIs present (i.e., some patients had gland or tumor resections). Observer preferences were remapped to the original image identifiers to determine which observers preferred 6-minute scans vs. 2-minute scans for each ROI (Figure C1). With the exception of one observer, who preferred glandular structures on 2-minute scans, all observers overwhelmingly preferred 6-minute scans over 2-minute scans for all ROIs.
Figure C1. Observer preferences for visualizing a variety of regions of interest (ROIs) on ground-truth 2-minute scans (red) vs. ground-truth 6-minute scans (green).

Appendix D. Additional Auto-segmentation Data
A previously trained head and neck cancer organ at risk (OAR) auto-segmentation model, initially developed on independent 2-minute MRI scans, was applied to the ground-truth 2-minute, ground-truth 6-minute, and synthetic 6-minute MRI scans in the test set. Examples of auto-segmented OARs overlaid on images and in 3D volumetric format for ground-truth 2-minute and ground-truth 6-minute scans in one case are shown in Figure D1.
Figure D1. Organ at risk auto-segmentation 3D representation and axial/coronal/sagittal views for ground-truth 2-minute (A) and ground-truth 6-minute (B) scans for one representative case where all structures were correctly contoured. The right parotid gland, left parotid gland, left submandibular gland, right submandibular gland, spinal cord, brainstem, and mandible are represented by the dark blue, green, light blue, orange, teal, brown, and purple structures, respectively. Visualizations were generated in 3D Slicer.
Interobserver variability (IOV) cutoffs for each OAR were determined for the Dice similarity coefficient (DSC) and average surface distance (ASD) from supporting data in work by McDonald et al. (10). For equivalence tests, we implemented the interquartile range (IQR) values as the minimum (-IQR) and maximum (+IQR) equivalence bounds. A table of the estimated values is shown below (Table D1). In the main manuscript we do not include the OARs whose median values between ground-truth 6-minute and ground-truth 2-minute scans do not cross the corresponding IOV median value (i.e., lower than the threshold for DSC or higher than the threshold for ASD), as these structures would likely not be clinically acceptable, i.e., the spinal cord, brainstem, and left/right submandibular glands. However, for completeness, we show the full bar plot representations for all OARs below (Figure D2). DSC and ASD equivalence tests (two one-sided t-tests) were nonsignificant (p > 0.05) for the spinal cord, brainstem, left submandibular gland, and right submandibular gland. Finally, we also show DSC and ASD bar plots for all OARs for direct comparisons of ground-truth to synthetic 6-minute images in Figure D3.
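The DSC comparison and the two one-sided t-tests (TOST) for equivalence can be sketched as follows. This is a minimal illustration, not the study's analysis code: `low` and `upp` stand in for the -IQR/+IQR equivalence bounds, and the function names are assumptions.

```python
import numpy as np
from scipy import stats

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def tost_paired(x, y, low, upp):
    """Two one-sided paired t-tests for equivalence of means within (low, upp).

    Returns the TOST p-value (the larger of the two one-sided p-values);
    p < 0.05 supports the mean paired difference lying inside the bounds.
    """
    d = np.asarray(x, float) - np.asarray(y, float)
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    p_low = 1.0 - stats.t.cdf((d.mean() - low) / se, n - 1)  # H0: diff <= low
    p_upp = stats.t.cdf((d.mean() - upp) / se, n - 1)        # H0: diff >= upp
    return max(p_low, p_upp)
```

In practice, `x` and `y` would hold per-case DSC (or ASD) values for the ground-truth and synthetic 6-minute scans, respectively.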

Appendix E. Additional Image Similarity Data
Image similarity metrics were calculated without bias field correction, with N4 bias field correction before application of a sharpening kernel (main results described in the manuscript), and with N4 bias field correction after application of a sharpening kernel. Generally, metrics improved slightly or remained similar with N4 bias field correction and worsened after application of the sharpening kernel. In addition to the similarity metric analysis, we performed a preliminary investigation of radiomic features for synthetic vs. ground-truth images. Specifically, we sought to determine the reliability/repeatability of various radiomic feature classes for the previously segmented OARs on the synthetic images generated by the model described in the main manuscript (with N4 bias field correction before application of the sharpening kernel). OAR segmentations on 2-minute scans were resampled to the ground-truth 6-minute and synthetic 6-minute scans using a nearest neighbor interpolator. As before, auto-segmented structures that were not present for a given patient (i.e., glands post-resection) were not included in the analysis. Radiomic feature extractions were performed on z-score normalized images. Using the open-source toolkit PyRadiomics (11), we extracted the standard default features from first order statistics (firstorder; 19 features), the grey level co-occurrence matrix (glcm; 24 features), the gray level run length matrix (glrlm; 16 features), the gray level size zone matrix (glszm; 16 features), the neighbouring gray tone difference matrix (ngtdm; 5 features), and the gray level dependence matrix (gldm; 14 features) from OARs on ground-truth 6-minute and synthetic 6-minute scans. The default PyRadiomics extraction parameters, e.g., fixed bin width, were applied as recommended. Shape features were not extracted since the OAR mask was the same between the scans. We utilized the two-way mixed effects, consistency, single rater/measurement intraclass correlation coefficient (ICC) provided by the pingouin Python package (12) to calculate ICC values for each feature class/OAR combination.
ICC targets were individual patients, raters were the different images (ground-truth, synthetic), and ratings were the OAR radiomic feature values. ICC values less than 0.5 were categorized as non-reliable, while ICC values greater than or equal to 0.5 were categorized as reliable. The ICC results stratified by OAR and radiomic feature category are shown in Figure E1. Generally, a greater number of firstorder features were considered reproducible than non-reproducible for most OARs. For gldm and glrlm, a smaller proportion of features were considered reproducible for most OARs. For ngtdm, only a small number of features for the spinal cord and brainstem were considered reproducible. Finally, for glcm and glszm, no features were considered reproducible. Future work should investigate in greater depth the utility of using synthetic images for radiomic feature calculation in MRI-guided adaptive radiotherapy workflows.
Figure E1. Radiomic feature reliability/repeatability on synthetic scans compared to ground-truth scans, stratified by region of interest (ROI) and feature category. firstorder = first order statistics, glcm = grey level co-occurrence matrix, glrlm = gray level run length matrix, glszm = gray level size zone matrix, ngtdm = neighbouring gray tone difference matrix, gldm = gray level dependence matrix.
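The ICC variant described above, ICC(3,1) in the Shrout-Fleiss convention (two-way mixed effects, consistency, single measurement), can be computed directly from two-way ANOVA mean squares. A NumPy sketch, with patients (targets) as rows and the two images (raters) as columns; in the study this was computed with pingouin rather than by hand:

```python
import numpy as np

def icc3_consistency(y):
    """ICC(3,1): two-way mixed effects, consistency, single rater/measurement.

    y: (n_targets, k_raters) array, e.g., one radiomic feature value per
    patient (rows) on the ground-truth and synthetic images (columns).
    ICC3 = (MS_rows - MS_error) / (MS_rows + (k-1) * MS_error).
    """
    y = np.asarray(y, float)
    n, k = y.shape
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

Because this is the consistency form, a constant offset between the ground-truth and synthetic columns does not lower the ICC; only inconsistent rank ordering across patients does.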

Appendix F. Turing Test Additional Data
The Turing test was initially performed with raw image outputs and subsequently after the application of a simple sharpening kernel. The same expert physician observers were given re-randomized slice representations of the same cases one week after the initial Turing test. Only the results after application of the sharpening kernel are displayed in the main manuscript. For completeness, we display the results of the Turing test without the sharpening kernel below. As opposed to the results with the application of the sharpening kernel, the original outputs were often distinguishable due to a slight systematic blurring effect. Table F1 shows the Turing test and clinician preference results, while Figure F1 shows the stratified clinician preference results.
Table F1. Turing test and image preference results for three expert physician observers before application of the sharpening kernel. Each observer was asked to determine the image identity of blinded paired ground-truth (GT) or synthetic 6-minute MRI scan slices in a randomized fashion and also provide their preference. Two one-sided tests for two proportions were applied to determine whether observer estimates were equivalent to chance.
Figure F1. Clinician image preferences stratified by region for the Turing test before application of the sharpening kernel. Green = ground-truth 6-minute MRI slice, yellow = synthetic 6-minute MRI slice.
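The equivalence-to-chance test can be sketched as a normal-approximation TOST on the observed proportion of correctly identified slices. This is an illustrative sketch only: the equivalence margin below is an assumption, not the value used in the study.

```python
import numpy as np
from scipy import stats

def tost_proportion_vs_chance(correct, total, margin=0.1, p0=0.5):
    """Two one-sided z-tests: is the proportion of correctly identified
    images equivalent to chance (p0) within +/- margin?

    Uses the normal approximation to the binomial; the margin here is
    illustrative. Returns the TOST p-value (max of the one-sided p-values).
    """
    phat = correct / total
    se = np.sqrt(phat * (1.0 - phat) / total)
    p_low = 1.0 - stats.norm.cdf((phat - (p0 - margin)) / se)  # H0: p <= p0 - margin
    p_upp = stats.norm.cdf((phat - (p0 + margin)) / se)        # H0: p >= p0 + margin
    return max(p_low, p_upp)
```

An observer scoring near 50% yields a small TOST p-value (equivalent to guessing), whereas an observer who reliably distinguishes synthetic from ground-truth slices does not.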
Additionally, observers were instructed to provide comments where desired to indicate specific reasons for categorizing images as either ground-truth or synthetic. The raw comments for the Turing test before (Table F2) and after (Table F3) application of the sharpening kernel are shown below for each observer.