Predictive Values of MRI and PET Derived Quantitative Parameters for Patterns of Failure in Both p16+ and p16– High Risk Head and Neck Cancer

Purpose: FDG-PET adds to clinical factors, such as tumor stage and p16 status, in predicting local (LF), regional (RF), and distant failure (DF) in poor prognosis locally advanced head and neck cancer (HNC) treated with chemoradiation. We hypothesized that MRI-based quantitative imaging (QI) metrics could add to clinical predictors of treatment failure more significantly than FDG-PET metrics. Materials and methods: Fifty-four patients with poor prognosis HNCs who were enrolled in an IRB-approved prospective adaptive chemoradiotherapy trial were analyzed. MRI-derived gross tumor volume (GTV), blood volume (BV), and apparent diffusion coefficient (ADC) pre-treatment and mid-treatment (fraction 10), as well as pre-treatment FDG-PET metrics, were analyzed in primary and individual nodal tumors. Cox proportional hazards models for prediction of locoregional failure (LRF)-free and DF-free survival were used to test the additional value of QI metrics over dominant clinical predictors. Results: The mean ADC pre-RT was significantly higher, and its change rate mid-treatment significantly lower, in p16– than in p16+ primary tumors. A Cox model identified that high mean ADC pre-RT carried a high hazard for LF and RF in p16– but not p16+ tumors (p = 0.015). Most interestingly, persisting subvolumes of low BV (TVbv) in primary and nodal tumors mid-treatment carried a high risk for DF (p < 0.05). Also, total nodal GTV mid-treatment, mean/max SUV of FDG in all nodal tumors, and total nodal TLG were predictive for DF (p < 0.05). When including clinical stage (T4/N3) and total nodal GTV in the model, all nodal PET parameters had a p-value of >0.3, and only TVbv of primary tumors had a p-value of 0.06. Conclusion: MRI-defined biomarkers, especially persisting subvolumes of low BV, add predictive value to clinical variables and compare favorably with FDG-PET imaging markers. MRI could be well integrated into the radiation therapy workflow for treatment planning, response assessment, and adaptive therapy.


INTRODUCTION
Locoregional failure (LRF) remains a clinical challenge for poor prognosis locally advanced squamous cell carcinoma of the head and neck (HNSCC) treated with definitive chemoradiation therapy (CRT) (1). It is important to find imaging markers of LRF that identify the patients and tumor subvolumes that may benefit from intensified locoregional therapy in the form of radiation boost, targeted systemic therapy, or surgical intervention.
Despite this progress, it has been difficult to determine which imaging biomarkers should be used to individualize treatment for patients with locally advanced HNSCC. Most head and neck cancer imaging studies to date include heterogeneous populations of various disease sites, stages, and prognoses. Few imaging studies investigate how p16 status affects imaging parameters pre- and mid-treatment. With respect to ADC in particular, no study to date has evaluated ADC changes during RT for p16+ vs. p16– tumors. A single study investigated ADC differences between HPV+ and HPV– HNSCC, including only 6 HPV+ patients (8%), and found that pre-treatment ADC in HPV+ HNSCC patients was significantly lower than in HPV– patients (19). Furthermore, at the tumor and subtumor level, there is no report on imaging biomarker differences between tumors with local, regional, or distant failure as the site of first failure and tumors in disease-free patients. This is an important issue, as it would help stratify patients for local or systemic intensified or de-intensified therapy. Finally, poorly perfused tumor subvolumes are largely spatially distinct from areas of high FDG uptake and highly restricted water diffusion in the same patients, and the spatial correlation between high glucose metabolism and highly restricted water diffusion varies greatly from patient to patient (20,21). These studies raise the question of whether both FDG-PET and MRI biomarkers are necessary to guide adaptive RT in HNSCC.
This study aimed to (1) investigate the effects of p16 status on imaging parameters and their early response rates; (2) assess differences between imaging biomarkers of tumors with local, regional, or distant progression and those with no evidence of disease (NED); and (3) compare the predictive values of MRI and PET biomarkers. We hypothesized that p16 status could affect imaging biomarkers and their early response rates, and that MRI-based QI metrics could add to clinical predictors of treatment failure more significantly than FDG-PET metrics for local, regional, and distant failure.

Patients
Imaging analysis was performed on 54 patients [median age of 61 years; 7 females; 31 p16+ (57%)] with advanced HNSCC who were enrolled in a randomized phase II clinical trial between March 2014 and January 2018 (Table 1). The trial was approved by the Institutional Review Board of the University of Michigan, including a parallel imaging study to investigate the predictive values of QI metrics for tumor progression. Written consent was obtained from all enrolled patients. Eligibility included patients with p16+ T4/N3 squamous cell carcinoma of the oropharynx or locally advanced p16– HNSCC if planned to undergo definitive CRT. All patients were evaluated for p16 status by immunohistochemistry. After completion of CRT, patients were followed up every 2-3 months per standard care for oncologic outcomes as well as toxicity. Tumor recurrences were scored as LF, RF, or DF, or a combination thereof.

MRI and PET Acquisition
Patients underwent FDG-PET/CT scans pre-RT within 4 weeks of RT as a part of standard care. Clinical FDG-PET/CT scans were performed on various PET scanners by following the standard clinical protocol (22).
MRI scans were acquired pre-RT (within 2 weeks) and at fraction 10 (20 Gy) as a part of the protocol. All MRI scans were acquired on a 3T scanner (Skyra, Siemens Healthineers), including anatomic, diffusion weighted (DW), and DCE T1-weighted imaging series. All patients were scanned in the treatment position using an individual-patient immobilization 5-point mask and bite block or aquaplast mold as required for treatment. DW images were acquired with spatial resolution of ∼1.2 × 1.2 × 4.8 mm and b-values of 50 and 800 s/mm² by either a 2D spin-echo single shot echo-planar pulse sequence or a readout segmentation of long variable echo-trains (RESOLVE) pulse sequence that reduced geometric distortion (23). Sixty T1-weighted DCE image volumes were acquired using a 3D gradient echo pulse sequence in a sagittal orientation with voxel size ∼1.5 × 1.5 × 2.5 mm during an injection of one standard dose of Gd-DTPA. Post-Gd T1-weighted images were acquired in the axial plane with spatial resolution of 0.875 × 0.875 × 3.3 mm by a 2D fast spin echo sequence with fat saturation.

Image Analysis and Registration
Blood volume (BV) maps were quantified from DCE-MRI using the modified Tofts model implemented in an in-house imFIAT Analysis Tool, which was validated using a digital reference object (24). ADC maps were calculated from DW images with b-values of 50 and 800 s/mm², chosen to mitigate the perfusion effect, using in-house software that was technically validated in a QIN collaborative project (25). Since the individual-patient immobilization devices dramatically reduced gross movement of the head and neck during scanning, BV and ADC maps were reformatted to match, voxel by voxel, the post-Gd T1-weighted images acquired in the same session using coordinates in the DICOM headers. SUV of FDG-PET was calculated. Pre-RT FDG-PET/CT and mid-treatment MR images were co-registered to pre-RT post-Gd T1-weighted images using rigid body transformation and mutual information. Target displacement errors between image series, including image mis-registration and geometric distortion in ADC maps, were assessed and reported previously (20). Reproducibility of BV maps was 16%, as reported previously (26).
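As an illustration of the two-point ADC calculation, the following minimal Python sketch assumes a mono-exponential signal decay between the two b-values, S(b) = S0·exp(−b·ADC). The in-house software is not described in detail here, so the function names (adc_two_point, adc_map) and the handling of non-positive signals are hypothetical.

```python
import math

def adc_two_point(s_low, s_high, b_low=50.0, b_high=800.0):
    """Mono-exponential ADC in mm^2/s from signals at two b-values (s/mm^2)."""
    if s_low <= 0 or s_high <= 0:
        # Background or noise voxel: no valid estimate (an assumption of this sketch).
        return float("nan")
    return math.log(s_low / s_high) / (b_high - b_low)

def adc_map(img_low, img_high, b_low=50.0, b_high=800.0):
    """Apply the two-point estimate voxel by voxel to flattened images."""
    return [adc_two_point(a, b, b_low, b_high) for a, b in zip(img_low, img_high)]

# Synthetic voxel with a true ADC of 1.0 x 10^-3 mm^2/s.
true_adc = 1.0e-3
s50 = 1000.0 * math.exp(-50.0 * true_adc)
s800 = 1000.0 * math.exp(-800.0 * true_adc)
estimate = adc_two_point(s50, s800)
```

Using a non-zero lower b-value removes most of the fast-decaying perfusion component from the estimate, which is the motivation given above for the b = 50 and 800 s/mm² pair.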

Tumor Volumes and Subvolumes
Gross tumor volume (GTV) of primary and nodal disease was contoured individually on post-Gd T1-weighted images by treating attending head and neck radiation oncologists and reviewed by the trial PI (MM). For this cohort of patients with locally advanced HNSCC, gross cystic or necrotic regions and tumor invasion into blood vessels occurred in many tumors, and these regions were therefore excluded from the GTVs for the subsequent analyses of quantitative imaging (QI) metrics by applying simple thresholds. For the ADC analysis, a threshold of >2.7 × 10⁻³ mm²/s (10% below free water diffusion) was used to exclude gross necrosis and blood vessels, and a threshold of <0.0001 × 10⁻³ mm²/s was used to exclude air. Then, a low BV subvolume of the GTV (TVbv) was created using a threshold of BV <7.64 ml/100 g, reported previously based upon a histogram analysis (16). The low ADC subvolume of the GTV (TVadc) was defined as ADC <1.2 × 10⁻³ mm²/s based on an ADC-histogram analysis (20), which is also consistent with the mean ADC reported by others (21). A metabolic tumor volume (MTV) was defined as FDG SUV >50% of a value averaged over the 4 voxels with maximum SUVs (MTV50).
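The thresholding described above can be sketched as follows. This is a minimal illustration, not the trial software: the function name subvolumes and the flat per-voxel lists are hypothetical, and the sketch assumes that the necrosis/vessel/air exclusion derived from ADC is applied before both the low BV and the low ADC subvolumes are counted.

```python
# Thresholds taken from the text; ADC in units of 10^-3 mm^2/s, BV in ml/100 g.
ADC_NECROSIS = 2.7   # above this: gross necrosis or blood vessel
ADC_AIR = 0.0001     # below this: air
ADC_LOW = 1.2        # low ADC subvolume threshold
BV_LOW = 7.64        # low BV subvolume threshold

def subvolumes(adc, bv, voxel_cc):
    """Return analysed GTV, low BV and low ADC subvolumes (cc) for one tumor.

    adc, bv: per-voxel values inside the contoured GTV, on the same grid.
    voxel_cc: volume of one voxel in cc.
    """
    # Voxels kept for analysis: not necrosis/vessel and not air.
    valid = [not (a > ADC_NECROSIS or a < ADC_AIR) for a in adc]
    n_gtv = sum(valid)
    n_low_bv = sum(1 for ok, b in zip(valid, bv) if ok and b < BV_LOW)
    n_low_adc = sum(1 for ok, a in zip(valid, adc) if ok and a < ADC_LOW)
    return {
        "gtv": n_gtv * voxel_cc,
        "tv_bv": n_low_bv * voxel_cc,
        "tv_adc": n_low_adc * voxel_cc,
    }

# Five-voxel toy tumor: voxel 3 is necrotic (ADC 3.0), voxel 4 is air (ADC 0.0).
result = subvolumes(
    adc=[0.9, 1.5, 3.0, 0.0, 1.1],
    bv=[5.0, 10.0, 2.0, 1.0, 8.0],
    voxel_cc=0.5,
)
```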

Quantitative Imaging Metrics
QI metrics in tumor volumes and their mid-treatment changes were analyzed for prediction of LF, RF, and DF. Tumor volume metrics included GTV, TVbv, TVadc, and MTV50. Mean values of ADC and BV in the GTV excluding blood vessels and necrosis, mean and max SUVs in MTV50, and total lesion glycolysis (TLG) of MTV50 were calculated for each primary or nodal tumor as well as for all tumors in each patient.
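A minimal sketch of the PET metrics follows, assuming the MTV threshold defined in the previous section (50% of the average of the 4 highest-SUV voxels) and the conventional definition of TLG as mean SUV multiplied by the metabolic volume. The function name mtv50_metrics and the flat list of SUVs are hypothetical.

```python
def mtv50_metrics(suv, voxel_ml, n_peak=4):
    """Return MTV (ml), mean/max SUV and TLG for one tumor.

    suv: per-voxel SUVs inside the tumor contour.
    voxel_ml: volume of one PET voxel in ml.
    """
    n_peak = min(n_peak, len(suv))
    # Reference value: average over the n_peak voxels with the highest SUV.
    peak = sum(sorted(suv, reverse=True)[:n_peak]) / n_peak
    inside = [s for s in suv if s > 0.5 * peak]
    mtv = len(inside) * voxel_ml
    mean_suv = sum(inside) / len(inside)
    return {
        "mtv": mtv,
        "mean_suv": mean_suv,
        "max_suv": max(inside),
        "tlg": mean_suv * mtv,  # conventional TLG = mean SUV x volume
    }

# Toy tumor: reference value is 10, so the threshold is SUV > 5.
pet = mtv50_metrics([10.0, 10.0, 10.0, 10.0, 6.0, 4.0, 2.0], voxel_ml=1.0)
```

Averaging over several peak voxels rather than taking the single maximum makes the threshold less sensitive to noise in one voxel.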

Treatment
The patients were randomized to a standard arm of RT (70 Gy in 35 fractions) or an experimental arm. In the experimental arm, a union of the persisting TVbv pre-RT to 2 weeks and the persisting TVadc pre-RT to 2 weeks received 2.5 Gy per fraction for the last 15 of 35 fractions. If the union of persisting subvolumes pre-RT to 2 weeks was <1 cc, the patient was entered into an observation arm and treated by the standard RT (70 Gy in 35 fractions). Patients were planned to receive weekly cisplatin 40 mg/m², and patients considered to be cisplatin ineligible were treated with weekly carboplatin AUC2.
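The boost-target rule can be illustrated with the short sketch below. It assumes, for illustration only, that a "persisting" subvolume is the set of voxels below threshold at both pre-RT and 2 weeks on co-registered grids; the exact operational definition used in the trial is not spelled out here. The function name boost_target and the boolean-mask inputs are hypothetical.

```python
def boost_target(low_bv_pre, low_bv_mid, low_adc_pre, low_adc_mid,
                 voxel_cc, min_cc=1.0):
    """Return (union mask, union volume in cc, boost eligibility).

    Each mask is a list of booleans on the same co-registered voxel grid.
    """
    # Assumed definition: persisting = low at pre-RT AND low at 2 weeks.
    persist_bv = [a and b for a, b in zip(low_bv_pre, low_bv_mid)]
    persist_adc = [a and b for a, b in zip(low_adc_pre, low_adc_mid)]
    union = [p or q for p, q in zip(persist_bv, persist_adc)]
    volume = sum(union) * voxel_cc
    # Per the text, a union smaller than 1 cc sends the patient to observation.
    return union, volume, volume >= min_cc

masks = dict(
    low_bv_pre=[True, True, False, False],
    low_bv_mid=[True, False, False, False],
    low_adc_pre=[False, True, True, False],
    low_adc_mid=[False, True, False, False],
)
union_small, vol_small, boost_small = boost_target(voxel_cc=0.4, **masks)
union_large, vol_large, boost_large = boost_target(voxel_cc=0.5, **masks)
```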

Statistical Analysis
First, we assessed the p16 effect on imaging parameters and on parameter change rates at 2 weeks compared to pre-RT using the Mann-Whitney U-test. Second, we assessed whether MRI and PET biomarkers had similar predictive values for LRF-free and DF-free survival. For the analysis of LRF, most previous analyses considered either LF, RF, or LRF as a patient-level event, an approach that is useful for stratifying patients but not for stratifying individual tumors for intensified adaptive RT. Tumor progression could occur in one or a few treated tumors (primary or nodal) or in none. Therefore, we applied Cox proportional hazards models to individual (primary or nodal) tumors for prediction of failure. The individual tumor failure free rate (ITFFR) was defined from the start of RT to the date of progression of the tested (primary or nodal) tumor. ITFFR times were censored for all tumors from a patient at the earliest of DF, death, or last follow-up. Whether primary and nodal tumors could be analyzed together was tested for each imaging parameter. To compare the predictive values of MRI and FDG-PET biomarkers, imaging metrics were assessed one at a time in models that also included p16 as a co-variable, which is the most important clinical variable for LRF (27)(28)(29). Distant failure free survival (DFFS) was defined as the time interval from the start of RT to the date of DF. The Cox models for DFFS were fitted including a single QI metric and clinical stage T4/N3 vs. other (non-T4/N3) as the sole clinical variable (30)(31)(32), entering one imaging parameter at a time. Each QI metric was summed over all nodal tumors for volume-related metrics or averaged over all nodal tumors for intensity-related metrics. In the DFFS model, patients were censored at the first occurrence of any local or regional failure, death, or last follow-up.
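The per-tumor censoring rule for ITFFR can be made concrete with a small sketch that turns follow-up data into the (time, event) pairs a Cox model expects. The function name itffr_record, the use of months since the start of RT, and the treatment of ties (progression on the same date as the censoring event counts as an event) are assumptions made for illustration.

```python
def itffr_record(progression, distant_failure, death, last_followup):
    """Return (time, event) for one tumor, all times in months from start of RT.

    Arguments are None when the corresponding event did not occur;
    last_followup must always be given.
    """
    candidates = (distant_failure, death, last_followup)
    # Censoring time: earliest of DF, death or last follow-up for the patient.
    censor = min(t for t in candidates if t is not None)
    if progression is not None and progression <= censor:
        return progression, 1   # this tumor progressed: event
    return censor, 0            # otherwise censored

# Tumor that progressed at 8 months in a patient followed for 24 months.
rec_event = itffr_record(8, None, None, 24)
# Controlled tumor in a patient with DF at 12 months who died at 20 months.
rec_df = itffr_record(None, 12, 20, 20)
# Tumor progression recorded after DF: censored at the DF date.
rec_late = itffr_record(15, 12, None, 30)
```

Because every tumor from the same patient shares one censoring time, tumors within a patient are not independent observations, which should be kept in mind when interpreting tumor-level models.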
If there were any significant differences in imaging parameters between p16– and p16+ tumors, we considered an interaction term in the Cox model or an analysis in separate Cox models as appropriate. Since multiple comparisons were made, p-values were corrected using false discovery rate (FDR) control, and corrected p-values <0.10 were considered significant. Finally, we assessed whether there were any significant differences in imaging biomarkers between the tumors that never progressed, those that demonstrated local or regional progression, and those that were locoregionally controlled but metastasized distantly. This landmark analysis used outcomes at 18 months as a cutoff. Tumors were excluded from the analysis if they had local or regional progression after 18 months, or if they had no progression but follow-up was shorter than 18 months. As the data were not Gaussian distributed, non-parametric tests were used: the Kruskal-Wallis test for the three-group comparison and the Wilcoxon rank test for the comparison between local or regional failure and NED. The p-values were corrected with FDR control, and values <0.1 were considered significant. Since 37% of the patients received higher doses, we tested the dose effect before performing the proposed analyses.
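The specific FDR procedure is not named above. As a minimal sketch, the widely used Benjamini-Hochberg step-up adjustment is shown below; treating it as the method used in this study is an assumption, and the function name bh_adjust and the example p-values are illustrative only.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]        # illustrative values, not study data
adj = bh_adjust(raw)
significant = [q < 0.10 for q in adj]  # threshold stated in the text
```

With these illustrative inputs, the three smallest p-values remain below the 0.10 threshold after adjustment and the largest does not.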

Treatment Failure
This cohort of 54 patients with locally advanced HNSCC had large primary GTVs with a median value of 60.5 cc (range: 10.2-595.2 cc; SD: 86.8 cc; Table 1), which was several times greater than in most reported studies (2-6, 9-11, 33). Eleven patients (20%) (3 p16+) have had local recurrence. Nine patients (17%) (2 p16+) have had regional recurrence, including one patient (p16–) who failed regionally at two separate treated lymph node locations, and 2 (1 p16– and 1 p16+) who had RF at the locations of nodes that were non-enlarged and non-FDG avid before RT. Fourteen patients (7 p16+) had distant failure with or without local and regional failure. All cases with LF or RF alone were confirmed pathologically, and distant metastases were diagnosed pathologically or by overt radiographic presentation. Twelve patients have died of HNC (3 p16+), and one patient died cancer-free of other causes. For the patients who did not have progression at the time of analysis, median follow-up was 24 months (range: 10-58 months).

Predictive Values of MRI and PET Imaging Parameters for Local and Regional Progression
First, we have not yet detected a significant difference in local and regional control rates between the two dose arms, so the patients who received different doses were analyzed together. For prediction of local progression, mean ADC pre-RT of primary tumors was the only parameter found to be significant in a univariate Cox model. Since there was no significant difference in mean ADC between primary and nodal tumors, we combined primary and nodal tumors in a single model (53 primary tumors and 82 nodal tumors). For prediction of ITFFR, considering the p16 effect on mean ADC of primary tumors, the Cox model included p16 status, pre-RT mean ADC, and the interaction of pre-RT mean ADC and p16 status. We found that p16 had a significant effect on tumor control (HR for p16+ vs. p16– of 0.21, p = 0.005), and that pre-RT mean ADC had a significant effect in p16– tumors (HR per 1 SD increase in ADC = 1.9, p = 0.015) but no effect in p16+ tumors (HR = 1.0, p = 1.0). The interaction between p16 status and ADC was not statistically significant (p = 0.24, Table 2).
Since QI metrics other than mean ADC were significantly different between primary and nodal tumors (p < 0.05), the QI metrics of nodal tumors were tested separately for prediction of regional failure free rates. In Cox models of 82 nodal tumors with p16 status as a co-variate, GTV pre-RT and at 2 weeks, TVbv at 2 weeks, mean and max SUV in MTV50 pre-RT, MTV50 pre-RT, TLG pre-RT, and change in GTV at 2 weeks vs. pre-RT were significant with p < 0.07 with FDR control (Table 3). It is interesting to note that GTV pre-RT and at 2 weeks as well as mean SUV and TLG pre-RT had the highest c-index (>0.9). However, MTV50 and TLG as well as TVbv were strongly correlated with GTV pre-RT (range of r between 0.88 and 0.90), suggesting that these metrics are not independent of GTV. The mean and max SUV in MTV50 were strongly correlated with each other (r = 0.98) but modestly correlated with GTV pre-RT (range of r between 0.65 and 0.67).
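For readers unfamiliar with the c-index, the sketch below computes Harrell's concordance index for right-censored data: the fraction of comparable tumor pairs in which the tumor that failed earlier also had the higher predicted risk. Whether this exact variant was used in the analysis is an assumption, and the function name c_index and the toy data are illustrative only.

```python
def c_index(times, events, risk):
    """Harrell's concordance index for right-censored data.

    times: follow-up times; events: 1 = failure, 0 = censored;
    risk: predicted risk score (higher means earlier failure expected).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when tumor i failed before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable if comparable else float("nan")

times = [5, 10, 15, 20]
events = [1, 1, 0, 0]
perfect = c_index(times, events, risk=[4, 3, 2, 1])
one_swap = c_index(times, events, risk=[3, 4, 2, 1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so values above 0.9 indicate that the metric ranks failing nodes ahead of controlled nodes in most comparable pairs.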

Predictive Values of Imaging Biomarkers for Distant Progression
For prediction of distant progression, Cox models identified that TVbv of primary tumors at 2 weeks, total TVbv of all nodal tumors pre-RT and at 2 weeks, total GTV of all nodal tumors at 2 weeks, mean and max SUV of all nodal MTV50 pre-RT, and TLG pre-RT of all nodal tumors had a nominal p < 0.05 without FDR control. With FDR control, total GTV of all nodal tumors at 2 weeks, mean and max SUV of all nodal MTV50 pre-RT, and TLG pre-RT had a p < 0.1 (Table 4). We tested whether the significant predictors could provide any complementary information to clinical stage of T4/N3 and the sum of all nodal GTVs at 2 weeks for prediction of DF, and found that neither total TVbv, nor mean and max SUV, nor total TLG of all nodal tumors had a p < 0.3, and only TVbv of primary tumors at 2 weeks was marginally significant (p = 0.06).

Imaging Biomarkers for Differentiation of Tumors With LF (or RF), DF, and NED
For primary tumors, the subvolumes of low BV pre-RT showed a descending trend from LF to DF to NED, with a marginally significant p-value of <0.06 without FDR control (Table 5). Figure 3 shows the subvolumes of low BV of primary tumors with LF, DF, and NED pre-RT and at 2 weeks, as well as their change rates after 10 fractions of RT. Post hoc analysis showed that the change rates of low BV subvolume were significantly smaller in primary tumors with DF (−0.05% ± 0.16%) than in tumors with LF (−0.49 ± 0.08%) and tumors with NED (−0.45 ± 0.09%), with p-values of <0.03 and <0.015, respectively. For nodal tumors, GTV pre-RT and at 2 weeks, the subvolume of low BV pre-RT, mean ADC at 2 weeks, mean BV at 2 weeks, and mean/max SUV of MTV50 pre-RT were different among the DF, RF, and NED groups with p < 0.05 without FDR control and p ≤ 0.1 with FDR control (Table 5). Again, GTV of nodal tumors was the strongest parameter for differentiating the three groups with different outcomes. Regarding the difference between the DF and RF groups, only mean BV values at 2 weeks had a p < 0.05 without FDR control, but p > 0.1 with FDR control. Figure 4 shows GTVs, the subvolumes of low BV, mean ADC, and mean BV of nodal tumors with RF, DF, and NED pre-RT and at 2 weeks. Figure 5 shows mean SUV, max SUV, and TLG of nodal tumors with RF, DF, and NED pre-RT.

DISCUSSION
In this study, we investigated p16 effects on MRI and PET QI metrics, imaging biomarker differences as a function of tumor control (local, regional, or distant), and the relative predictive values of MRI and PET biomarkers for tumor progression in locally advanced poor prognosis HN cancers. Our cohort of patients had large tumor volumes compared to previously reported literature (2-6, 9-11, 33). We found that the p16– primary tumors had elevated ADC values pre-RT and low early response rates compared to p16+ tumors; the latter finding has not been previously reported. Also, a high mean ADC value pre-RT is a hazard for local and regional failure of p16– tumors. Multiple MRI and PET imaging parameters (including GTV, ADC, BV, SUV, and TLG) predicted RF and DF, but the nodal GTV defined on anatomic MRI was the strongest predictor. Most interestingly, we report for the first time that persistent low BV in primary and nodal tumors during the early course of CRT is associated with a high risk for distant failure. In order to identify patients who may benefit from intensified local therapy in the form of a radiation boost or surgical intervention, or from intensified systemic therapy (30,34), we analyzed the significant imaging predictors found in Cox modeling for differentiation of the tumors that were controlled from those with LF, RF, or DF. The MRI-related parameters performed more strongly than the PET parameters. Although PET is a part of standard care, MRI could play an important role in treatment planning, early response assessment, and boost target definition.
We found a p16 effect on ADC and ADC change rates during the early course of RT. The p16– primary tumors had significantly greater mean ADC values pre-RT and smaller increases in ADC after 2 weeks of CRT than p16+ primary tumors. Furthermore, the p16– tumors from patients with local or regional failure had significantly greater mean ADC values pre-RT and mid-treatment than those from disease-free patients. These results are consistent with previous reports that a high pre-RT ADC is negatively prognostic for HN cancers (9)(10)(11). A recent study shows that ADC is significantly and inversely correlated with cell density but also significantly and positively correlated with the percentage area of stroma in laryngeal and hypopharyngeal carcinoma (35). The former finding has been reported previously in animal studies, prostate cancer, and lymphomas (36)(37)(38)(39), and is related to restricted water diffusion due to high cellularity. The latter finding suggests that a large percentage area of stroma in HN cancers is associated with a high ADC. Stroma has been shown to be negatively prognostic in several cancers, to promote tumor growth and invasion, and to potentially protect tumors from delivery of chemotherapy (40)(41)(42)(43)(44)(45). ADC behaviors in the p16– tumors could be explained by their increased stroma. HPV-related oropharynx cancers are basaloid in histology with significant tumor lymphocytic infiltration, which is associated with improved prognosis (46,47) and decreased ADC.
ADC, although a promising QI metric for differentiation of local and regional failure, and even distant failure, is affected by multiple biologic and physiologic factors, including cell density and stroma as well as cysts and necrosis (in this study, we excluded grossly cystic and necrotic regions from the QI metric analysis). Low BV in primary tumors, and its persistence during the early course of RT, has been reported previously to be associated with LF (12)(13)(14)(15)(16)(17). However, there is no report that low BV and its low response rate in HNSCC during the early course of RT are associated with DF. The subvolumes of low BV in primary tumors show a descending trend from LF to DF to NED. The response rate of low BV could be used to differentiate tumors at high risk for LF or DF from those with NED, and thereby to adapt intensified local or systemic therapy for patients with different progression risks.
Pretreatment FDG QI metrics, including MTV, TLG, and mean/max SUV, have been reported to be correlated with PFS and OS in patients with HN cancers treated with CRT (3-6). We found that high mean/max SUV and large TLG in nodal tumors were risk factors for nodal failure, and that the sum of TLG over all nodal MTVs was a negative prognostic factor for DFFS, which is consistent with several previous reports (2)(3)(4). Although TLG accounts for both the size and SUV of the MTV, we found that nodal TLG was strongly correlated with MRI-defined GTV, and nodal GTV was the strongest predictor for RF in our study. For prediction of RF and DF, several other MRI parameters (including GTV, ADC, and BV) performed as well as FDG-PET related parameters. When including T4/N3 and total nodal GTV in the Cox model, no other imaging parameters, including PET, were found to be significant. Finally, there were no FDG-PET related parameters that could predict LF.
Radiomics analysis of CT and PET features is another area of imaging analysis that could provide complementary information to the present study. Radiomics analysis that extracts large numbers of quantitative textural features from CT, PET, and MRI has been investigated for the prediction of local control, PFS, and OS in head and neck cancers (48)(49)(50)(51)(52). Through feature selection and reduction processes, a small number of features have been found to have prognostic or predictive value. These features include the general categories of statistical energy, shape compactness, gray level non-homogeneity, and gray level non-uniformity. These features may represent different tumor phenotypes. However, it is hard to link such features to tumor physiology, pathology, and biology. Furthermore, radiomics approaches require a large amount of high-quality image data and high-throughput processing.
A limitation of the present analysis is the RT boost delivered to tumor subvolumes with persistent low BV and low ADC in our clinical trial. This could affect the QI metrics that are identified for prediction of treatment failure. We will perform this analysis on patients who are on the standard treatment arm when the trial is completed and the data have matured. Nevertheless, we found that persistent low BV in primary and nodal tumors carries a high risk for nodal and distant failure, that the low response rate of low BV carries a high risk for distant failure, and that a low response rate of ADC is found in p16– primary tumors. MRI-derived biomarkers perform at least as well as FDG-PET defined ones. As MRI-based planning is already well integrated into radiation therapy, our findings suggest that MRI-based response assessment will be a valuable guide in adaptive radiation therapy.

DATA AVAILABILITY STATEMENT
The image data that were collected in the patients with at least 10 months follow-up in the study are included in the manuscript/supplementary files.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board of the University of Michigan. The patients/participants provided their written informed consent to participate in this study.