The Impact of Clinical Trial Quality Assurance on Outcome in Head and Neck Radiotherapy Treatment

Purpose: To investigate the impact of radiation treatment quality assurance (RTQA) on treatment outcomes in a phase III trial for advanced head and neck cancer. Materials and Methods: A total of 767 patients from NRG/RTOG 0522 were included in this study. The contours of target volume (TV) and organ at risk (OAR), and the dose-volume coverage of targets, were reviewed and scored (per-protocol, variation-acceptable and deviation-unacceptable) according to the protocol. We performed log-rank tests of RTQA scores against patients' outcomes, including local control (LC), distant control (DC) and overall survival (OS). Cox models with and without RTQA score data were established. To obtain a more reasonable model, per-protocol and variation-acceptable were combined into a single acceptable score. Results: The log-rank test showed that all RTQA scores correlated with LC, which was significantly different between the per-protocol and variation-acceptable patients in target and OAR contouring (p-value = 0.004 and 0.043, respectively). For the dose-volume score, the per-protocol and variation-acceptable patients were each significantly different from unacceptable patients in LC (p-value = 0.020 and 0.006, respectively). The DC of variation-acceptable patients was significantly different from that of unacceptable patients (p-value = 0.043). There were no correlations between RTQA scores and other outcomes. Incorporating RTQA scores into outcome modeling improved the performance of the LC model from 0.62 to 0.63 (c-index). The RTQA scores had no impact on DC and OS. Conclusion: RTQA scores are related to patients' local control rates in head and neck cancer radiotherapy.


INTRODUCTION
Clinical trial quality assurance (QA) programs have been shown to be vital in ensuring that inter-institutional differences do not dilute trial results (1). In large multi-institutional trials, credible assessment of the comparative role of radiation therapy (RT) is only possible if the delivered RT is well-documented and sufficiently homogeneous in its delivery. Furthermore, it has been demonstrated that non-adherence to protocol-specified RT requirements for plan quality is associated with reduced survival and local tumor control, and can potentially lead to increased toxicity (2)(3)(4)(5)(6)(7).
Most RTOG clinical trials have a radiation therapy quality assurance (RTQA) process that evaluates RT scores (contour, dose distribution) retrospectively or prospectively. Quality assurance is a resource intensive process, both from the institutions' and from the clinical trial QA centers' perspective. Furthermore, radiation therapy is a field utilizing rapidly evolving technologies such as the introduction over the last few decades of the electronic portal imaging device (EPID), the multileaf collimator (MLC), delivery technologies of intensity modulated radiation therapy (IMRT), volumetric-modulated arc therapy (VMAT) and cone-beam computed tomography (CBCT) for image guided radiotherapy (IGRT). Ensuring high quality implementation of these technologies has tremendously increased the workload for the entire radiotherapy team, and, thus, different QA procedures need to be prioritized (8). How to determine which QA methodology is relevant and efficient is of crucial significance.
With the emergence of individualized medicine and the increasing complexity (9), it is difficult to evaluate the value of one factor, which may correlate with other clinical factors. By establishing a reliable prediction model, the value of this factor can be assessed.
The aim of this study is to analyze the correlation between RTQA scores and patients' outcomes, and to evaluate the clinical value of RTQA scores by developing a quantitative predictive model of clinical outcome that contains RTQA scores and other clinical factors.
The study was performed in two parts: first we analyzed the correlation between the patient characteristics, RTQA scores and the patients' outcome; then, a logistic regression model was used to establish the prediction model. The accuracy of the model was validated by cross-validation and c-index.

MATERIALS AND METHODS
Trial Protocol and RTQA Process
The RTOG protocol provides details of the trial design and treatment regimens (10). Briefly, patients with stage III-IV carcinoma of the oropharynx, larynx, and hypopharynx, having Zubrod performance of grade 0 to 1, and meeting predefined blood chemistry criteria were enrolled after providing informed consent. From November 2005 to March 2009, 940 patients were enrolled. After removing patients with incomplete RTQA score data, 767 patients were included in this study. All the patients passed the initial scrutiny according to the RTOG protocol. Table 1 shows patients' characteristics. Event rates at 5 years of follow-up for these patients were 80.1% for local control, 76.3% for distant control, and 66.3% for overall survival. Median follow-up times were 36.8 months for local control, 37.0 months for distant control, and 42.4 months for overall survival.
Abbreviations: RTQA, radiation treatment quality assurance; TV, target volume; OAR, organ at risk; LC, local control; DC, distant control; OS, overall survival; QA, quality assurance; RT, radiation therapy; EPID, electronic portal imaging device; MLC, multileaf collimator; IMRT, intensity-modulated radiotherapy; VMAT, volumetric-modulated arc therapy; CBCT, cone-beam computed tomography; IGRT, image guided radiotherapy; TV_SCORE, score of target volume; OAR_SCORE, score of organ at risk; TV_DVA_SCORE, score of target dose-volume coverage; GTV, gross tumor volume; CTV, clinical target volume; PTV, planning target volume; EQD2, equivalent dose in fractions of 2 Gy.
The case review processes (which included contour and dosimetry evaluations) were performed retrospectively by the radiation oncology and radiation physics co-chairs as described in the protocol. Through this review, a quality score (per-protocol, variation acceptable or deviation unacceptable) was assigned to the contouring and planning of the major target and normal structures according to the protocol. The final overall quality score of target volume (TV_SCORE), organ at risk (OAR_SCORE) and target dose-volume coverage (TV_DVA_SCORE) was determined by the worst score in each of these categories. Table 2 shows the criteria for evaluation of target volume and dosimetry scores.
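The worst-score rule described above can be illustrated with a short Python sketch. This is not the trial's review software: the class name `QAScore`, the helper functions, and the per-structure example are hypothetical, and the only assumptions taken from the text are the three score levels, the "worst score wins" rule, and the later merging of the two acceptable levels for modeling.

```python
from enum import IntEnum


class QAScore(IntEnum):
    """Three-level RTQA score; a larger value means a worse score."""
    PER_PROTOCOL = 1
    VARIATION_ACCEPTABLE = 2
    DEVIATION_UNACCEPTABLE = 3


def overall_score(structure_scores):
    """Return the worst (highest-valued) score among the reviewed structures."""
    if not structure_scores:
        raise ValueError("at least one structure score is required")
    return max(structure_scores.values())


def collapse_for_modeling(score):
    """Merge per-protocol and variation acceptable into one 'acceptable' level."""
    if score == QAScore.DEVIATION_UNACCEPTABLE:
        return "unacceptable"
    return "acceptable"


# Hypothetical target-volume review for one patient (structure names are illustrative).
tv_review = {
    "GTV": QAScore.PER_PROTOCOL,
    "CTV": QAScore.VARIATION_ACCEPTABLE,
    "PTV": QAScore.PER_PROTOCOL,
}
tv_score = overall_score(tv_review)
print(tv_score.name, collapse_for_modeling(tv_score))
```

Encoding the levels as ordered integers keeps the aggregation a single `max` call and makes the two-level collapse used in the modeling step explicit.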

Prognostic Factors and Correlation Analyses
The prognostic factor selection was based on Egelmeer's study (11). The selected clinical factors were age at start of RT, IMRT, gender, T-stage, N-stage, primary tumor site, hemoglobin level, and equivalent dose in fractions of 2 Gy (EQD2), which was calculated from RT scores. To simplify the model, primary tumor site was categorized into 4 groups: oropharynx, supraglottic larynx, hypopharynx and others. Similarly, T-stage and N-stage were encoded into 4 ranks. EQD2 was calculated by the following formula (12):
$$\mathrm{EQD2} = D \cdot \frac{d + \alpha/\beta}{2 + \alpha/\beta} - \gamma \cdot \max(0,\, T - T_k)$$
where D is the total radiation dose, d is the fraction dose, α/β is 10 Gy, T is the overall treatment time, the accelerated repopulation kick-off time (T_k) is 28 days, and the loss in dose due to repopulation (γ) is 0.6 Gy/day. After transformation, the median EQD2 was 61.6 Gy (range, 20.62-65.80 Gy). Among prognostic factors, age, hemoglobin level, and EQD2 were analyzed as continuous values. Spearman correlation coefficients were calculated between clinical factors and RTQA scores. For tumor location, the chi-square test was performed to evaluate its relationship with RTQA scores. To evaluate the relationship between RTQA scores and patients' outcomes, we performed log-rank tests of RTQA scores against patients' outcomes. Since there are three levels for each RTQA score, the log-rank tests were performed between each pair of levels: per-protocol vs. variation acceptable, per-protocol vs. unacceptable, and variation acceptable vs. unacceptable.
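As a worked illustration of the EQD2 transformation, the sketch below assumes the standard time-corrected form, EQD2 = D(d + α/β)/(2 + α/β) − γ·max(0, T − T_k), with the parameter values quoted above (α/β = 10 Gy, T_k = 28 days, γ = 0.6 Gy/day). The function name and the example schedule are hypothetical and are not taken from the trial data.

```python
def eqd2(total_dose_gy, dose_per_fraction_gy, treatment_days,
         alpha_beta_gy=10.0, kickoff_days=28.0, gamma_gy_per_day=0.6):
    """EQD2 with a linear repopulation correction after the kick-off time."""
    # Fractionation term: convert the physical dose to 2 Gy equivalents.
    fractionation_term = (total_dose_gy
                          * (dose_per_fraction_gy + alpha_beta_gy)
                          / (2.0 + alpha_beta_gy))
    # Repopulation term: dose lost per day beyond the kick-off time.
    repopulation_loss = gamma_gy_per_day * max(0.0, treatment_days - kickoff_days)
    return fractionation_term - repopulation_loss


# Hypothetical schedule: 70 Gy in 35 fractions of 2 Gy delivered over 42 days.
print(round(eqd2(70.0, 2.0, 42.0), 1))  # 61.6
```

For this assumed schedule the fractionation term is 70 Gy and the repopulation loss is 0.6 × (42 − 28) = 8.4 Gy, giving 61.6 Gy, which coincides with the median EQD2 reported above.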

Prediction Model and Model Performance Evaluation
We used a simple modeling strategy to develop our prediction model. First, a univariate analysis was performed to select candidate factors. Briefly, patients were randomly separated into a training dataset (90%) and a validation dataset (10%). The model was developed in the training dataset and its performance was assessed in the validation dataset. We used the c-index to evaluate model performance. To obtain stable results, the whole process was repeated 10 times. To obtain a more reasonable model, we combined the per-protocol and variation acceptable scores into a single acceptable score for modeling. R (Version 3.3.0) was used to perform all statistical analyses and model development.

RESULTS
Figure 1 shows the results of the correlation analyses. The p-values for the chi-square test between RTQA scores and primary tumor site were 0.019, 0.002, and 0.147 for TV_SCORE, OAR_SCORE, and TV_DVA_SCORE, respectively. Figure 2 shows the Kaplan-Meier curves for different RTQA scores. The log-rank test showed that all RTQA scores were correlated with patients' local control. For target and OAR contouring, per-protocol was significantly different from variation acceptable (p-value = 0.004 and 0.043, respectively). For the dose-volume score, per-protocol and variation acceptable were each significantly different from unacceptable (p-value = 0.020 and 0.006, respectively). The dose-volume score was also correlated with patients' distant control: variation acceptable and unacceptable were significantly different (p-value = 0.043). There was no correlation between RTQA scores and other outcomes. Table 3 shows the c-index of the prediction model. By incorporating the RTQA scores, the performance of the prediction model for local control improved from 0.622 to 0.632. The RTQA scores had no impact on distant control and overall survival. Figure 3 shows the nomogram with RTQA scores for local control, which demonstrates the value of RTQA scores in clinical outcomes.
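To make the evaluation metric concrete, the following Python sketch shows one common definition of the c-index (Harrell's concordance index) together with a random 90%/10% split. The study itself used R, so this is only an illustration under stated assumptions: the function names, the seed handling, and the toy data are hypothetical, and the tie-handling convention (risk ties count as half) is one of several in use.

```python
import random


def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored data.

    A pair (i, j) is comparable when the subject with the shorter follow-up
    time had an event. The pair is concordant when that subject also has the
    higher predicted risk; ties in predicted risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def random_split(n_patients, validation_fraction=0.10, seed=0):
    """Shuffle patient indices and return (training, validation) index lists."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_patients * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


# Toy data: follow-up times, event indicators (1 = failure, 0 = censored), predicted risks.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.1, 0.2]
print(concordance_index(times, events, risks))  # 0.75

train_idx, valid_idx = random_split(767)
print(len(train_idx), len(valid_idx))  # 690 77
```

A c-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported change from 0.622 to 0.632 for local control in context. Repeating `random_split` with different seeds and averaging the validation c-index mirrors the repeated evaluation described above.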

DISCUSSION
In this study, the relationship between the RTQA scores and clinical outcomes was analyzed and the value of the RTQA scores was evaluated by prediction models. The results showed that the quality of contouring and of the treatment plan is correlated with patient local control. Further analysis demonstrated that only the dose-volume score can be used as an independent factor for predicting a patient's local control. Although the dose-volume score is correlated with patient distant control, it has no clinical value in predicting distant control. RTQA criteria have a direct impact on the final RTQA score, especially for the dosimetry evaluation. Strict criteria increase the planning difficulty and decrease the proportion of per-protocol plans. For example, in this study, 63.9% of patients were scored per-protocol for target dose-volume coverage. If looser criteria such as variation acceptable were used, 91.3% of patients would fall into this category (Figure 4). Finding appropriate QA criteria is of utmost importance and remains a significant challenge. It would be better to analyze a sufficient number of cases before defining the per-protocol and variation acceptable limits.
We grouped per-protocol and variation acceptable into one category in modeling based on the original idea of the quality score. The original purpose of the quality score in the protocol was to provide a mechanism for stating the prescription for normal situations and more difficult treatment planning situations; the per-protocol criterion is used to encourage institutions to devise treatment plans that are as tight as possible in terms of dose conformity for PTV coverage. The variation acceptable compliance criterion is given to allow leeway for more difficult treatment planning situations. The deviation unacceptable score is used to indicate an incorrect prescription (13). However, this combination may decrease the performance of the prediction model. As Figure 4 shows, the deviation unacceptable category contains only a few patients, especially for the contour quality scores (6.0 and 3.1%). This may also bias the statistical analysis: the log-rank test shows that the Kaplan-Meier curves are significantly different between per-protocol and variation acceptable for the contour scores, whereas the unacceptable group is not significantly different from the other groups. This result is counterintuitive. The bias could be corrected by including more data.
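The pairwise log-rank comparisons discussed here can be sketched in Python as follows. This is a minimal textbook two-group log-rank test written for illustration, not the analysis code used in the study (which was carried out in R); the function name and all data values are hypothetical. The p-value uses the chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x/2)).

```python
import math
from itertools import combinations


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)
        n_a = sum(1 for s in at_risk if s[2] == 0)
        d = sum(1 for s in at_risk if s[0] == t and s[1])
        d_a = sum(1 for s in at_risk if s[0] == t and s[1] and s[2] == 0)
        # Observed minus expected events in group A at this event time.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical (times, events) per score level; the unacceptable group is small on purpose.
groups = {
    "per-protocol": ([12, 20, 31, 36, 40, 44], [0, 1, 0, 0, 1, 0]),
    "variation acceptable": ([6, 9, 14, 18, 25, 30], [1, 1, 1, 0, 1, 1]),
    "unacceptable": ([8, 22], [1, 0]),
}
for name_a, name_b in combinations(groups, 2):
    chi2, p = logrank_test(*groups[name_a], *groups[name_b])
    print(f"{name_a} vs. {name_b}: chi2 = {chi2:.2f}, p = {p:.3f}")
```

With only a handful of patients in one group, the variance term stays small and the test has little power, which is consistent with the concern raised above about the deviation unacceptable category.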
For the dose-volume score, a low quality score for target dose-volume coverage was caused mainly by two factors. First, the geometry of the case may make radiotherapy planning so difficult that it is impossible to achieve a per-protocol plan (e.g., the target volume may have a large overlap with an organ at risk, or technology limitations may not allow extremely rapid dose falloff). These are the cases that would be scored as variation acceptable. Second, the planner may not have sufficient experience and skills to find an acceptable plan. The TROG study showed that the quality of radiotherapy is most highly correlated with the number of patients enrolled at each center (3). This speaks to both the issue of experience and skill and the issue of available advanced technologies. To distinguish these two factors, further investigation is necessary.
In this study, we used the criteria defined in the protocol. The target dose coverage, minimum dose and maximum dose were included in the evaluation criteria. This trial started in 2006, when IMRT technology had not been fully implemented, which is why the OAR dosimetry quality score was not recorded. For the same reason, two RTQA parameters (OAR contour quality score and target dose-volume score) were not recorded for patients who were not treated with IMRT.
There were few patients who fell into the "deviation unacceptable" category, making statistical correlation with this outcome measure difficult. A valuable follow-up would be to repeat the analysis in another large dataset with a higher percentage of cases with unacceptable deviations. Abrams et al. (7) also investigated the impact of adherence to specified RT protocol guidelines on protocol outcomes for pancreatic adenocarcinoma. They found that failure to adhere to specified RT guidelines was associated with reduced survival and, for patients receiving gemcitabine, a trend toward increased non-hematologic toxicity. In our study, we did not analyze the impact of RTQA score on radiotherapy toxicity. Both of these analyses will be performed in future studies.
Although the experiment using current indexes demonstrated a promising result, the highly subjective nature of both contouring and contour assessment remains a major challenge. Cases from multiple centers show large deviations in contouring due to inter-observer variation. Similarly, the contour quality scores depend mainly on the physicians' subjective judgment. Moreover, the scoring has only three levels. Therefore, more objective and quantitative criteria are needed for contour assessment. Quantitative evaluation of contours may become more feasible in the future with technology developments in the areas of functional imaging, deformable registration, and contour atlases. We are working on this and will attempt to analyze more clinical trials to investigate the influence of RTQA score on treatment outcome.

CONCLUSIONS
This exploratory analysis found that the RTQA scores were related to patient local control in the RTOG 0522 trial. The influence of the subjective nature of quality scoring remains unknown. A controlled trial with an objective and pre-designed quality index merits further investigation.

DATA AVAILABILITY
The datasets generated and/or analyzed during the current study are not publicly available according to the Data Sharing Policy of NRG Oncology. Requests to access the datasets should be directed to NRG Oncology (APC@nrgoncology.org).

ETHICS STATEMENT
This retrospective study was approved and carried out in accordance with the Data Sharing Policy of NRG Oncology. All the data are de-identified. In addition, based on the retrospective character of this analysis, no additional written informed consent was needed.