Determining Clinical Patient Selection Guidelines for Head and Neck Adaptive Radiation Therapy Using Random Forest Modelling and a Novel Simplification Heuristic

Purpose To determine which head and neck adaptive radiotherapy (ART) correction objectives are feasible and to derive efficient ART patient selection guidelines.
Methods We considered various head and neck ART objectives including independent consideration of dose-sparing of the brainstem/spinal cord, parotid glands, and pharyngeal constrictor, as well as prediction of patient weight loss. Two hundred head and neck cancer patients were used for model development and an additional 50 for model validation. Patient chart data, pre-treatment images, treatment plans, on-unit patient measurements, and combinations thereof were assessed as potential predictors of each objective. A stepwise approach identified combinations of predictors maximizing the Youden index of random forest (RF) models. A heuristic translated RF results into simple patient selection guidelines, which were further refined to balance predictive capability and practical resource costs. Generalizability of the RF models and simplified guidelines to new data was tested using the validation set.
Results Top performing RF models used various categories of predictors; however, the final simplified patient selection guidelines only required pre-treatment information for ART predictions, indicating the potential for significant ART process streamlining. The simplified guidelines for each objective predicted which patients would experience increases in dose to: brainstem/spinal cord with sensitivity = 1.0, specificity = 0.66; parotid glands with sensitivity = 0.82, specificity = 0.70; and pharyngeal constrictor with sensitivity = 0.84, specificity = 0.68. Weight loss could be predicted with sensitivity = 0.60 and specificity = 0.55. Furthermore, depending on the ART objective, 28%-58% of patients required replan assessment, fewer than in previous studies, indicating a step towards more effective patient selection.
Conclusions The above ART objectives appear to be practically achievable, with patients selected for ART according to simple clinical patient selection guidelines. Explicit ART guidelines are rare in the literature, and our guidelines may aid in balancing the potential clinical gains of ART with high associated resource costs, formalizing ART trials, and ensuring the reproducibility of clinical successes.


INTRODUCTION
The spatial accuracy of IMRT and VMAT for head and neck radiotherapy can degrade over the course of treatment as tumor volumes and patient anatomy change. Previous studies indicate median decreases in gross tumor volume of 70% (1), and average weight loss of 8% (2) over the course of radical (chemo)radiotherapy. These anatomical changes may cause doses to organs-at-risk (OAR), such as the parotid glands, to increase by >10 Gy (3), and target coverage to degrade by >5% (4) in select patients. Adaptive radiation therapy (ART) replans patient treatments in response to anatomical changes, with single-institution clinical trials showing that ART may improve 2-year local regional control by 9% (5), reduce xerostomia and dysphagia by an estimated 11% (6), and significantly improve post-treatment quality of life (7).
Treatment replanning is simple in concept, yet routine ART is hampered by practical constraints. Replanning all head and neck cancer patients can place a significant burden on dosimetry, medical physics, and other departments (8). In addition, only about 20% of patients are expected to benefit from replanning (3); however, criteria to effectively identify these patients have not yet been established in the literature. Current patient selection for treatment replanning is often subjective, according to clinician discretion, making it challenging to reproduce the above ART trial results and successes. Simple ART patient selection approaches, such as monitoring changes in a patient's external contour, may be no better than randomly selecting patients for replanning (9). Existing ART models for patient selection show promise but still suffer from limited performance (10,11).
In this study, we develop simple guidelines to select patients for ART (including physician/physicist review of delivered doses, re-CT, refitting of immobilization, and/or treatment replanning), with the objective of decreasing the likelihood of toxicity, poor post-treatment quality of life, and/or tumor recurrence. We use random forest (RF) models to examine which ART objectives are practically achievable (i.e., predictable with reasonable resource use, according to RF capabilities), and further simplify model results using a novel heuristic to develop clinical patient selection guidelines. While full RF models capture the complexity of predictor-response associations, heuristic-based guidelines are more transparent and of a format that is familiar and intuitive for clinical staff. Our hope is that this step towards explicit ART patient selection guidelines will fill an important gap in the ART literature, allow for the formalization of ART trials and improve the reproducibility of clinical ART studies. Furthermore, such a modelling-simplification paradigm as presented in this study is generalizable to a variety of clinical settings that strive to balance the insight gained from complex analyses with the clarity required for clinical implementation.

Patient Inclusion Criteria
The study cohort consisted of 250 head and neck cancer patients treated at a single center with radical VMAT (chemo)radiotherapy (70 Gy/33 fractions) between November 2015 and September 2018. The VMAT technique used 2 arcs of 6 MV photons. Radiotherapy treatment planning objectives for planning target volumes (PTVs) and OAR are provided in Table 1. Patient radiotherapy treatments were planned using the Eclipse Treatment Planning System, Versions 11 and 13 (Varian Medical Systems, Palo Alto, CA). Institutional image-guided radiation therapy protocols used daily kV-orthogonal imaging and weekly kV-cone beam CT (CBCT) imaging. This study was approved by our institutional research ethics board (HREBA.CC-18-0093). Table 2 lists potential predictors identified based on clinical experience and according to measures broadly suggested in the literature. These were collected from the patients' electronic medical record (EMR), contoured planning CT (pCT), treatment plan (RTx), and rigid alignments of planning CT and last-acquired on-unit CBCT images (Obs). Some measurements, such as changes in brainstem and spinal cord volume, were

Adaptive Radiation Therapy Objectives
We independently considered nine ART objectives of interest, where initial RF models were developed to predict which patients would experience:
1. Increases in brainstem/spinal cord Dmax (whichever structure was planned closer to, or further in excess of, the planning objective) - potentially increasing the risk of brainstem necrosis or myelopathy;
2. Increases in parotid gland Dmean for the gland planned with the lowest mean dose - potentially increasing the risk of xerostomia;
3. Increases in pharyngeal constrictor Dmean - potentially increasing the risk of dysphagia;
4. Increases in submandibular gland Dmean for the gland planned with the lowest mean dose - potentially increasing the risk of xerostomia;
5. Decreases in high-dose CTV D95% target coverage - potentially increasing the risk of tumor recurrence;
6. Increases in high-dose CTV D2% target hotspot - potentially increasing the risk of tissue necrosis;
7. Increases in volume of high-dose CTV - potentially indicating poor treatment response;
8. Decreases in body mass index (BMI) - potentially prognosticating poorer overall survival and disease-specific survival;
9. Increases in on-unit patient setup time from the first kV-orthogonal image to beam-on, including CBCT-based adjustments - indicating greater staffing and resource costs.
Although objectives are expected to be correlated, each RF model was developed to predict a specific objective in an attempt to clarify predictor-objective associations. Further detail on the clinical implications of select objectives is provided in Table 3.
An inter-fractional anatomic or dosimetric change potentially increasing the risk of an adverse effect was defined as a "violation" warranting an ART replan assessment. All other changes were considered "normal" (e.g., resulting from minor anatomical changes or variations in patient setup).

Dosimetric ART Objectives
Deformable Image Registration Workflow Quality Assurance
Delivered dose was estimated by deformably registering the planning CT and last-acquired CBCT images (Velocity™ Version 3.2.0, Varian Medical Systems) (24,25), copying the original treatment plan to the resulting contoured "synthetic CT", and recalculating dose (26). Synthetic CTs therefore combined the clinician contours, field of view, and HU calibration curve of the planning CT with the changes in anatomy captured by the last-acquired on-unit CBCT. Quality assurance of the workflow compared DIR output with the consensus contours of two radiation oncologists specializing in head and neck cancer, on a subset of representative images (27-29). Full details on the quality assurance analysis approach are provided in Supplementary Material - Part 2.

Patient Data Labels: Normal vs. Violation
To formalize normal vs. violation labels for each patient, according to each objective, we established tolerances to distinguish random variations (i.e., resulting from daily setup changes or workflow error) from systematic dose degradations. For this, we additionally analyzed the weekly CBCTs for 10 patients randomly selected from the cohort (65 synthetic CTs), performed a linear fit to each patient's weekly trend data (given the noise in trend data), and calculated the difference between the linear trend and actual objective estimate based on the last-acquired CBCT. Twice the standard deviation of these differences across all patients provided a random error deviation tolerance; violations in objective values exceeding the deviation tolerance were more likely to result from systematic effects. Given the deviation tolerance, we first determined normal vs. violation labels according to "planning criteria violations". For patients with planned doses meeting planning criteria, violations were present if: For patients with planned doses exceeding planning criteria, violations were present if: Secondly, we considered an "as low as reasonably achievable" (ALARA) screening paradigm that applied equation (2) to all patients, correcting, for example, any dose increases above planned values, including consideration of the deviation tolerance.
For comparison, for each of the planning criteria violations and ALARA approaches, we identified the quartile of patients with the worst planning criteria and ALARA violations without consideration of these random/systematic tolerances. Therefore, for each endpoint, we considered four normal/violation formats (planning criteria violations + deviation tolerance; ALARA + deviation tolerance; planning criteria violations + poorest quartile; ALARA + poorest quartile). Additional details and examples of the planning criteria and ALARA violation definitions may be found in Supplementary Material -Part 3.
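The deviation tolerance described above (a linear fit to each patient's weekly trend, the difference between the fitted and observed value at the last-acquired CBCT, and twice the standard deviation of those differences) can be sketched as follows. This is a minimal illustration only: the function names and the input layout are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption not stated in the text.

```python
from statistics import stdev


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for one patient's weekly trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviation_tolerance(weekly_series):
    """Twice the SD of (observed - fitted) at each patient's last CBCT.

    weekly_series: list of (weeks, values) pairs, one pair per patient
    (hypothetical layout). Sample SD is an assumption.
    """
    residuals = []
    for weeks, values in weekly_series:
        slope, intercept = linear_fit(weeks, values)
        fitted_last = slope * weeks[-1] + intercept
        residuals.append(values[-1] - fitted_last)
    return 2 * stdev(residuals)


# Made-up objective values for two patients (not study data).
example = [
    ([1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0]),  # perfectly linear trend
    ([1, 2, 3], [0.0, 0.0, 3.0]),          # last point above the trend
]
tolerance = deviation_tolerance(example)
```

Changes in an objective larger than `tolerance` would then be treated as more likely systematic than random.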

Clinical and Volumetric ART Objectives
Changes in the volume of the high-dose CTV were calculated from planning and synthetic CTs. Clinical and volumetric objectives had no planning objectives or pre-defined tolerances. Instead, we calculated the deviation tolerance of linearly projected trend values vs. calculated values to give a sense of the relative contribution of random noise in the data. For RF model development, we identified the quartile of patients with the most unfavorable relative changes in objective values (ALARA + poorest quartile formatting).

Training and Validation Datasets
We developed RF models using the first 200 chronological patients (treated November 2015 -January 2018). The subsequent 50 patients (treated January 2018 -September 2018) were reserved for model validation. Cohort characteristics are summarized in Table 4.

Random Forest Modelling
Random forest models were selected for their predictive capability and versatility (30), as well as analogy to clinical decision-making paradigms. Conceptually, these algorithms look at the majority vote of a set of decision trees, similar to an assessment by multiple clinicians.
The RF models used all predictor categories (EMR, pCT, RTx and Obs in Table 2) and combinations of categories to predict the magnitude of a violation for each objective except for #8: decreases in BMI. RF models for the latter excluded the Obs predictor category (already containing ΔBMI) and used only pre-treatment data (EMR, pCT, RTx). As RF model initialization is stochastic in nature, we used five different random initializations for each model. Receiver operating characteristic (ROC) curves were produced for each model and initialization by incrementally varying the value ("threshold") required to convert a five-fold cross-validated numerical violation estimate (regression) to categorical normal/violation output. A schematic for the prediction of violations using a trained RF model and a sample "toy" input is shown in Figure 1. The point on the ROC curve maximizing the sum of sensitivity and specificity (i.e., maximum Youden index) served as the primary metric for assessing model performance for a given ART objective. Area under the curve (AUC) provided additional information on model performance.
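The threshold sweep and Youden index selection described above can be illustrated with a short sketch. It assumes the cross-validated numerical violation estimates are already available as a list; the function names, the `>=` convention for calling a violation, and the sample values are hypothetical.

```python
def roc_points(estimates, labels, thresholds):
    """Sensitivity and specificity at each threshold.

    labels are True for "violation" and False for "normal". An estimate
    at or above the threshold is classed as a violation (an assumed convention).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        predicted = [e >= t for e in estimates]
        tp = sum(p and l for p, l in zip(predicted, labels))
        tn = sum((not p) and (not l) for p, l in zip(predicted, labels))
        points.append((t, tp / positives, tn / negatives))
    return points


def max_youden(points):
    """Return the (threshold, sensitivity, specificity) maximizing sens + spec - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


# Made-up regression estimates and reference labels (not study data).
estimates = [0.1, 0.2, 0.6, 0.8, 0.3]
labels = [False, False, True, True, False]
best = max_youden(roc_points(estimates, labels, [0.15, 0.5, 0.7]))
```

Maximizing sensitivity + specificity is equivalent to maximizing the Youden index, since the two differ only by the constant 1.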
To identify which objectives were most predictable given all combinations of potential predictor sets (EMR, pCT, RTx, Obs) and reference normal/violation paradigms (planning criteria and ALARA violations, with deviation tolerances or poorest quartile), we used a greedy stepwise approach (31) and Kruskal-Wallis rank-sum tests. Such an approach identified top performing RF models to be heuristically refined to produce simple patient selection guidelines. For each objective, parameters that most clearly differentiated models with strong vs. poor predictive capability according to Youden index were selected first; this parameter was then fixed and the process repeated. When multiple combinations of predictors produced ROC curves with a similar Youden index, we identified the model with the largest set of input parameters (most complete) and the model with the smallest set of input parameters (most parsimonious) for further testing. Of these two, the model obtaining a higher specificity for sensitivity values ranging from 0.60-0.80 was selected. Further details of RF model development and selection are included in Supplementary Material - Part 4.
While our sample size is relatively large for ART predictive model development, it is fairly small in the field of machine learning. To consider how sample size may have affected model performance, we further developed models using the first 100, 125, 150 and 175 consecutive patients from the training cohort and assessed five-fold cross-validated estimates of sensitivity and specificity.
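The greedy "select, fix, repeat" idea can be sketched as below. This is not the study's procedure: the spread between group medians of the Youden index is used here as a simple stand-in for the Kruskal-Wallis rank-sum test, and the factor names, levels, and Youden values are invented for illustration.

```python
from statistics import median


def greedy_select(results, factors):
    """Fix one modelling factor at a time.

    At each step, pick the factor whose levels separate the Youden index
    most (largest spread of group medians, a stand-in for a rank-sum test),
    fix it at its best level, and repeat on the remaining models.
    """
    fixed = {}
    remaining = list(factors)
    pool = list(results)
    while remaining:
        best = None
        for factor in remaining:
            groups = {}
            for r in pool:
                groups.setdefault(r[factor], []).append(r["youden"])
            medians = {level: median(v) for level, v in groups.items()}
            spread = max(medians.values()) - min(medians.values())
            if best is None or spread > best[0]:
                best = (spread, factor, max(medians, key=medians.get))
        _, factor, level = best
        fixed[factor] = level
        remaining.remove(factor)
        pool = [r for r in pool if r[factor] == level]
    return fixed


# Hypothetical model results (not study data).
results = [
    {"predictors": "pCT+RTx", "labels": "criteria", "youden": 0.60},
    {"predictors": "pCT+RTx", "labels": "ALARA", "youden": 0.50},
    {"predictors": "EMR", "labels": "criteria", "youden": 0.20},
    {"predictors": "EMR", "labels": "ALARA", "youden": 0.15},
]
choice = greedy_select(results, ["predictors", "labels"])
```

In this toy data the predictor set separates models most strongly, so it is fixed first, and the label format is chosen among the models that remain.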

Heuristic to Derive Simplified Patient Selection Guidelines for ART
To derive simple patient selection guidelines from the RF models, we modified an existing heuristic approach (26). Details of the present heuristic process are provided in Figure 2.
Conceptually, RF models are simplified by determining the values of high-importance predictors (according to mean squared error on out-of-bag samples) at the boundary of normal vs. violation predictions. Combinations of predictor values producing boundary results provided "cutoff" guidelines for patient selection. An explicit example of this heuristic process for the ART parotid gland sparing objective is presented in Supplementary Material -Part 5. Figure 3 summarizes the study design with respect to data collection, guideline development, and guideline validation. All analyses were performed in R (R Version 3.5.1, The R Foundation for Statistical Computing, Vienna, Austria) using the base and randomForest libraries. Figure 4 provides a representative example of the geometric and dosimetric changes in patient anatomy occurring between the planning CT and synthetic CT.
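Once cutoffs are available, applying a guideline and scoring it against reference labels is straightforward. The sketch below assumes a guideline is a conjunction ("AND") of predictor cutoffs; the predictor names, cutoff values, and patient records are hypothetical placeholders, not the guidelines derived in this study.

```python
def flag_for_replan(patient, cutoffs):
    """True when every listed predictor meets or exceeds its cutoff (an "AND" rule)."""
    return all(patient[name] >= value for name, value in cutoffs.items())


def evaluate(patients, violations, cutoffs):
    """Sensitivity, specificity, and fraction of patients flagged by a guideline."""
    flags = [flag_for_replan(p, cutoffs) for p in patients]
    tp = sum(f and v for f, v in zip(flags, violations))
    tn = sum((not f) and (not v) for f, v in zip(flags, violations))
    positives = sum(violations)
    negatives = len(violations) - positives
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "fraction_flagged": sum(flags) / len(flags),
    }


# Placeholder cutoffs and patients (illustrative values only).
cutoffs = {"planned_oar_a_gy": 19.0, "planned_oar_b_gy": 40.0}
patients = [
    {"planned_oar_a_gy": 20.0, "planned_oar_b_gy": 45.0},
    {"planned_oar_a_gy": 25.0, "planned_oar_b_gy": 30.0},
    {"planned_oar_a_gy": 10.0, "planned_oar_b_gy": 50.0},
    {"planned_oar_a_gy": 22.0, "planned_oar_b_gy": 41.0},
]
violations = [True, False, False, False]
summary = evaluate(patients, violations, cutoffs)
```

The `fraction_flagged` value corresponds to the share of patients who would be referred for replan assessment under the guideline.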

Dosimetric ART Objectives
Deformable Image Registration Quality Assurance
DIR and physician contours were geometrically (27) and dosimetrically (28) consistent for all except two anatomical structure types (Supplementary Material - Part 2), validating the DIR workflow used. Exceptions were submandibular glands and high-dose CTV target coverage; as a result, RF models were not developed for the corresponding ART objectives.

Patient Data Labels: Normal vs. Violation
Deviation and quartile tolerances from the trend analysis are included in Table 3 for select ART objectives. Omitted from Table 3 is patient setup time, which did not show systematic trends with progression through treatment. In addition, only 6 of 250 patients had increases in high-dose CTV D2% exceeding planning criteria, creating a dataset with low prevalence. Both

Table 5 summarizes the achievability and predictor sets required for each of the ART objectives. Of these, for RF models achieving AUC ≥0.75, Figure 5 shows ROC curves averaged over the five random initializations. In general, factors most affecting model performance included: predictor set combinations (EMR, pCT, RTx, and/or Obs), followed by normal/violation formatting (planning criteria vs. ALARA violation; deviation tolerance vs. poorest quartile). Models based on planning criteria violations outperformed those based on the ALARA paradigm. Furthermore, for dosimetric objectives, models developed using deviation tolerances outperformed those identifying the quartile of patients with the largest violations.

Table 5 row: 6) Increase in high-dose CTV D2% | No (too few patients with violation to produce a predictive model) | -

FIGURE 1 | Schematic of how the tree-based RF models predict an ART objective violation for a given patient with "toy" values for illustration purposes. Each tree within the model is developed using a random subset of patients in the training dataset. Additional specifications are placed on how each tree is grown (only a random subset of predictors is available to split upon at each tree node). To predict an objective violation for a new patient, patient data is input into the model. An average violation estimate from all trees indicates whether the patient may require a replan assessment.

FIGURE 4 | The patient shown was identified as having changes representative of approximately 12% of the training cohort, according to data clustering performed for deformable image registration quality assurance. Axial slices correspond to: 1) the centers of mass of the parotid glands; 2) centre of mass of the high-dose PTV; 3) centre of mass of the pharyngeal constrictor, assessed for the planning CT and rigid alignment of the synthetic CT. A dose color wash indicates doses ranging from 95% of the maximum allowable spinal cord dose, to 105% of the high-dose prescription. Anatomical structure contours are overlaid. Notably, the patient experienced weight loss, loss of parotid gland volume, and a general increase in doses to healthy tissues.
Youden index decreased for the validation dataset, as expected, with an average decrease across all objectives of 0.12. This behavior generally occurs due to slight model overfitting on training data (10).
Constraints on training cohort size did not appear to limit RF model results. Average AUC only increased by 1% when doubling the size of the training dataset from 100 to 200 patients. However, the standard deviation of AUC for the five random initializations of each model decreased by an average of 44%. Table 5 gives the simple patient selection guidelines and performance on the validation dataset for the achievable ART objectives. The percentages of patients indicated for replan assessment were: 28% for brainstem/spinal cord; 33% for parotid glands; 58% for pharyngeal constrictor; and 49% for weight loss. For the simplified criteria, Youden index on the validation dataset increased by an average of 0.15 compared to the training dataset.

Heuristically Simplified Patient Selection Guidelines
Although some of the top performing models included elements from the EMR and Obs input categories, these could be removed from the simplified criteria with only minor losses in sensitivity and specificity. For the brainstem/spinal cord Dmax objective, ΔNeck diameter ≥5 mm was originally included in the patient selection criteria. For the pharyngeal constrictor Dmean objective, the heuristic retained ΔFace diameter ≥6 mm and bilateral treatment. For the latter, all patients planned with a contralateral parotid gland Dmean exceeding 19 Gy received bilateral treatment, and the redundant EMR parameter was removed. Furthermore, removing the on-unit measurements (ΔNeck diameter, ΔFace diameter) reduced specificity by 0.06 for both brainstem/spinal cord and pharyngeal constrictor objectives. This moderate reduction in performance may be offset by significant gains in overall ART workflow streamlining, as further examined below.

DISCUSSION
This study shows that RF modelling may be used to examine complex data associations, where results may be heuristically simplified to produce clinical guidelines that are familiar and intuitive for clinicians. Previous studies have aimed to predict various ART objectives (10,11,32,33). While a comparable model in the literature predicting parotid gland dose increases achieved specificity of 0.25 for sensitivity of 0.80 on a validation dataset (10), our models and simplified guidelines have achieved promising specificity of approximately 0.70 (sensitivity ≥0.80). In addition, our ART patient selection targets a smaller number of patients for replan assessment (28-58%) compared to 58% to 77% for parotid gland objectives previously published (10,32). Combining patient selection criteria from our study for brainstem/spinal cord, parotid gland, and pharyngeal constrictor objectives corresponds to ART referral for 65% of patients. While replanning 65% of patients may currently be too resource-intensive for rollout in busy clinics, the cost-benefit tradeoff for brainstem/spinal cord or parotid gland sparing may be more feasible. It may be possible to further refine pharyngeal constrictor and weight loss models by evaluating modified objective criteria (e.g., besides Dmean ≥50 Gy), although this falls outside of the scope of the present work and QUANTEC-motivated constraints.
By removing on-unit measurements in the simplified patient selection criteria, ART workflow streamlining may be considerably improved. The brainstem/spinal cord Dmax objective indicates the most conservative gains from workflow streamlining, where removing on-unit measurements resulted in 13 more false positive replan indications for the full study cohort over 35 months. However, on-unit image registration and measurements for the cohort are estimated to take 275 person-hours total (approximately 2 minutes per patient per fraction), significantly longer than re-CT and dose recalculation for the 13 false positive cases.
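The 275 person-hour estimate can be checked with simple arithmetic. The sketch below assumes that the roughly 2 minutes of on-unit registration and measurement would be incurred at each of the 33 fractions for all 250 patients; this per-fraction reading is an interpretation that reproduces the stated total, and the variable names are illustrative.

```python
patients = 250                # full study cohort
fractions_per_patient = 33    # 70 Gy delivered in 33 fractions
minutes_per_fraction = 2      # assumed: ~2 minutes per patient at each fraction

total_minutes = patients * fractions_per_patient * minutes_per_fraction
total_hours = total_minutes / 60
```

Under these assumptions `total_hours` evaluates to 275, matching the figure quoted for the cohort.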
The simplified criteria for dosimetric objectives contain anatomically unrelated OARs, indicating correlations with plan quality, where the proximity of target volumes to OAR may have increased OAR doses. In keeping with general treatment planning principles, healthy tissues doses likely were distributed among multiple OAR in an attempt to meet treatment planning criteria. For example, patients appear to be at risk of increased parotid gland dose given high initial parotid gland doses as well as high planned brainstem and spinal cord doses. The "AND" format of the simple patient selection guidelines is well-suited to capture these complex effects and reflects the underlying nature of RF algorithms.
In practice, we expect that the simple patient selection guidelines will be most efficiently implemented using basic treatment planning system scripting capabilities, and ultimately, that patient data may be continuously incorporated into RF model development via an auxiliary workflow. However, the simple guidelines are also amenable to being pinned to a dosimetrist's or booking clerk's wall for reference. The RF models and simplified guidelines were developed specifically for our institution's cohort and treatment practices; application to other practices must be carefully reviewed. For example, our center's radical (chemo)radiotherapy approach for these patients used two dose levels (high-dose CTV = 70 Gy, low-dose CTV = 59.4 Gy). Although this is a common practice, some centers may treat primary disease, high-risk and low-risk lymphatics with three dose levels, potentially affecting the incidence of OAR dose violations. While not statistically significant, slight improvements of the simplified criteria over full RF models may result from the simpler nature of the criteria (i.e., lower variance), and/or possible improvements in our institution's patient planning, immobilization, and on-unit image guidance. Although we strove to produce a comprehensive set of ART objectives, it is not exhaustive, and some objectives, such as losses in CTV coverage, could not be modelled due to DIR workflow errors specific to delineation of this anatomical structure.
A further limitation of this study is the use of last-acquired CBCT images for each patient to characterize during-treatment anatomical changes. This approach was motivated by the high resource costs associated with aggregating data for the study cohort, mainly arising from the manual inputs required for image DIR between planning CT and CBCT images. Assuming that patient anatomy was like the last-acquired CBCT for all images overestimates the clinical benefit of ART. However, as our focus is patient selection for ART, the greater "signal" of these images has been used to increase the ability of models to detect anatomical/dosimetric changes. In addition, this approach allowed us to produce a larger and more diverse patient cohort with the aim of developing robust ART models, as compared to processing multiple images per patient.
ART replanning is generally recommended during the first three weeks of treatment (3); however, timing may vary by objective. Although replan timing falls beyond the scope of the present study, it is the focus of ongoing work.
The study design presented may be used to develop ART patient selection criteria for other sites, such as lung, cervix, and anal canal. Selection of patients for ART assessment is expected to vary depending on the number and proximity of OARs, the nature of acute toxicities, and random vs. systematic interfractional anatomical changes.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the conditions of ethics approval do not currently allow data sharing. Requests to access the datasets should be directed to Sarah.Weppler@albertahealthservices.ca.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Health Research Ethics Board of Alberta. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.