Inter-Observer Variation in Delineating the Pharyngeal Constrictor Muscle as Organ at Risk in Radiotherapy for Head and Neck Cancer

Background and Purpose: To evaluate the inter-observer variation (IOV) in pharyngeal constrictor muscle (PCM) contouring, and the resultant impact on dosimetry and estimated toxicity, as part of the pre-trial radiotherapy trial quality assurance (RTQA) within DARS, a multicenter phase III randomized controlled trial investigating the functional benefits of dysphagia-optimized intensity-modulated radiotherapy (Do-IMRT) in pharyngeal cancers.
Methods and Materials: The outlining accuracy of 15 clinicians' superior and middle PCM (SMPCM) and inferior PCM (IPCM) contours was retrospectively assessed against gold standards (GS) using volume, location, and conformity indices (CIs) on a pre-trial benchmark case of oropharyngeal cancer. The influence of delineation variability on the dose delivered to the constrictor muscles with Do-IMRT, and on the resultant normal tissue complication probability (NTCP) for physician-scored radiation-associated dysphagia at 6 months, was evaluated.
Results: The GS SMPCM and IPCM volumes were 13.51 and 1.67 cm3 respectively; the corresponding clinician mean volumes were 12.18 cm3 (SD 3.0) and 2.40 cm3 (SD 0.9). High IOV in SMPCM and IPCM delineation was evidenced by low Dice similarity coefficient values, along with high geographical miss index and discordance index values. Delineation variability did not significantly affect the mean dose delivered to the constrictors, relative to the GS plan. Mean clinician NTCP was 24.6% (SD 0.6), compared to the GS-NTCP of 24.7%.
Conclusions: Results from this benchmark case demonstrate that inaccurate PCM delineation existed, even with protocol guidelines. This did not affect the dose delivered to this structure with Do-IMRT, or the estimated swallowing toxicity, in this single benchmark case.


INTRODUCTION
Irradiation of the pharyngeal constrictor muscle (PCM) is implicated in post-radiotherapy (RT) dysphagia in head and neck cancer (HNC), resulting in increased risks of aspiration, prolonged feeding tube dependency, and worsened health-related quality of life (1,2). Sparing RT dose to this critical dysphagia/aspiration at risk structure (DARS) is paramount to improve long-term swallowing function. The successful implementation of swallow-sparing RT techniques in HNC is therefore reliant on accurate contouring of this critical swallowing organ at risk (SW-OAR) to facilitate optimal avoidance during RT planning. DARS (CRUK/14/014) is a phase III randomized controlled trial in the UK that is currently investigating the functional benefits of reducing dose to the constrictors with dysphagia-optimized intensity-modulated RT (Do-IMRT), relative to standard IMRT, in cancers of the oropharynx and hypopharynx (3). Heterogeneity in PCM definition among clinicians within the study may lead to erroneous interpretation of RT-related morbidity, and consequently affect the assessment and interpretation of the primary endpoint of the study. In addition, variable contouring may lead to inaccurate correlation between PCM dose-volume parameters and radiation-associated morbidity, and any subsequent parameters generated for predicting swallowing toxicity may be misleading (4,5).
As part of the RT quality assurance (RTQA) program for DARS, clinicians were expected to successfully complete a pre-trial contouring case before enrolling patients in the study at their centers. Our aims in this study were to analyze the differences in PCM delineation between head and neck oncologists within the context of this pre-trial contouring program, evaluate the dosimetric impact of inter-observer variability (IOV) with Do-IMRT, and lastly, to determine the clinical impact of outlining variability on estimated swallowing toxicity.

DARS Pre-Trial Contouring RTQA Program
The pre-trial quality exercise included a contouring test case with a T2N2c base of tongue tumor (AJCC 7th edition), in which clinicians from 15 centers were required to delineate the clinical target volumes (CTV) and OARs, including superior and middle PCM (SMPCM) as one structure and inferior PCM (IPCM) as a separate structure. Do-IMRT planning was not required on the pre-trial contouring test case; a separate pre-trial planning test case with pre-outlined CTVs and OARs was supplied to participating centers, who were expected to submit a protocol-compliant Do-IMRT plan. The DARS trial RT protocol document described in detail the RTQA process for outlining and planning to facilitate the delivery of high-quality RT within the study. In particular, there was a comprehensive section on PCM delineation, which was based on the guidelines by Christianen et al. (6), and the slice-by-slice contouring atlas produced by the PATHOS RTQA team (7). Centers downloaded the planning computed tomography (CT) scan dataset, with gross tumor volume pre-outlined, in Digital Imaging and Communications in Medicine (DICOM)-RT format from the RTQA website. All completed cases were reviewed by the DARS RTQA team. Each submission was visually evaluated by the chief investigator to determine whether it conformed to the requirements of the trial protocol, and was classified as "per protocol," "acceptable variation with comments for future cases," or "unacceptable variation." Individualized feedback, as per the "Global Harmonization Group" guidelines (8), was subsequently provided to each clinician along with either an approval or a request for resubmission of contours. Participating centers were only permitted to recruit patients after successful completion of the pre-trial QA exercises.

Contour Analysis
This study was a retrospective quantitative and qualitative analysis of variation in PCM delineation from the initial submissions of 15 clinicians, relative to a gold standard (GS) PCM contour, in order to evaluate the IOV that would have existed for this novel structure had a pre-trial quality assurance program not been in place. Re-submitted contours were not evaluated in this study and will form part of another study. The GS in this study was created by a senior radiation oncologist who was part of the panel of international experts that developed and published the consensus guidelines for CT-based delineation of OARs, including the PCM, in HNC. The completed test case outlines were exported to the research version of the RayStation treatment planning system (version 5.9.9, RaySearch Medical Laboratories AB, Stockholm, Sweden) for analysis within this study. IOV was assessed using whole volume assessment, surface-based mean and maximum distance to agreement (DTA) (9), and volume-based conformity indices (CIs). These metrics were written in the Python programming language and implemented in RayStation as a script that could be executed for each study dataset. The following CIs were retrospectively evaluated to determine the concordance between clinician and GS contours (Supplementary Figure 1):
• Dice similarity coefficient (DSC): reflects the overall agreement between the volumes of two contours. An ideal score is 1, indicating perfect overlap with the GS contour (9,10). A score of > 0.7 is considered to represent good agreement between two contours (11-13).
• Geographical miss index (GMI): indicates the amount of GS contour not included in the clinician contour. An ideal score is 0, implying no "under-contouring" (14).
• Discordance index (DI): indicates the amount of clinician outlining not included in the GS contour. An ideal score is 0, indicating no "over-contouring" (15).
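The three indices can be illustrated with a minimal sketch. It is not the RayStation script used in the study, and the exact equations are those given in Supplementary Figure 1; the forms below are the commonly used definitions and are an assumption here. Contours are represented as sets of voxel indices, and the function name `conformity_indices` is hypothetical.

```python
def conformity_indices(gs_voxels, clinician_voxels):
    """Return (DSC, GMI, DI) for two contours given as collections of voxel indices.

    Assumed definitions (commonly used forms, not confirmed by the source):
      DSC = 2 * |GS & C| / (|GS| + |C|)
      GMI = |GS not in C| / |GS|   (fraction of GS missed: under-contouring)
      DI  = |C not in GS| / |C|    (fraction of clinician contour outside GS: over-contouring)
    """
    gs = set(gs_voxels)
    clinician = set(clinician_voxels)
    if not gs or not clinician:
        raise ValueError("both contours must contain at least one voxel")
    overlap = len(gs & clinician)
    dsc = 2 * overlap / (len(gs) + len(clinician))
    gmi = (len(gs) - overlap) / len(gs)
    di = (len(clinician) - overlap) / len(clinician)
    return dsc, gmi, di


# Toy example: two 16-voxel contours on one slice, shifted by two voxels in x,
# so that 8 voxels overlap.
gs_contour = {(x, y, 0) for x in range(4) for y in range(4)}
clinician_contour = {(x, y, 0) for x in range(2, 6) for y in range(4)}
dsc, gmi, di = conformity_indices(gs_contour, clinician_contour)
print(f"DSC={dsc:.2f} GMI={gmi:.2f} DI={di:.2f}")  # DSC=0.50 GMI=0.50 DI=0.50
```

Under these definitions, multiplying GMI by the GS volume gives the absolute GS volume missed, which is how a volume in cm3 and a percentage can describe the same miss.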
Contouring variation for the brainstem and parotid glands, two routinely delineated OARs in HNC, was also determined to serve as a comparator for the constrictors.
In addition to the whole-volume conformity analysis described above, a slice-by-slice evaluation of the CIs for clinician PCM contours (slice DSC (s-DSC), s-GMI, etc.) was carried out (Supplementary Figure 2) to identify volume variation on a slice-by-slice basis (14), using the equations described in Supplementary Figure 1. Positional variation on each slice was also established by evaluating the maximum distance from the surface of the GS delineation to the clinician contour in the anterior, posterior, right lateral, and left lateral directions.
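A slice-wise version of the indices can be sketched as follows, again assuming the commonly used definitions of DSC, GMI, and DI rather than the study's own script. Voxels are `(x, y, z)` index tuples with `z` as the axial slice index; the helper names are hypothetical. The directional maximum-distance measurement is not reproduced here.

```python
from collections import defaultdict


def group_by_slice(voxels):
    """Group (x, y, z) voxel indices into a dict: z -> set of (x, y) pixels."""
    slices = defaultdict(set)
    for x, y, z in voxels:
        slices[z].add((x, y))
    return slices


def slice_indices(gs_voxels, clinician_voxels):
    """Return per-slice s-DSC, s-GMI and s-DI for every slice holding either contour.

    s-GMI is None where the GS is absent on a slice and s-DI is None where the
    clinician contour is absent, since the corresponding denominators are zero.
    """
    gs_slices = group_by_slice(gs_voxels)
    cl_slices = group_by_slice(clinician_voxels)
    results = {}
    for z in sorted(set(gs_slices) | set(cl_slices)):
        gs = gs_slices.get(z, set())
        cl = cl_slices.get(z, set())
        overlap = len(gs & cl)
        results[z] = {
            "s_dsc": 2 * overlap / (len(gs) + len(cl)),
            "s_gmi": (len(gs) - overlap) / len(gs) if gs else None,
            "s_di": (len(cl) - overlap) / len(cl) if cl else None,
        }
    return results


# Toy example: GS on slices 0-2; clinician contour starts one slice late,
# is shifted by one pixel on slice 2, and extends one slice too far.
square = [(x, y) for x in (0, 1) for y in (0, 1)]
gs_contour = {(x, y, z) for z in (0, 1, 2) for x, y in square}
clinician_contour = (
    {(x, y, 1) for x, y in square}
    | {(x + 1, y, 2) for x, y in square}
    | {(x, y, 3) for x, y in square}
)
for z, values in slice_indices(gs_contour, clinician_contour).items():
    print(z, values)
```

Slices with an s-DSC of 0 at either end of the structure flag disagreement about its superior or inferior extent, which a single whole-volume score cannot localize.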
These metrics were not used as tools to provide feedback for submissions within the real-time pre-trial RTQA and were solely used for the purpose of this study.

Dosimetric Analysis
Centers were not expected to generate Do-IMRT plans for the pre-trial contouring test case. A three-step methodology was therefore adopted to quantify the dosimetric impact of IOV in PCM contouring for the test case, as shown in Figure 1. In step 1, the GS mean dose to the constrictors was determined by generating a GS Do-IMRT plan using GS target volumes and OARs, including SMPCM and IPCM. This was the reference plan against which clinician plans were compared. In step 2, 15 clinician Do-IMRT plans, each based on an individual clinician's delineation of the constrictor muscle, were created in order to determine the corresponding mean doses. For these plans, GS target volumes and non-swallowing OARs were used for RT optimization, rather than the clinician's own delineation of these volumes. This step isolated the dosimetric impact attributable solely to differences in PCM definition among the 15 oncologists. In step 3, the GS-SMPCM and GS-IPCM structures were superimposed on the clinician RT plans constructed in step 2, and the mean dose delivered to the GS contours on these plans was derived. This step allowed evaluation of whether the PCM dose reported on RT plans created using clinicians' definition of the constrictor muscle represents what the GS delineation receives. This outcome is relevant because, in the presence of contouring errors, the reported dose to this critical swallowing OAR may not be a true reflection of the dose received, and subsequently reported toxicity outcomes may therefore be inaccurate.
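The three mean-dose evaluations can be sketched on a toy dose grid. This assumes that mean dose is the unweighted average over the voxels of a structure; the dose values, structures, and the name `mean_dose` are hypothetical and are not study data.

```python
def mean_dose(dose_grid, structure):
    """Unweighted mean dose (Gy) over the voxels of a structure.

    dose_grid: dict mapping voxel index -> dose in Gy
    structure: iterable of voxel indices belonging to the contour
    """
    voxels = list(structure)
    if not voxels:
        raise ValueError("structure contains no voxels")
    return sum(dose_grid[v] for v in voxels) / len(voxels)


# Hypothetical 1-D dose profiles (Gy) for ten voxels, for illustration only.
gs_plan = dict(enumerate([60, 55, 48, 46, 44, 50, 58, 62, 65, 65]))
clinician_plan = dict(enumerate([60, 56, 52, 47, 45, 46, 49, 60, 65, 65]))

gs_pcm = {2, 3, 4, 5}         # GS contour
clinician_pcm = {3, 4, 5, 6}  # clinician contour, shifted by one voxel

step1 = mean_dose(gs_plan, gs_pcm)                # GS contour on GS plan
step2 = mean_dose(clinician_plan, clinician_pcm)  # clinician contour on clinician plan
step3 = mean_dose(clinician_plan, gs_pcm)         # GS contour on clinician plan
print(step1, step2, step3)  # 47.0 46.75 47.5
```

In this toy case the clinician plan reports 46.75 Gy to the clinician's own contour while the GS contour receives 47.5 Gy on the same plan, which is the kind of discrepancy step 3 is designed to expose.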
The Do-IMRT planning technique of DARS for oropharyngeal tumors has been described previously (3). In brief, the technique aims to spare dose to the constrictors by setting a mandatory mean dose of < 50 Gy to the volume of SMPCM (PlanSMPCM), together with an optimal constraint of < 20 Gy to the volume of IPCM (PlanIPCM) lying outside the high dose clinical target volume. A dose of 65.1 Gy in 30 fractions over 6 weeks was to be delivered to the therapeutic planning target volume (PTV1), and 54 Gy in the same number of fractions to the prophylactic PTV2.
The GS and clinician RT plans were generated with volumetric-arc therapy, consisting of two 360° arcs with mirrored collimator angles of 30° and 330° respectively, and optimized using the collapsed cone v3.4 algorithm in RayStation. The planning objectives and optimization process used for each clinician plan were similar to those used for the reference GS plan.
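As a small worked illustration, the prescription above corresponds to 2.17 Gy per fraction to PTV1 and 1.8 Gy per fraction to PTV2. The sketch below also checks a pair of mean doses against the two constrictor constraints. Treating the IPCM constraint as a mean dose is an assumption, and the function names are hypothetical.

```python
def dose_per_fraction(total_dose_gy, fractions):
    """Dose per fraction in Gy."""
    return total_dose_gy / fractions


def check_constrictor_constraints(plan_smpcm_mean_gy, plan_ipcm_mean_gy):
    """Check mean doses to PlanSMPCM and PlanIPCM against the Do-IMRT constraints.

    Assumption: the < 20 Gy IPCM constraint is applied to the mean dose,
    as the < 50 Gy SMPCM constraint is.
    """
    return {
        "SMPCM mandatory (< 50 Gy)": plan_smpcm_mean_gy < 50.0,
        "IPCM optimal (< 20 Gy)": plan_ipcm_mean_gy < 20.0,
    }


print(round(dose_per_fraction(65.1, 30), 2))  # 2.17
print(round(dose_per_fraction(54.0, 30), 2))  # 1.8
print(check_constrictor_constraints(47.0, 22.5))
```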

Predicted Swallowing Toxicity Analysis
The normal tissue complication probability (NTCP) for physician-scored RTOG > grade 2 radiation-associated dysphagia at 6 months with Do-IMRT was determined by applying the predictive model of Christianen et al. (16-18), in which mean doses to the superior PCM and supraglottic larynx were predictors of toxicity. Following the methodology used to determine the dosimetric impact of IOV in contouring, three NTCP estimates were calculated: GS-NTCP, based on the GS Do-IMRT plan; clinician NTCP, based on the clinician plans; and the estimated risk of dysphagia when the reference GS contours were superimposed on the clinician RT plans.
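A two-predictor model of this kind is typically expressed as a logistic function of the mean doses. The sketch below assumes that form; the intercept and coefficients are placeholders chosen only to make the arithmetic easy to follow, not the published model parameters, which should be taken from the cited references.

```python
import math


def ntcp_logistic(intercept, coefficients, mean_doses_gy):
    """NTCP = 1 / (1 + exp(-S)), with S = intercept + sum(coefficient * mean dose).

    coefficients and mean_doses_gy are dicts keyed by structure name.
    """
    if set(coefficients) != set(mean_doses_gy):
        raise ValueError("coefficients and doses must cover the same structures")
    s = intercept + sum(coefficients[k] * mean_doses_gy[k] for k in coefficients)
    return 1.0 / (1.0 + math.exp(-s))


# Placeholder parameters for illustration only (NOT the published values).
placeholder_intercept = -4.0
placeholder_coefficients = {"superior_pcm": 0.05, "supraglottic_larynx": 0.05}

ntcp = ntcp_logistic(
    placeholder_intercept,
    placeholder_coefficients,
    {"superior_pcm": 50.0, "supraglottic_larynx": 30.0},
)
print(f"NTCP = {ntcp:.1%}")  # S = -4.0 + 2.5 + 1.5 = 0, so NTCP = 50.0%
```

The three estimates described above would then differ only in which mean doses are supplied: those from the GS plan, those from each clinician plan, or those read from the GS contours on each clinician plan.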

Statistical Analysis
Analysis was performed using the Statistical Package for the Social Sciences (SPSS) version 25. Normally distributed variables were reported as mean and 95% confidence interval (95% CI), while those not normally distributed were reported as median and interquartile range (IQR). One-sample t-tests were used to compare clinician dosimetry and estimated toxicity against the corresponding GS values.
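The one-sample comparison can be illustrated with a minimal sketch that computes the t statistic and a 95% CI from summary values. The analysis itself was performed in SPSS; the function names here are hypothetical, and the two-sided 95% critical value of 2.145 for 14 degrees of freedom is supplied by hand because the Python standard library has no t distribution.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, reference):
    """t statistic and degrees of freedom for a one-sample t-test against a fixed reference."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - reference) / standard_error, n - 1


def confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Two-sided CI of the mean for a given critical t value."""
    half_width = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Summary values reported in this paper for clinician NTCP (n = 15) vs GS-NTCP of 24.7%.
t_stat, df = one_sample_t(sample_mean=24.6, sample_sd=0.6, n=15, reference=24.7)
ci_low, ci_high = confidence_interval(24.6, 0.6, 15, t_critical=2.145)
print(f"t({df}) = {t_stat:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

Because the inputs are rounded summary values, the interval agrees with the reported one only approximately.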

RESULTS
GS-SMPCM and GS-IPCM volumes were 13.5 and 1.7 cm3 respectively. Clinicians' mean SMPCM and IPCM volumes … Figure 3), and none for SMPCM contouring (Figure 2). The GMI values indicated that a mean of 6.3 cm3 (range 3.2-8.0 cm3) and 0.5 cm3 (range 0.2-0.9 cm3) of the GS-SMPCM and GS-IPCM contours were outside the clinicians' outlining respectively. In other words, on average 46.6% and 30.0% of the GS-SMPCM and GS-IPCM volumes were not included in the clinicians' delineation. The DI values, particularly for IPCM, imply substantial over-contouring. For 11 (73%) SMPCM and 3 (20%) IPCM contours, the maximum DTA was > 1 cm relative to the corresponding GS contour. In comparison, there was good agreement for the non-swallowing OARs, with DSC of > 0.80 for both parotids and the brainstem (Table 2).
The estimated risk of dysphagia is shown in Figure 5. GS-NTCP was 24.7%. The mean clinician NTCP differed from the GS-NTCP by 0.1% (95% CI of the mean 24.3-25.0, SD 0.6; p = 0.7); the corresponding difference when the GS contour was superimposed on clinician plans was 0.3% (95% CI of the mean 23.7-25.0, SD 1.1; p = 0.3).

DISCUSSION
To our knowledge, this is the first study to explore variation in PCM delineation, and its impact on predicted swallowing toxicity, in the UK. We have shown that clinicians' conformity to the GS volume for both SMPCM and IPCM was poor with the first submission, as evidenced by the variable whole volumes, with 1.5-fold and 3.4-fold differences between clinicians' volumes respectively, low DSC, and high DI and GMI scores. Whole-volume CIs, however, do not provide sufficient information about differences in size, shape, or location that may exist between two volumes. Similar CI values for different contours, therefore, do not necessarily indicate that the contours are identical. For instance, one clinician achieved a DSC of 0.65 (ranked 1st of 15), GMI of 0.23 (ranked 1st of 15), but a DI of 0.43 (ranked 11th of 15) for SMPCM delineation. Visual assessment of the contours, however, showed that the delineation did not extend laterally to encompass the pterygoid muscle as specified in the trial protocol. On the other hand, no protocol violation was identified for another clinician who scored a DSC of 0.62 (2nd of 15), GMI of 0.34 (3rd of 15), and DI of 0.43 (10th of 15) for SMPCM delineation. Outlining errors for the constrictor muscles may therefore be missed if whole-volume CIs alone were used to establish levels of agreement between contours. The addition of slice CIs provides a quantitative, and more objective, evaluation by facilitating the identification of slices of disparity between clinician and gold standard, which might lead to more robust analysis. The s-CI values for clinician IPCM delineation observed in this study imply that the relatively poor corresponding whole-volume CI values were largely due to uncertainty in defining the superior and inferior extent of this structure.
Our study also showed that systematic delineation errors occurred despite the presence of a detailed contouring protocol and delineation atlas. For instance, three clinicians wrongly assumed the caudal edge of cricoid cartilage as the inferior border of the IPCM. Spatial assessment for SMPCM delineation additionally demonstrated that concordance with the GS contour was poor in the middle section of this structure, where the lower s-GMI and s-DSC compared to the mean overall GMI and DSC suggested under-outlining as the contouring error. Visual assessment of the discordant slices identified that under-outlining was often due to failure to extend the delineation of SMPCM laterally to encompass the pterygoid muscle.
Certain factors may have influenced the poor PCM CI values, relative to GS. In contrast to the brainstem and parotids, where CT provides sufficient soft tissue contrast for delineation, the PCM is not readily visualized on CT and its contouring is therefore reliant on accurate interpretation of guidelines based on different anatomical landmarks, which is likely to have contributed to the higher degree of variation observed in this study. For instance, the cranial and caudal extent of PCM was subject to substantial IOV, implying uncertainty in identifying the tip of the pterygoid plates and the lower edge of the arytenoid cartilages, which may be due to unfamiliarity with identifying these on CT. It is also pertinent to consider the smaller volume of the constrictors relative to the standard OARs when interpreting the differential CI values. CIs are more sensitive for smaller volumes, as a few missing or extra voxels on one contour are sufficient to skew their values. On the other hand, they are more forgiving for larger volumes such as the parotids, where a relatively larger variation is required to produce a comparable CI result.
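The volume dependence of the indices can be shown with a short worked example, assuming the commonly used definition DSC = 2|A∩B| / (|A| + |B|). The voxel counts are hypothetical and chosen only to contrast a small structure with one ten times larger.

```python
def dsc_after_error(gs_voxels, missed, extra):
    """DSC between a GS contour and a contour that misses `missed` GS voxels
    and adds `extra` voxels outside the GS (all arguments are voxel counts)."""
    if gs_voxels <= 0 or not 0 <= missed <= gs_voxels or extra < 0:
        raise ValueError("invalid voxel counts")
    overlap = gs_voxels - missed
    clinician_voxels = overlap + extra
    return 2 * overlap / (gs_voxels + clinician_voxels)


# The same absolute error (20 voxels missed, 20 voxels added) on two structures.
small_structure = dsc_after_error(gs_voxels=100, missed=20, extra=20)
large_structure = dsc_after_error(gs_voxels=1000, missed=20, extra=20)
print(small_structure, large_structure)  # 0.8 0.98
```

An identical 20-voxel disagreement lowers the DSC to 0.80 for the small structure but only to 0.98 for the larger one.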
There are only a few studies that have investigated PCM contouring variability. Feng et al. found significant IOV among three clinicians in fractional overlap (intersection volume divided by union volume) for PCM (mean 0.5), when the muscle was delineated on three separate occasions (19). Alterio et al. additionally showed that there was increased intra- and inter-observer variability in delineation of the superior pharyngeal constrictor muscle, along with lower adherence compared to the corresponding MRI-contoured muscle, among 34 HN oncologists (20); the study group did not assess the dosimetric impact of IOV. It is difficult to make comparisons with the above studies, due to differences in the respective methodologies and delineation guidelines. Our work has not only identified that IOV for contouring of PCM existed, similar to the published literature, but also established the areas of maximum variation from the reference contour within the study population. The described measurements of IOV in this study were not used during the DARS pre-trial RTQA, where feedback to the clinicians was based on visual evaluation of their submissions by the quality assurance team. Adding such measurements may allow targeted analysis of areas of high discordance, and facilitate the introduction of semiautomated assessment measures (15).
The PCM often falls in the region of high dose and steep dose gradients. Inaccuracy in the contouring of this swallowing OAR could potentially under-report the mean dose received if voxels are erroneously placed outside the high-dose region, or have the converse effect if extra voxels are incorrectly placed within it. We therefore studied two surrogate clinical outcome measures, namely differences in dosimetry and estimated risk of swallowing toxicity at 6 months, to determine the impact of any contouring variation in the constrictor muscle on subsequent toxicity burden, relative to the reference contour. Despite establishing volumetric, overlap, and spatial variability in contouring of the PCM, we found that there was minimal impact on the mean dose delivered to this structure with Do-IMRT and on the risk of persistent swallowing dysfunction compared to GS. Such an outcome would suggest that variability in the delineation of this swallowing OAR does not affect the dose delivered with Do-IMRT, which would be consistent with the results of Feng et al., and that pre-trial contouring QA for this structure may not be necessary (21). Before drawing firm conclusions to that effect, it is pertinent to consider certain limitations of this study. This analysis was conducted on a single benchmark case with minimal target volume-PCM overlap, and it is possible that the clinical outcomes with PCM contouring variability could differ with an increasing number of cases and/or greater overlap. Furthermore, the ball diameter used by certain clinicians to contour the PCM was wider than the 3 mm used for the GS contour; at the time of the DARS pre-trial exercise, there was no agreed consensus about the width of this muscle for the purpose of delineation. Consequently, there was a larger dose gradient on their plans relative to the GS plan, explaining why the mean doses to the GS on some plans were lower.
Variability in supraglottic larynx delineation was not assessed in this study and it remains possible that outlining uncertainties for this structure may lead to different toxicity outcomes than the one presented in this study. Finally, the NTCP model applied in this study was not validated for the RT treatment technique used here.
In this study, an "expert-defined" gold standard was used as the benchmark contour, against which all contours were compared. Therefore, there may be an element of bias introduced into our results. Currently, there remains no consensus regarding the definition of a gold standard volume within the context of pre-trial quality assessment, with published studies choosing between an expert-defined GS contour, as in this study, and a mathematically derived consensus contour. Similarly, there could be a debate about the reproducibility of our GS Do-IMRT plan; however, the same would hold true for the clinician Do-IMRT plans too. The intent of this study was to examine the IOV and subsequent dosimetric and clinical impact, and we feel the possibility of OAR and plan variability would always remain irrespective of the chosen reference structure and plan. We did not analyze the differences in dose delivered to the constrictors with standard IMRT and Do-IMRT for each clinician outlining. This was not the aim of this study, and therefore the potential impact of delineation variability on dose delivered to the two arms of the DARS trial, and consequent implications for trial results, cannot be determined.
In conclusion, qualitative and quantitative assessments demonstrated considerable IOV in the delineation of the PCM on a single pre-trial benchmark case, due to a combination of inaccurate interpretation of the contouring protocol and unfamiliarity with radiological landmarks. The inconsistent definition of PCM did not have a detrimental impact on dosimetry or estimated toxicity, but it is premature to make such a conclusive assumption on a single test case alone. Future work would involve analysis of contouring from standard and Do-IMRT plans of treated trial patients and associations with clinical toxicity outcomes.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
IP: first author and corresponding author for this manuscript; original idea for this research, data collection and analysis, wrote the first draft and final version of this manuscript. DM: data collection, revised draft manuscript and approved final version of the manuscript. AD: data collection, revised draft manuscript and approved final version of the manuscript. JT: data collection, revised draft manuscript and approved final version of the manuscript. EH: data analysis, revised draft manuscript and approved final version of the manuscript. CN: senior author, data analysis, revised draft manuscript and approved final version of the manuscript. All authors contributed to the article and approved the submitted version.
DARS trial. The authors would like to thank the Trial Management Group and the clinicians for their contribution to the trial. CN and SB acknowledge research funding from CRUK (C7224/A13407). The research fellowship of IP was supported by a grant from Oracle Cancer charity trust. This project represents independent research supported by the National Institute for Health Research (NIHR) Biomedical Research Centre at the Royal Marsden NHS Foundation Trust and the Institute of Cancer Research, London. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.