A real-time contouring feedback tool for consensus-based contour training

Purpose Variability in contouring structures of interest for radiotherapy continues to be challenging. Although training can reduce such variability, having radiation oncologists provide feedback can be impractical. We developed a contour training tool to provide real-time feedback to trainees, thereby reducing variability in contouring. Methods We developed a novel metric termed localized signed surface distance (LSSD) to provide feedback to the trainee on how their contour compares with a reference contour, which is generated in real time by combining the trainee contour with multiple expert radiation oncologist contours. Nine trainees performed contour training by using six randomly assigned training cases that included one test case of the heart and left ventricle (LV). The test case was repeated 30 days later to assess retention. The distribution of LSSD maps of the initial contours for the training cases was combined and compared with the distribution of LSSD maps of the final contours for all training cases. The difference in standard deviations from the initial to final LSSD maps, ΔLSSD, was computed both on a per-case basis and for the entire group. Results For every training case, statistically significant ΔLSSD were observed for both the heart and LV. When all initial and final LSSD maps were aggregated for the training cases, before training, the mean LSSD ([range], standard deviation) was –0.8 mm ([–37.9, 34.9], 4.2) and 0.3 mm ([–25.1, 32.7], 4.8) for heart and LV, respectively. These were reduced to –0.1 mm ([–16.2, 7.3], 0.8) and 0.1 mm ([–6.6, 8.3], 0.7) for the final LSSD maps during the contour training sessions. For the retention case, the aggregated initial and final LSSD maps gave mean LSSD values of –1.5 mm ([–22.9, 19.9], 3.4) and –0.2 mm ([–4.5, 1.5], 0.7) for the heart and 1.8 mm ([–16.7, 34.5], 5.1) and 0.2 mm ([–3.9, 1.6], 0.7) for the LV.
Conclusions A tool that uses real-time contouring feedback was developed and successfully used for contour training of nine trainees. In all cases, the utility was able to guide the trainee and ultimately reduce the variability of the trainee’s contouring.


Introduction
Radiotherapy represents a balance between local tumor control and minimizing toxicity to normal tissues. Treatment plans are designed so that the prescription dose is delivered to the target while minimizing dose to nearby organs at risk (1,2). Radiotherapy treatment planning begins with accurate delineation of both the target volume and organs at risk. Previous studies have shown that substantial variations exist in the contouring process, including both intra-observer and inter-observer variations (3–8). These variations usually result from differences in training on how to generate contours and can be significantly influenced by the image quality of the contouring dataset. This is particularly true when contouring organs with low contrast relative to surrounding tissues. Previous studies that analyzed inter-observer variability in contouring suggested the need for consensus training in contouring (4). Large clinical trials have also called for consistent contouring across different institutions to produce meaningful outcomes in analyses of treatment-related toxicity (9).
Traditional contour training usually involves experienced attending radiation oncologists providing feedback to trainees via interactive teaching tools. This approach requires a significant time commitment from physicians on top of their clinical workload, and the feedback provided is often delayed, sometimes by several days. Such delays may reduce the effectiveness of contouring training. This interactive training approach is also more subjective than objective. Various online contouring training tools have been developed to address these shortcomings [e.g., eContour (10) and EduCase (11)], but most of these tools do not give meaningful feedback on how well a trainee is contouring and often rely on assumed "gold standard" contours. Errors in "gold standard" contours could introduce bias, especially for low-contrast organ contours, and may not help trainees improve their contouring skills. Although quantitative metrics are available in these tools to characterize contouring performance, they usually do not include any spatial or shape information about the organ being evaluated. As a result, these contour training tools cannot tell the trainee specifically where to adjust the contour to improve consistency.
In this study, we developed a software tool for consensus contouring training that provides real-time feedback on contouring performance without the presence of the radiation oncologist staff who traditionally fill this role. We developed a new quantitative metric containing spatial information for analysis of inter-observer variability that can guide the trainee to the specific location where contours need to be adjusted. This new metric enables real-time feedback on contouring performance so that the trainee can practice contouring continuously without interruption. We propose that our tool can improve training efficiency by providing real-time feedback without the need for experienced radiation oncologists to be present. Not only does this benefit modern radiation oncology clinics, where radiation oncologists' time is at a premium, but this tool can also be useful in low- to middle-income countries, which often have a great need for radiation oncology staff trained in contouring but have limited resources (12–14).

Overview of contour training tool
A contour training software utility was developed to provide real-time feedback on contouring to the user (referred to here as the trainee). The utility serves as a full contouring package, as it includes contouring tools commonly found in commercial treatment planning systems. The utility can also provide real-time contouring feedback to the trainee while they contour a region of interest or a specific organ. This real-time feedback can guide the trainee to specific spatial locations where contours need adjustment to improve consistency. This utility stores numerous contours from expert radiation oncologists that are used to quantify the consensus-contouring performance of the trainee. This real-time contouring feedback is expected to improve the trainees' skills in consensus contouring.

Localized signed surface distance (LSSD)
A new quantitative metric, localized signed surface distance (LSSD), was developed for real-time contouring feedback; it is based on mean surface distance. Specifically, for one structure, the disagreement between the trainee contour (T) and the reference contour (R) is examined within each slice. In one slice, the geometric center of the reference contour is determined first. Then the entire space in the slice is divided into N = 2π/Δθ sectors, with each sector spanned by an angle of Δθ, as shown in Figure 1. In each sector, the mean distance |Δd| between the section of the reference contour, ΔR, and the section of the trainee contour, ΔT, is calculated. The volumes enclosed by ΔR and ΔT in the sector are then used to calculate the sensitivity (P) and specificity (Q) of the trainee contour with regard to the reference contour. The sensitivity and specificity are used to determine a positive or negative distance for this sector. A positive distance suggests that the trainee's contour is larger than the reference contour or that the observer tends to draw contours that are larger than they need to be at this specific location. On the other hand, a negative distance suggests that the trainee's contour is smaller than the reference contour or that the observer tends to draw contours that are smaller than they should be at this specific location. Our LSSD definition is similar to the distance deviation measure proposed by Rogelj et al. (15) and the radial distance proposed by Sebastien et al. (16); however, our LSSD is a signed distance that indicates over- or under-contouring. The distance information of all sectors and slices is transformed to an LSSD map. The signed distance is colorized so that disagreement is easy to identify. In the common head-first supine setup, angles θ of 0°, 90°, 180°, and 270° represent left, anterior, right, and posterior locations, respectively. After the trainee completes contouring a structure, the LSSD map is updated, which provides real-time feedback on contouring performance to the trainee. With the LSSD map, the trainee can quickly identify the spatial locations of inconsistent contours. Also, an appropriate threshold can be applied to the LSSD map to emphasize large variations.
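The sector construction described above can be illustrated with a short Python sketch. This is a simplified radial approximation and not the authors' implementation: it assumes each contour on a slice is a densely sampled list of (x, y) points in mm, uses the vertex average as the geometric center, and takes the sign from the difference in mean radial distance instead of from the sensitivity and specificity rule. The function names and the default of 36 sectors are hypothetical choices made for illustration.

```python
import math


def centroid(points):
    """Vertex average, used here as a stand-in for the geometric center."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def sector_mean_radii(points, center, n_sectors):
    """Mean distance from the center to the contour points in each angular sector."""
    cx, cy = center
    width = 2.0 * math.pi / n_sectors  # sector angle
    sums = [0.0] * n_sectors
    counts = [0] * n_sectors
    for x, y in points:
        theta = math.atan2(y - cy, x - cx) % (2.0 * math.pi)
        k = min(int(theta / width), n_sectors - 1)  # guard against rounding at 2*pi
        sums[k] += math.hypot(x - cx, y - cy)
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def signed_sector_distances(trainee, reference, n_sectors=36):
    """Signed per-sector distance: positive where the trainee contour lies outside."""
    center = centroid(reference)  # sectors are centered on the reference contour
    r_t = sector_mean_radii(trainee, center, n_sectors)
    r_r = sector_mean_radii(reference, center, n_sectors)
    return [None if t is None or r is None else t - r for t, r in zip(r_t, r_r)]


def circle(radius, n_points=360):
    """Synthetic circular contour centered at the origin, for demonstration only."""
    return [
        (radius * math.cos(2.0 * math.pi * i / n_points),
         radius * math.sin(2.0 * math.pi * i / n_points))
        for i in range(n_points)
    ]
```

For example, `signed_sector_distances(circle(32.0), circle(30.0))` gives a value close to +2 mm in every sector, which corresponds to uniform over-contouring. Stacking one such row per CT slice yields a 2D map with the same layout as the LSSD map. The radial shortcut is only meaningful when the center lies inside both contours.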

Reference expert contours
Six training cases involving contours of the heart and left ventricle (LV) were used to validate the effectiveness of the training tool. These training cases were adopted from a set of atlases that were developed for an automatic multi-atlas contouring system (17). The heart and LV are important structures to spare from dose to control cardiac toxicity during radiotherapy planning (18–20). Studies have found that inconsistent contouring of these structures can greatly compromise toxicity control (9). In this study, the heart was chosen to represent a relatively easy structure for consistent contouring, whereas the LV was chosen to represent a relatively difficult case because of its low contrast to other heart chambers (21). For each training case, eight experts specializing in either thoracic radiation oncology or lymphoma radiation oncology delineated the heart and left ventricle individually according to the RTOG (Radiation Therapy Oncology Group) 1106 organ-at-risk contouring guideline (22) and a published cardiac atlas consensus contouring guideline (23). The contours were drawn on noncontrast CT images in the Pinnacle treatment planning system (Philips Medical Systems, Fitchburg, WI), with corresponding contrast CT images rigidly fused to constitute the reference. The manual contours of the eight experts and the CT images for all six cases were exported from the treatment planning system and imported into the stand-alone contour-training tool. These expert contours were used to generate the reference contours for LSSD map computation, as described in the next section.

Contour training software interface
Within the training software interface, the trainee is first prompted to select a training case and training structure (heart or LV). Once the training case is loaded into the software interface, only the CT image is shown. The trainee first creates and contours the entire region of interest on the CT scan. Behind the interface, the software creates a reference contour by fusing the trainee contour with those of the eight experts by using the simultaneous truth and performance level estimation (STAPLE) algorithm (24,25). The STAPLE algorithm is based on maximum likelihood estimates of the sensitivity and specificity (true-positive and true-negative rates) of individual contours. It estimates the best agreement among individual contours and produces a consensus contour (reference contour) that best represents the underlying anatomy. Neither the expert contours nor the reference contour is displayed to the trainee at any time. The trainee contour is then compared with the reference contour by using the LSSD metric to produce an LSSD map, which is displayed to the user as a 2D color map beside the contouring interface. The 2D LSSD maps are organized vertically by CT slice and horizontally by sector angle, as illustrated in Figure 2. A positive LSSD indicates that the trainee contour is beyond the reference contour within that sector and slice, and a negative LSSD indicates that the trainee contour is within the reference contour in that sector and slice.
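The consensus step can be illustrated with a minimal binary STAPLE sketch in Python. This is a didactic expectation-maximization loop over flattened 0/1 masks, assuming a single global prior and no spatial regularization; it is not the implementation used in the training tool. The function name `staple`, the initial values of 0.99, and the clamping constant `eps` are assumptions made for illustration.

```python
def staple(masks, n_iter=50, tol=1e-8, eps=1e-6):
    """Estimate a consensus from binary masks (one flat 0/1 list per rater).

    Returns (w, p, q): per-voxel consensus probabilities, and each rater's
    estimated sensitivity p[j] and specificity q[j].
    """
    n_raters, n_vox = len(masks), len(masks[0])
    # Assumed global prior: overall fraction of voxels labeled as structure.
    prior = sum(sum(m) for m in masks) / (n_raters * n_vox)
    p = [0.99] * n_raters
    q = [0.99] * n_raters
    w = [prior] * n_vox

    def clamp(v):
        # Keep rates strictly inside (0, 1) so the E-step never divides by zero.
        return min(max(v, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: probability that each voxel belongs to the true structure.
        new_w = []
        for i in range(n_vox):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if masks[j][i]:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            new_w.append(a / (a + b) if a + b > 0.0 else 0.0)

        # M-step: re-estimate each rater's sensitivity and specificity.
        fg = max(sum(new_w), eps)
        bg = max(n_vox - sum(new_w), eps)
        p = [clamp(sum(wi for wi, d in zip(new_w, m) if d) / fg) for m in masks]
        q = [clamp(sum(1.0 - wi for wi, d in zip(new_w, m) if not d) / bg)
             for m in masks]

        change = max(abs(x - y) for x, y in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w, p, q
```

In this sketch the trainee mask is simply one more entry in `masks` alongside the expert masks, and thresholding `w` at 0.5 gives a binary reference. A rater who over-contours relative to the others ends up with a lower estimated specificity and therefore less influence on the consensus.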
After the trainee completes the initial contour, the LSSD map can be updated as often as needed while the contours are revised.
FIGURE 1 Graphic illustration of the LSSD algorithm. Illustration of the quantitative metric with spatial information, the localized signed surface distance (LSSD). The entire space in one slice is separated into small sectors, with each sector spanned by an angle of Δθ with the origin at the geometric center of the reference contour (R). In each sector, the mean distance Δd between the piece of the reference contour, ΔR, and the piece of the trainee contour (T), ΔT, is calculated. The sensitivity and specificity of trainee contours are calculated by using the volumes enclosed by ΔR and ΔT, and they are used to determine a positive or negative distance for this sector. The distance information from all sectors and slices is then transformed to an LSSD color map to demonstrate variations in contouring.
The trainees were instructed to update the reference contour periodically during the process. An updated reference contour (consensus contour) is recreated from the updated trainee contour and the stored expert contours through the STAPLE algorithm. The LSSD map is interactive in that one can select an LSSD unit, and the contouring interface will advance to the corresponding slice and highlight the sector of interest. As the trainee modifies their contours, the LSSD map is updated with the current values. LSSD values near zero are displayed as green; LSSD values of ≥+3 mm are displayed as red; and LSSD values of ≤–3 mm are displayed as blue. The displayed LSSD map thus saturates: positive deviations larger than 3 mm all appear as red, and negative deviations larger than 3 mm all appear as blue. As the user adjusts their contour, the LSSD colormap is updated to show the trainee contour compared with the updated reference contour. Initial and final LSSD colormaps for a trainee progressing through contour training are shown in Figure 3. As the trainee progresses through training, the overall color of the LSSD map shifts towards green, i.e., an LSSD of zero.
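The saturating color scale can be sketched as a small Python function. The endpoints (green at 0 mm, red at +3 mm and beyond, blue at –3 mm and beyond) come from the description above; the linear blend between them and the function name `lssd_to_rgb` are assumptions, since the exact colormap used by the tool is not specified.

```python
def lssd_to_rgb(value_mm, limit_mm=3.0):
    """Map a signed LSSD value in mm to an (r, g, b) tuple with components in [0, 1]."""
    # Clip to [-1, 1] so that anything beyond +/- limit_mm saturates.
    t = max(-1.0, min(1.0, value_mm / limit_mm))
    if t >= 0.0:
        return (t, 1.0 - t, 0.0)   # green -> red for over-contouring
    return (0.0, 1.0 + t, -t)      # green -> blue for under-contouring
```

With this mapping, a value of +5 mm is drawn in the same red as +3 mm, which matches the saturation behavior described in the text.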

Contour training sessions
Nine trainees were recruited for contour training using this software tool to evaluate the training process and its effectiveness. The trainees were medical physics staff who had some knowledge of human anatomy but had not necessarily been trained in anatomy delineation from CT images. Each participating trainee was assigned six training cases, and one of those six cases was used as the retention case. The trainees were instructed to contour each case in an assigned order, which was chosen randomly to eliminate variation in contouring on a case-by-case basis. After each trainee had contoured the six training cases, the trainee waited 30 days to contour the retention case, to test whether the learned contouring skills were retained after a break in using the software. The retention case was the last case contoured in each trainee's training session.
The LSSD maps were saved during the contour training sessions to quantify the effectiveness of the training. Each grid element of the 2D LSSD map has a value representing the distance of the trainee contour from the reference contour within that particular slice and sector. A histogram of the LSSD values was generated for each LSSD map, and the average and standard deviation (LSSD_AVG, LSSD_SD) were used as metrics to quantify how the trainee's present contour as a whole deviated from the reference contour. These values were computed for the initial and final LSSD maps (LSSD_AVG(i), LSSD_SD(i), LSSD_AVG(f), LSSD_SD(f)). For each trainee and each training case, the difference between LSSD_SD(i) and LSSD_SD(f), denoted ΔLSSD_SD, was computed to assess the functionality of the training tool. Statistical significance was computed by using a two-tailed F test at the 95% confidence level. After this, the initial LSSD maps of all trainees and all training cases were combined into a single data set to compare with the consolidated final LSSD maps. The same metrics were computed for the data sets to assess the overall performance of the training tool.
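The summary metrics can be sketched in Python as follows. The sketch assumes an LSSD map is a list of rows (one per slice) of signed distances in mm, with `None` for empty cells, and that the sample standard deviation is used; neither detail is stated in the text. It returns the F statistic as the ratio of initial to final variance. Converting that ratio to a two-tailed p-value requires an F distribution from a statistics library and is not shown. The function names are hypothetical.

```python
import statistics


def summarize(lssd_map):
    """Return (mean, standard deviation) over all non-empty cells of an LSSD map."""
    values = [v for row in lssd_map for v in row if v is not None]
    return statistics.mean(values), statistics.stdev(values)


def delta_lssd_sd(initial_map, final_map):
    """Return (initial SD minus final SD, F statistic = initial variance / final variance)."""
    _, sd_i = summarize(initial_map)
    _, sd_f = summarize(final_map)
    return sd_i - sd_f, (sd_i ** 2) / (sd_f ** 2)
```

Aggregating across trainees amounts to concatenating the rows of all initial maps into one list, doing the same for the final maps, and calling `delta_lssd_sd` on the two combined maps.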

Results
The contour training tool was successfully developed, validated, and used by nine different individuals to ensure proper function of the contouring interface. Figure 4 shows a plot of LSSD_AVG and LSSD_SD as a trainee progressed through contouring (the x-axis represents each time an LSSD map was regenerated). The graph of LSSD_SD represents each trainee's progression during the contouring process, in that it trends towards 0 as the trainee uses the contour training tool to finalize their contours.
Next, LSSD_AVG(i), LSSD_SD(i), LSSD_AVG(f), and LSSD_SD(f) were computed for each trainee and each training case; they are tabulated in Table 1 for the heart and Table 2 for the LV. For every case, including the retention cases, a statistically significant ΔLSSD_SD was observed. The ΔLSSD_SD is plotted for each trainee in Figure 5 for the heart and in Figure 6 for the LV. For all trainees and all cases, the initial and final LSSD maps were aggregated and a ΔLSSD_SD was computed for both the training and retention cases. In that comparison, the ΔLSSD_SD for the heart was 3.4 mm for the training cases and 2.7 mm for the retention set. For the LV, the ΔLSSD_SD for the entire set was 4.1 mm for the training cases and 4.4 mm for the retention cases. These statistically significant ΔLSSD_SD values are a strong indication that the contour training tool aided the trainees in the contouring process so that their contours became more consistent with the reference expert contours after the training. No statistically significant differences in ΔLSSD_SD were observed between each trainee's last training case and the retention cases.
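As a consistency check, the aggregate values agree with the standard deviations reported in the abstract, taking the difference as the initial minus the final standard deviation (all values in mm):

```latex
\begin{align*}
\Delta\mathrm{LSSD}_{SD}\ (\text{heart, training})  &= 4.2 - 0.8 = 3.4 \\
\Delta\mathrm{LSSD}_{SD}\ (\text{heart, retention}) &= 3.4 - 0.7 = 2.7 \\
\Delta\mathrm{LSSD}_{SD}\ (\text{LV, training})     &= 4.8 - 0.7 = 4.1 \\
\Delta\mathrm{LSSD}_{SD}\ (\text{LV, retention})    &= 5.1 - 0.7 = 4.4
\end{align*}
```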

Discussion
Variability is well known to exist in the contouring process and remains a challenge in radiotherapy (3,5,8). The variability not only results from differences in clinicians' experience and in their training on how to generate consensus contours, but also can be significantly influenced by the image quality of the contouring dataset. Our previous work has shown this variability in contouring cardiac substructures. Calcification, metal artifacts, and blurring from respiratory motion can all contribute to contouring variability (17). Although contour training can reduce this variability, currently available training methods or tools are either very time-consuming, lack real-time quantitative feedback, or are susceptible to variability among even the experienced radiation oncologists who provide the training. The contour training tool developed here can distribute expert contouring knowledge to a broad range of trainees with different backgrounds through established formal training sessions, so the trainees can improve their consensus contouring skills without the need for radiation oncology experts to be present. One substantial advantage of this tool is that it provides immediate feedback to the user as they contour a structure; historically, such feedback has been provided by a supervising radiation oncology staff member or by evaluating current anatomy on a CT scan against a reference, i.e., a peer-reviewed data set. We demonstrated the effectiveness of this tool by contouring the heart and LV; however, this tool can easily be extended to contour other organs or treatment targets. Expanding this tool to cover other anatomical sites such as the head and neck is the subject of our future study. Indeed, our tool has enormous potential for reaching end users who require a means of accurately delineating structures with limited training resources (including access to radiation oncologist experts). We expect that this tool will be particularly useful in low- and middle-income countries where trained radiation oncology staff are needed but available resources are limited. In addition, autocontouring has become increasingly popular and is gradually replacing manual contouring in routine clinical practice. However, quality checks of autocontours still rely on clinicians. Therefore, this tool remains useful for training clinical staff to identify correct anatomical structures for autocontouring quality assurance.
FIGURE 3 Representative initial and final LSSD maps for one trainee. Two LSSD colormaps are presented as the contourer progresses through a training session. The initial LSSD map (left) has large sections of red and blue, indicating that the trainee's contour is more than 3 mm from the reference contour in that sector. The final LSSD map (right) has an overall color closer to green, indicating that the trainee's contours are approaching an LSSD of 0.
Our tool creates the reference contour by fusing the trainee contour with expert contours by using the STAPLE algorithm. By doing so, we acknowledge that the 'ground truth' contour is unknown. The reference contour is the consensus contour contributed by both the experts and the trainee. The STAPLE algorithm is based on maximum likelihood estimates of the sensitivity and specificity (true-positive and true-negative rates) of individual contours. If the trainee generates contours close to the expert contours, a higher weight is assigned to the trainee contour in generating the reference contour, so the reference contour could potentially favor the evaluation. On the other hand, if the trainee generates contours far from the expert contours, the contribution of the trainee contour to the reference contour will be small, which will be unfavorable to the evaluation. This method has the potential to increase the sensitivity of consensus contour evaluation and also reduces the impact of inconsistent expert contours on the evaluation.
To properly generate consensus contours (reference contours) by using the STAPLE algorithm, at least three expert contours are needed for each training structure. Also, more expert contours are preferred to reduce the influence of the trainee contour on the reference contour. Therefore, expanding this software tool to cover training for other organs or treatment targets will require a significant effort to curate the expert contours. Establishing the gold standard via expert contours is the key to the use of this tool. A diverse group of expert contours needs to be evaluated. Also, the quality of the contour training depends on the quality of the expert contours, which can vary from physician to physician and across institutions. This data curation process is often quite time-consuming. As more high-quality benchmark datasets become available, such as those in The Cancer Imaging Archive (26), we expect to be able to easily expand the usability of our contour training tool to include more training cases and structures.
One limitation of this software tool is that our new LSSD metric can only handle regularly shaped structures whose geometric center lies within the contour. Most normal organs have a regular shape in 2D slices, and their contouring can be trained using this tool. However, for some structures with a complicated shape, such as the optic chiasm and brachial plexus, our software tool is not applicable. In addition, this tool works only on axial slices, whereas some contours are better generated in the sagittal and coronal planes. Further development of this utility would need to accommodate contouring on nonaxial reconstructed planes.

Conclusion
A software utility that served as a contour training tool was developed, tested, and implemented. This tool allowed users to be trained on the contouring process with real-time feedback on their contouring performance in terms of consistency with multiple contours by expert radiation oncologists. The software was designed with flexibility in mind so that it can be used to contour any anatomic site. For all cases tested, the trainees were able to use the training software to modify their contours to be more consistent with those of the experts. Although this study was done as a proof of principle, the software could easily be implemented on a larger scale for radiation oncology residents, junior faculty, and even senior faculty who need a refresher course on contour training. This tool could also be used for training dosimetrists and therapy staff who wish to improve both their knowledge of and consistency in anatomic contouring.

FIGURE 2 Contouring interface with LSSD map. An example of the contour training interface showing the contouring panel (left) and the interactive LSSD map (right).

FIGURE 4 LSSD_AVG and LSSD_SD during contour training. Plot of the LSSD_AVG and LSSD_SD as trainee #9 progressed through the contour training process while contouring the heart. The x-axis on this graph represents each time the LSSD map was updated. ROI, region of interest; OAR, organ at risk.

FIGURE 5 ΔLSSD_SD for the heart. ΔLSSD_SD values for the heart are plotted for each assigned training case as well as the retention case. For each case, all nine trainees' results are displayed.