Evaluation of the Hippocampal Normal Tissue Complication Model in a Prospective Cohort of Low Grade Glioma Patients—An Analysis Within the EORTC 22033 Clinical Trial

Purpose: To evaluate the performance of the hippocampal normal tissue complication probability (NTCP) model, which relates dose to the bilateral hippocampus to memory impairment at 18 months post-treatment, in a population of low-grade glioma (LGG) patients. Methods: LGG patients treated within the radiotherapy-only arm of the EORTC 22033-26033 trial were analyzed. Hippocampal dose parameters were calculated from the original radiotherapy plans. The difference in Rey Auditory Verbal Learning Test delayed recall (AVLT-dr) performance pre- and 18 (±4) months post-treatment was compared to reference data from the Maastricht Aging Study. The NTCP model published by Gondi et al. was applied to the dosimetric data, and model predictions were compared to the actual neurocognitive outcome. Results: A total of 29 patients met the inclusion criteria. Mean dose in EQD2 Gy to the bilateral hippocampus was 39.8 Gy (95% CI 34.3–44.4 Gy); the median dose to 40% of the bilateral hippocampus was 47.2 EQD2 Gy. The model predicted a risk of memory impairment exceeding 99% in 22 patients. However, only seven patients were found to have a significant decline in AVLT-dr score. Conclusions: In this dataset of only LGG patients treated with radiotherapy, the hippocampal NTCP model did not perform as expected in predicting cognitive decline based on dose to 40% of the bilateral hippocampus. Caution should be taken when extrapolating this model outside the range of dose-volume parameters in which it was developed.


Keywords: NTCP (normal tissue complication probability) model, low grade glioma (LGG), model verification and validation, neurocognition, memory, late effect of cancer treatment, radiotherapy-adverse effects

INTRODUCTION
Low grade gliomas (LGG) are a group of relatively slow growing primary brain neoplasms, mainly occurring in those between 30 and 50 years of age (1,2). Modern treatment for LGG patients comprises surgery followed by radiotherapy and adjuvant chemotherapy (3). Overall survival was recently reported to be 13.3 years (4), but can vary with molecular subtype.
With many LGG patients living for years or even decades after treatment, the late adverse effects of treatment on quality of life and neurocognitive functioning are of increasing importance. Although both the tumor itself and the use of anticonvulsant therapy have a deleterious effect on neurocognitive function (5,6), radiotherapy (RT) in particular has been associated with a negative impact on neurocognitive function. This late effect of radiotherapy was found in several series with a longer follow-up (7,8). However, it was not found in several studies that limited observation to the first 5 years (9-12).
A dose-response relationship with decreasing neurocognitive performance (specifically memory) has been attributed to the hippocampal area (13). An NTCP model for memory impairment was proposed by Gondi et al. (14). In that study, 18 patients undergoing fractionated stereotactic radiotherapy for benign and low-grade tumors (9 vestibular schwannomas, 2 pituitary adenomas, 3 meningiomas, and 4 low grade gliomas) completed a comprehensive baseline neurocognitive assessment and a repeat assessment at 18 months. A control group of 6 non-irradiated subjects was tested as well, allowing for the use of Z scores for performance change. Doses in excess of 7.3 EQD2 Gy to 40% of the bilateral hippocampus were found to be significantly correlated with a decrease in Wechsler Memory Scale III-Word Lists delayed recall score, a test that measures verbal memory performance.
Although this model is routinely used in the clinic, its performance has not yet been quantified in the setting of partial brain irradiation in a population of LGG patients. We analyzed data from a recently completed and published randomized phase III trial, in which LGG patients in the control arm were treated exclusively with focal radiotherapy up to 50.4 Gy (15), and compared the predicted risk of neuropsychological impairment with the actual outcome.

Clinical Trials Unit). The study was approved by the institutional review boards and ethics committees of all participating centers. All patients provided written informed consent at the time of registration (15).

Patient Population
In the aforementioned trial, patients 18 years of age or older with histologically confirmed and centrally reviewed low-grade (WHO 2) glioma (diffuse astrocytoma, oligoastrocytoma and oligodendroglioma, WHO classification 2006) with at least one high-risk feature (age >40 years, progressive disease, tumor size >5 cm, tumor crossing midline, any focal neurological deficit) were randomly assigned to treatment with either radiotherapy (28 × 1.8 Gy) or temozolomide chemotherapy. Between September 2005 and March 2010, 477 patients were randomized. The study design, treatment details and the results of the primary analysis have been described elsewhere (15). A total of 103 patients from preselected medical centers also underwent a detailed neurocognitive examination consisting of the Rey Auditory Verbal Learning Test (AVLT), Concept Shifting test, Categoric Word Fluency test, and the Digit-Symbol Substitution test. Neurocognitive tests were conducted at randomization and then every 6 months until tumor progression or death.
The analysis presented herein includes patients with retrievable radiotherapy planning data and neuropsychological testing at both baseline and 18 (±4) months. The neurocognitive analysis for the entire patient population of EORTC 22033-26033 is reported elsewhere (16). The present study was conducted according to the principles of the Declaration of Helsinki (59th WMA General Assembly, Seoul, October 2008) and in accordance with the local medical research regulations. The study protocol was presented to the local Medical Ethics Committee (MEC-2017-321). No ethical approval was deemed necessary and the requirement for additional informed consent was waived.

Neuropsychological Assessments
One of the tests in the neuropsychological assessment is the AVLT, which assesses various aspects of learning and recall. The delayed recall condition (AVLT-dr) requires patients to memorize a list of 15 words over five consecutive trials, and to recall these 15 words after 20 min. The maximal score is 15 out of 15. This test is conceptually identical to the delayed recall condition of the Wechsler Memory Scale III-Word Lists used by Gondi et al. as the primary outcome measure.
In contrast to the original paper by Gondi et al., EORTC 22033-26033 does not include a control group of healthy volunteers. Normative data for the AVLT-dr, including test-retest changes, have been published by the Maastricht Aging Study group (17). This study tested healthy volunteers using several neuropsychological tests at 2.5 year intervals and gives parameters for a regression-based change analysis of test-retest performance. The following relationship between age and change in AVLT-dr retest score was found.
Where E is the expected change between test and retest score. This can be converted to a Z score using the standard deviation of the residuals (which was found to be 2.362 in this test condition):

$$Z = \frac{O - E}{2.362}$$

Where O is the observed retest score, and E is the predicted retest score. As reported in the paper by Gondi et al., a neurocognitive event was defined as a reduction in AVLT-dr score at 18 months corresponding to a Z score lower than −1.5.
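The event definition can be illustrated with a minimal Python sketch. It assumes that the expected value E has already been obtained from the age-based regression of the Maastricht Aging Study, whose coefficients are not reproduced here, and that O and E are expressed on the same scale. The function names and the example values are hypothetical.

```python
# Standard deviation of the regression residuals quoted in the text
# for the AVLT-dr test condition.
SD_RESIDUAL = 2.362

# Threshold used to define a neurocognitive event (Z score lower than -1.5).
EVENT_THRESHOLD = -1.5


def change_z_score(observed, expected, sd_residual=SD_RESIDUAL):
    """Regression-based change Z score: (O - E) / SD of the residuals.

    `observed` and `expected` must be on the same scale (both retest
    scores, or both changes from baseline).
    """
    return (observed - expected) / sd_residual


def is_cognitive_event(z, threshold=EVENT_THRESHOLD):
    """True when the Z score falls below the event threshold."""
    return z < threshold


# Hypothetical patient: observed change of -5 words where a change of
# -0.5 words was expected for the patient's age.
z = change_z_score(observed=-5.0, expected=-0.5)
print(round(z, 2), is_cognitive_event(z))
```

Because the threshold is expressed in units of the residual standard deviation, a decline only counts as an event when it is about 3.5 words (1.5 × 2.362) larger than the age-expected change.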

Radiotherapy Treatment
Patients were treated with photon radiotherapy using 3D conformal radiotherapy (3DCRT), fractionated stereotactic radiotherapy (FSRT) or intensity modulated radiotherapy (IMRT) techniques, depending on the availability at the institution. Gross tumor volume (GTV) was defined by the region of high signal intensity on T2-weighted or FLAIR MRI sequences, or, in case of prior surgery, the resection cavity and the residual tumor. Clinical target volume (CTV) margin was 10-15 mm. Planning target volume (PTV) margin was 7 mm for all patients. As required per protocol, the contralateral hemisphere was spared, but no specific attempt at sparing one or both hippocampi was made.

Delineation and DVH Analysis
A rigid registration was applied between the planning CT and MRI using MIM Software (Cleveland, OH, USA). Hippocampus delineation followed the instructions of the publicly available atlas from RTOG 0933 (18). In case no registration was possible, delineation was performed on CT using anatomical landmarks visible on MRI. Dose volume histograms (DVH) and subsequent DVH parameters were generated for the left and right hippocampus individually and for the composite bilateral hippocampi. As presented in the paper by Gondi et al., we assumed an α/β value of 2 Gy to convert physical dose to biologically equivalent dose in 2 Gy fractions (EQD2 Gy). The Dx% of the bilateral hippocampus was defined as the dose in EQD2 Gy received by x% of the bilateral hippocampal volume.
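As an illustration of the dose conversion, the sketch below applies the standard linear-quadratic EQD2 formula with α/β = 2 Gy and derives a Dx% value from a list of per-voxel doses. It is a simplified example under stated assumptions, not the software used in this study: voxels are assumed to have equal volume, and the function names and the voxel doses are hypothetical.

```python
import math


def eqd2(total_dose_gy, n_fractions, alpha_beta_gy=2.0):
    """Equivalent dose in 2 Gy fractions (linear-quadratic model).

    EQD2 = D * (d + alpha/beta) / (2 + alpha/beta), where d is the dose
    per fraction received at the point of interest.
    """
    dose_per_fraction = total_dose_gy / n_fractions
    return total_dose_gy * (dose_per_fraction + alpha_beta_gy) / (2.0 + alpha_beta_gy)


def d_x_percent(voxel_doses, x):
    """Minimum dose received by the hottest x% of the volume.

    Assumes all voxels have the same volume (an assumption of this sketch).
    """
    ordered = sorted(voxel_doses, reverse=True)
    # Number of voxels making up x% of the volume, at least one.
    k = max(1, math.ceil(x * len(ordered) / 100 - 1e-9))
    return ordered[k - 1]


# Prescription dose of the trial: 28 fractions of 1.8 Gy.
print(round(eqd2(50.4, 28), 2))

# Hypothetical physical doses (Gy) to five hippocampal voxels, converted
# voxel by voxel before the D40% is read off.
physical = [10.0, 20.0, 30.0, 40.0, 50.0]
converted = [eqd2(d, 28) for d in physical]
print(round(d_x_percent(converted, 40), 2))
```

The conversion is applied per voxel because the dose per fraction, and therefore the EQD2 correction, differs between voxels; with α/β = 2 Gy, doses delivered at less than 2 Gy per fraction are scaled down.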

Statistical Approach
Descriptive statistics were generated for age, tumor laterality, tumor lobe, anti-epileptic drug treatment (AED), education, CTV volume, and hippocampal dosimetry (Table 1). The model used by Gondi et al. is based on the Lyman model (19). Their formulation was presented as follows:

$$\mathrm{NTCP} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^{2}/2}\, dx, \qquad t = \frac{D - TD_{50}}{m \cdot TD_{50}}$$

Where D is the dose to 40% of the bilateral hippocampus in EQD2 Gy, t is a function of TD50, the dose to 40% of hippocampus at which the probability of neurocognitive decline is 50%, and m is a slope parameter.
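The model can be sketched in a few lines of Python. This example assumes the standard probit form of the Lyman model, with t = (D − TD50) / (m · TD50), and uses the parameter values reported by Gondi et al. (TD50 = 14.88 EQD2 Gy, m = 0.54). The function name is hypothetical.

```python
import math

# Parameter values reported by Gondi et al. for the D40% of the
# bilateral hippocampus.
TD50_GY = 14.88
M_SLOPE = 0.54


def lyman_ntcp(d40_eqd2_gy, td50=TD50_GY, m=M_SLOPE):
    """NTCP as the standard normal cumulative distribution evaluated at t."""
    t = (d40_eqd2_gy - td50) / (m * td50)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# NTCP at the dose threshold of Gondi et al., at TD50, and at the median
# D40% of the present cohort.
for dose in (7.3, 14.88, 47.2):
    print(dose, round(lyman_ntcp(dose), 4))
```

At the median D40% of this cohort (47.2 Gy), t is approximately 4, so the predicted probability is very close to 1. This is the region of the curve where the model saturates.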
In the paper published by Gondi et al., the obtained values of TD50 and m were 14.88 and 0.54, respectively. We applied this model to generate predicted NTCP values for the dose distributions in our study population. Cases were grouped in three bins of equal size, according to ascending NTCP. In order to compute the observed risk, the incidence of a neuropsychological event in each bin was computed. The predicted NTCP was plotted against the observed NTCP in a calibration plot. Next, a linear regression was performed. The regression coefficients can be used to calibrate the model to the dataset: the constant can be used as an offset parameter, and the slope indicates over- or underestimation of the observed risk.
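The binning and regression steps can be sketched as follows. This is an illustrative reading of the procedure rather than the original analysis code: it assumes that the observed incidence per bin is regressed on the mean predicted NTCP per bin, that bins are made as equal in size as the number of cases allows, and the function names and data are hypothetical.

```python
def calibration_bins(predicted, outcomes, n_bins=3):
    """Sort cases by predicted NTCP and split them into near-equal bins.

    Returns one (mean predicted NTCP, observed event incidence) pair per bin.
    """
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        bins.append((mean_predicted, observed))
    return bins


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical predictions and binary outcomes (1 = cognitive event).
predicted = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9, 0.95, 1.0]
outcomes = [0, 0, 0, 0, 1, 0, 0, 1, 1]

bins = calibration_bins(predicted, outcomes)
intercept, slope = fit_line([p for p, _ in bins], [o for _, o in bins])
print(bins, intercept, slope)
```

Under this reading, a perfectly calibrated model would give an intercept of 0 and a slope of 1, and a slope below 1 means that observed risk rises more slowly than predicted risk.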
In order to quantify model performance, the Brier score (BS) was calculated for the original formulation of the model. BS is a measure of the accuracy of a prediction with a binary outcome:

$$BS = \frac{1}{n} \sum_{a=1}^{n} \left( f_a - o_a \right)^{2}$$

Where n is the number of observations, f_a is the probability that was forecast, and o_a is the outcome (1 if the event occurs and 0 if it does not occur). A low Brier score is indicative of good model performance; it reflects close agreement between forecast and outcome.
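A minimal Python sketch of the Brier score, assuming its standard definition as the mean squared difference between forecast probability and binary outcome, is given below. The function name and the example values are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and outcomes.

    `forecasts` are probabilities in [0, 1]; `outcomes` are 1 (event)
    or 0 (no event).
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and of equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: confident forecasts that are mostly wrong.
print(brier_score([0.99, 0.99, 0.99, 0.10], [0, 0, 1, 0]))
```

The score ranges from 0 (perfect forecasts) to 1 (forecasts that are certain and always wrong). As a point of reference, an uninformative forecast of 0.5 for every case scores 0.25.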

Other Predictive Parameters
In addition to evaluating the performance of the NTCP model, we investigated if CTV volume, laterality, age, handedness, and WHO performance score were associated with cognitive deterioration. To this end, using Spearman's correlation coefficient, a correlation matrix was made to identify if bilateral and contralateral hippocampal DVH parameters correlated with cognitive deterioration.

Power Considerations
In the paper by Gondi

Patient Data
Of 477 patients within EORTC 22033-26033, 103 patients underwent full neurocognitive testing. Of these, 54 patients were treated with radiotherapy only. Of these, 33 patients had a complete neurocognitive assessment at baseline and at a median follow-up of 18.5 months (95% CI 17.3-18.9). Complete original dosimetry data were available for 31 patients. Two patients were excluded due to clinically progressive disease at the time of neurocognitive outcome assessment (Figure 1).

Patient characteristics are presented in Table 1. Median age of patients at randomization was 43 years (95% CI 39-47). Only three patients did not require anti-epileptic medication. Sixteen tumors were left sided, 10 right sided, and three were bilateral. Final resection status was biopsy only in 15 patients, gross total resection in two patients, and partial resection in twelve patients. An IDH mutation was present in 27 patients, absent in one patient and undetermined in one patient. A 1p/19q codeletion was present in 10 patients, absent in 14 patients, and undetermined in five patients.

Twenty-eight patients were treated to a dose of 50.4 Gy in 28 fractions; one patient was treated to a dose of 54 Gy in 30 fractions. Twenty-five patients were treated with 3DCRT, three with IMRT and two with fractionated stereotactic radiotherapy. Mean CTV volume was 340 cc (95% CI 276-403). Mean dose in EQD2 Gy to the bilateral hippocampi was 31.4 Gy (95% CI 27.2-35.6). The mean D40%BH was 40.9 Gy (95% CI 35.8-46.0), and the median D40%BH was 47.2 Gy. Only one patient had a D40%BH lower than 7.3 Gy. Mean dose in EQD2 Gy to the contralateral hippocampus was 21.6 Gy (95% CI 16.7-26.9).

Overall, there was no significant difference between pre- and post-radiotherapy AVLT-dr score (95% CI 1.09-2.16; Figure 2). A cognitive event was scored in seven patients (24.1%). At the time of analysis, the median time to progression in 14 patients was 2.9 years (95% CI 2.2-3.6). Fifteen patients were free of progressive disease after a median follow-up duration of 3.3 years.
We compared the subgroup of patients with available data (n = 31) with the rest of the study population (n = 446). The groups were comparable with respect to tumor laterality, tumor lobe, performance status, progression free survival, and presence of a 1p/19q codeletion. However, the proportion of IDH wildtype tumors was significantly lower in the subgroup with available data than in the rest of the study population (3.2 vs. 14%, p = 0.025, see Supplementary Data).

Model Performance
We were unable to compare the incidence of cognitive events between the high and low dose group as described in the paper by Gondi et al. (D40%BH < 7.3 Gy), as there was only one case in the low dose group. However, there was no difference in the incidence of a cognitive event between the group that received a D40%BH above vs. below the median (47.2 Gy) in this study (14 vs. 25%, p = 0.68). NTCP values are presented in Table 2 with dosimetry and neurocognitive results. A calibration plot is presented in Figure 3. Linear regression showed a constant of 0.03 (p = 0.60) and a slope of 0.24 (p < 0.01) at an r² of 0.346. The Brier score of the model was 0.63.

Dosimetric Parameters
A heat map of the correlation matrix is presented in Figure 4. Increasing age (p = 0.04) and tumor localization in the left hemisphere (p = 0.01) were related to poorer neurocognitive outcome at 18 (±4) months. None of the bilateral hippocampal dose volume parameters (D10%, D20%, D30% up to D90%, D95% and mean dose) exhibited a significant correlation with outcome.

DISCUSSION
To the best of our knowledge, this is the first attempt to quantify the performance of the hippocampal NTCP model within a group of only LGG patients treated with partial brain irradiation. This model was used in RTOG 0933 (hippocampal sparing whole brain radiotherapy vs. standard whole brain radiotherapy in brain metastases) and in the recently presented phase III trial exploring WBRT plus memantine, with or without hippocampal avoidance (NRG-CC001) (18,20). Brain metastases are almost never observed in the hippocampus, and selective avoidance of this region is not likely to result in a higher risk of intracranial recurrence (21). This is less clear in LGG, where tumor cells are known to be present within the entire brain (22). Moreover, subventricular zone involvement has been shown to be a biomarker for poor prognosis (23), making the hippocampus a potential treatment target.
In the calibration procedure, the slope of the linear regression (0.24, well below 1) indicates an overestimation of NTCP values by the model in this dataset. The high Brier score indicates poor model performance. In comparing the two study groups, the incidence of a neurocognitive event is similar (29.2 vs. 24.1% in this study) but the range of hippocampal dose is quite different. The median D40%BH in the paper by Gondi et al. was 7.3 Gy, above which an NTCP of 66.7% was observed. By contrast, the median D40%BH in this paper is 47.2 Gy and all but one of the patients in the present study received a D40%BH in excess of 7.3 Gy.

In comparing the two groups, there are substantial differences in the delivery technique and target volume. In the paper by Gondi et al., most patients were treated without a CTV expansion and with limited PTV margins (2 mm) using highly conformal dose distributions. In the present study, patients were treated with a CTV margin of 10-15 mm and a larger PTV margin (7 mm), resulting in substantially larger target volumes, and the delivery technique was mainly 3DCRT. It is likely that this resulted in higher doses to the bilateral hippocampus in this study, to a degree that almost none of the patients were in the low dose group. As such, we were unable to compare the incidence of neurocognitive impairment between the high dose and the low dose group. However, the hippocampal doses in this study group are probably a good representation of the hippocampal dose range found in LGG patients undergoing radiotherapy. Therefore, this study should not be read as a formal refutation of the hippocampal NTCP model, but rather as a caution against extrapolating an NTCP model beyond the dose range in which it was developed. A similar issue was encountered by Moiseenko in comparing NTCP models for radiation toxicity to the visual apparatus (24).
Since no significant correlation between dosimetric parameters and outcome was observed, we were unable to generate an alternative model from this dataset.
The choice of endpoint, neurocognitive failure at 18 months after radiotherapy, is debatable in LGG patients. Trials that found a significant effect of radiotherapy on neurocognitive function typically only did so after a follow-up >5 years (7,8), whereas several trials with a shorter follow-up found no significant, or only transient, deleterious effects (9-12,25). This raises the question of whether neurocognitive impairment at 18 months is indeed indicative of a persistent neurocognitive deficit. Although preclinical and radiological (26,27) data demonstrated appreciable changes within the hippocampus after radiotherapy, a relationship between cognitive performance and a D40% as low as 7.3 EQD2 Gy was found neither in the current study nor in other studies. In the setting of prophylactic WBRT in small cell lung cancer and partial brain irradiation for glioblastoma multiforme, Ma et al. (28) demonstrated a D50% of 22.1 Gy to be associated with a 20% risk of a significant decline in Hopkins Verbal Learning Test (HVLT)-delayed recall score. Peiffer et al. (29) identified the volume of the bilateral hippocampi receiving 60 Gy as a possible predictor for cognitive decline. The analysis by Okoukoni et al. (30) established a correlation between post-treatment HVLT score (no baseline measurement was done) and even higher doses to the bilateral hippocampi. Here, hippocampal V55 Gy of 0, 25, and 50% were associated with post-radiation impairment rates of 14.9, 45.9, and 80.6%, respectively.

FIGURE 2 | Histogram of differences in AVLT-dr score per patient (baseline minus follow-up). Overall, there was no significant difference between pre- and post-radiotherapy AVLT-dr score.
In this study, we used prospectively acquired baseline and follow-up data from the recently completed EORTC 22033-26033 trial, ensuring a homogeneous patient group with good adherence to protocol. The subset of patients included in this analysis is a relatively small proportion of the radiotherapy-only group (15%). The main reason for this is that neurocognitive testing was not mandatory, and a number of centers did not perform neurocognitive testing. However, we found no significant differences in clinical variables (save for presence of IDH mutation) and time to progressive disease between our subset and the rest of the study population. In contrast to the neurocognitive event definition used in the paper by Gondi et al., ours did not rely on a control group but on published test-retest data from the Maastricht Aging Study. These data are derived from a study group that is older (49-56 years) than the average patient in our study (43 years), and the test-retest interval is twice as long (3 years).
In this dataset of only LGG patients, the NTCP model did not perform as expected in predicting cognitive decline based on dose to the bilateral hippocampus. Clearly, the understanding of the relationship between dose to subsites in the CNS and neurocognitive functioning is still limited, and there is a paucity of prospective neuropsychological and dosimetric data with an adequate duration of follow-up.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
This study was carried out in accordance with the Dutch Medical Research (Human Subjects) Act (WMO). The protocol was approved by the Medical Research Ethics Committee Erasmus MC. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

AUTHOR CONTRIBUTIONS
The study was conceptualized and the manuscript was written by JJ and AM. The statistical analysis was done by JJ and MH. MB, MT, and MK helped interpret the data. JJ, RW, DE, FL, and AL were involved in data collection. All authors critically reviewed the manuscript and commented on the final version. The final authorship position is shared between BB and MK.