Quantitative Evaluation of Therapeutic Response by FDG-PET–CT in Metastatic Breast Cancer

Purpose: To assess therapeutic response in metastatic breast cancer with 18F-FDG positron emission tomography (PET), this retrospective study compares the performance of six metabolic metrics, evaluated with PERCIST thresholds and with optimal thresholds, and of an image-based parametric approach. Methods: Thirty-six patients with metastatic breast cancer underwent 128 PET scans, and 123 lesions were identified. In per-lesion and per-patient analyses, the performance of six metrics, namely maximum standardized uptake value (SUVmax), SUVpeak, standardized added metabolic activity (SAM), SUVmean, metabolic volume (MV), and total lesion glycolysis (TLG), and of a parametric approach (SULTAN) was determined and compared with the gold standard (defined by clinical and biological assessment and conventional imaging according to RECIST 1.1). The evaluation was performed using PERCIST thresholds (for the per-patient analysis only) and optimal thresholds (determined by the Youden criterion from the receiver operating characteristic curves). Results: In the per-lesion analysis, 210 pairs of lesion evolutions were studied. Using the optimal thresholds, SUVmax, SUVpeak, SUVmean, SAM, and TLG were significantly correlated with the gold standard. SUVmax, SUVpeak, and SUVmean reached the best sensitivity (91, 88, and 83%, respectively), specificity (93, 95, and 97%, respectively), and negative predictive value (NPV; 90, 88, and 83%, respectively). In the per-patient analysis, 79 pairs of PET scans were studied. Compared with the PERCIST threshold, the optimal thresholds did not improve performance for SUVmax, SUVpeak, and SUVmean. Only SUVmax, SUVpeak, SUVmean, and TLG were correlated with the gold standard. SULTAN performed comparably, with 83% sensitivity, 88% specificity, and 86% NPV. Conclusion: This study showed that SUVmax and SUVpeak were the best parameters for PET evaluation of metastatic breast cancer lesions. Parametric imaging is helpful in evaluating serial studies.

Introduction
Breast cancer is metastatic at initial diagnosis in 6-10% of cases and becomes metastatic during follow-up in 30% of cases (1). The treatment strategy in this situation is mainly based on chemotherapy, hormonal therapy, targeted therapies, and possibly external radiotherapy. Accurate and early assessment of therapeutic efficacy is a major challenge, but it is crucial for limiting toxicity and reducing the use of expensive treatments.
Therapeutic responses of solid tumors are conventionally assessed using the international standard RECIST 1.1 (2). However, RECIST has a number of intrinsic limitations: moderate reproducibility of tumor measurement (3), late occurrence of morphological response compared with early metabolic changes, and inapplicability to non-measurable lesions (bone lesions, lymphangitis, and effusions) and to targeted cytostatic therapies. Functional imaging by positron emission tomography (PET) with 18F-fluorodeoxyglucose (18F-FDG) represents a potential alternative (4,5). Specific evaluation criteria for metabolic responses have been previously defined. These include quantitative metrics and visual analysis tools to classify tumor progression and response, as defined by the European Organization for Research and Treatment of Cancer (EORTC) (6) or PERCIST (3).
18F-FDG PET has shown value in breast cancer management (7): for initial staging of locally advanced cancers (stages II-III) and/or inflammatory lesions (8), detection of recurrence with better performance than conventional imaging (7,9,10), evaluation of therapeutic response to neo-adjuvant therapy in inoperable locally advanced cancers, before conservative surgery, or in inflammatory lesions (7,11), and therapy evaluation in metastatic disease (5,12-18). However, although 18F-FDG PET has proved valuable in several clinical studies, it is not used in clinical practice for therapy assessment because image interpretation is not standardized (12). Some studies suggested a benefit of using semi-quantitative analysis (mainly the change in SUVmax or SUVmean between two PET scans) rather than visual analysis only. However, the best metric and the optimal threshold were not clearly defined. Moreover, none of these studies was based on the PERCIST approach proposed by Wahl et al. (3).
Semi-quantitative methods (3-6) have been proposed for therapeutic evaluation using PET to improve reproducibility, based on the percentage variation of a metric (SUVmax for EORTC and SULpeak for PERCIST). Yet they have not been validated for specific tumors, especially breast cancer (12). Moreover, some requirements of PERCIST (mainly a tumor size >2 cm and no difference in liver signal between the two PET scans) may be difficult to meet in clinical practice.
New evaluation methods based on parametric analysis are also being developed, although the best metrics and optimal thresholds have not been clearly defined (19). The SULTAN method (longitudinal monitoring in tomography using factor analysis), for example, is a semi-automatic method that assists tumor response assessment by studying metabolic change at the voxel level (20,21). SULTAN provides a parametric map of the tumor metabolic change using two or more PET scans and allows the heterogeneity of response within the tumor to be determined.
The first objective of this retrospective study was to compare the performance of different metabolic metrics on a per-lesion and per-patient basis in the assessment of therapeutic response in metastatic breast cancer.
The second objective was to assess the benefit of parametric imaging (SULTAN) in this population.

Patients and Imaging Protocols
For this single-center study conducted from September 2009 to July 2014, 36 patients (median age 63.5 years, range: 39-85 years) with breast cancer of any histological grade and metastatic involvement (i.e., initially metastatic or metastatic following diagnostic evaluation) underwent at least two 18F-FDG PET scans using the same PET system in the course of their therapy. Tumor phenotypes were classified as 26 invasive ductal carcinomas, 6 invasive lobular carcinomas, 3 intraductal carcinomas, and 1 colloid carcinoma. Twenty-eight tumors were estrogen receptor (ER) positive, 21 were progesterone receptor (PR) positive, 4 showed HER2 over-expression, and 6 were triple-negative. Treatments consisted of adjuvant chemotherapy, hormonal therapies, targeted therapies, Herceptin, and/or radiotherapy. A total of 128 PET scans were acquired (median of 3 PET scans per patient, range: 2-9) with a median time interval of 3.7 months between two PET scans (range: 1.1-19.6). A total of 123 lesions were analyzed: 44 lymph nodes, 43 bone lesions, 17 liver lesions, 10 breast lesions, 5 lung lesions, and 4 peritoneal carcinomatosis lesions. A total of 79 pairs of PET scans were analyzed in 36 patients.
Positron emission tomography scans were conducted in patients fasted for at least 6 h, with blood glucose <10 mmol/L, 1 h after injection of 3 or 7 MBq/kg of 18F-FDG (depending on the PET system used), using either a Siemens Biograph mCT 40 camera (Siemens Healthcare Molecular Imaging USA, Inc.) or a General Electric Discovery LS (GE Medical Systems, Waukesha, WI, USA). The low-dose computed tomography acquisition was performed first, without injection of iodinated contrast agent, followed by PET acquisition using 3 min per bed position (Siemens Biograph mCT) or 5 min per bed position (GE Discovery LS). The following acquisition constraints of the PERCIST framework were respected: similar activity between each PET scan (±20%), standardization against normal liver, and a similar delay between injection and acquisition (50-70 min after injection).

Image Analysis Using Semi-Quantitative Metrics
Six PET-based metrics were derived, for a maximum of five tumor targets (maximum of two targets per organ) as recommended by PERCIST (3): SUVmax, SUVpeak, SUVmean, metabolic volume (MV), total lesion glycolysis (TLG = SUVmean × MV), and standardized added metabolic activity (SAM) (22). SAM was proposed to overcome the partial volume effect. The segmentation approach proposed by Schaefer was used for computing SUVmean, MV, and TLG (23).
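As an illustration of how the volume-based metrics relate to each other, the following minimal Python sketch computes SUVmax, SUVmean, MV, and TLG for one lesion. It assumes that the lesion segmentation is already available as a boolean mask; the Schaefer segmentation itself, SUVpeak, and SAM are not implemented here. The function name `lesion_metrics` and its parameters are hypothetical and are not taken from the software used in the study.

```python
import numpy as np

def lesion_metrics(suv, mask, voxel_volume_ml):
    """Return SUVmax, SUVmean, MV (mL) and TLG for one segmented lesion.

    suv: array of SUV values.
    mask: boolean array of the same shape (segmentation, assumed given).
    voxel_volume_ml: volume of a single voxel in mL.
    """
    values = suv[mask]
    if values.size == 0:
        raise ValueError("empty lesion mask")
    suv_max = float(values.max())
    suv_mean = float(values.mean())
    mv = float(values.size * voxel_volume_ml)  # metabolic volume in mL
    tlg = suv_mean * mv                        # TLG = SUVmean x MV
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MV": mv, "TLG": tlg}

# Toy 2x2 "image" with three voxels inside the lesion (illustrative values).
suv = np.array([[1.0, 2.0], [4.0, 6.0]])
mask = np.array([[False, True], [True, True]])
m = lesion_metrics(suv, mask, voxel_volume_ml=0.5)
```

Because TLG is the product of SUVmean and MV, any delineation error in the mask propagates to both MV and TLG.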
The gold standard was defined by clinical assessment, and biological and conventional imaging by CT and MRI, performed 3 weeks after the PET evaluation. RECIST 1.1 (2) was used in these assessments. Each evolution was classified as either a responder or non-responder according to the gold standard.
A "responder" as assessed by PET was defined as a metric decrease greater than the threshold, while a "non-responder" was defined as a decrease of less than the threshold or an increase in the metrics. The four different types of response were true positive (TP), responder according to PET and the gold standard; true negative (TN), non-responder according to PET and the gold standard; false negative (FN), non-responder according to PET but responder according to the gold standard; and false positive (FP), responder according to PET but non-responder according to the gold standard.
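The decision rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the change is expressed as a percentage decrease relative to the first scan, and "responder" is treated as the positive class, as in the definitions of TP, TN, FN, and FP above.

```python
def percent_decrease(baseline, follow_up):
    """Relative decrease of a metric between two scans, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - follow_up) / baseline

def pet_responder(baseline, follow_up, threshold_pct):
    """Responder if the metric decreased by more than the threshold."""
    return percent_decrease(baseline, follow_up) > threshold_pct

def outcome(pet_is_responder, gold_is_responder):
    """Label one evolution as TP, TN, FP or FN (responder = positive)."""
    if pet_is_responder:
        return "TP" if gold_is_responder else "FP"
    return "FN" if gold_is_responder else "TN"

# Illustrative values only: SUVmax falls from 10 to 6 (40% decrease).
label = outcome(pet_responder(10.0, 6.0, threshold_pct=30.0), True)
```

An increase of the metric gives a negative decrease, so it always falls below a positive threshold and is classified as non-responder.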
Image Analysis Using Parametric Imaging (SULTAN)
SULTAN is a parametric approach that compares two or more PET scans acquired before and during therapy (20,21). In the context of this study, pairs of PET volumes acquired for the same patient were considered.
This new approach involves a rigid registration between the two PET scans, followed by a factor analysis as briefly described in the following sections.

Registration of PET Volumes
To compare two PET images at the voxel level, the scans first need to be registered so that a given voxel corresponds to the same volume element in each of the two scans. The method used was described in Ref. (24). Briefly, the CT volumes were used to determine the appropriate transformation for aligning the PET images, as they include far more anatomical detail for guiding registration than the PET images. The two CT volumes of interest (VOIs) were registered using a rigid transformation (three translation and three rotation parameters) derived from block-matching registration (19) as implemented in the Planet Onco software (Dosisoft). A local rigid transformation was assumed, as only the region including the mass was actually registered. The transformation mapping the second CT volume onto the first CT volume was then applied to the second PET scan so as to align it with the first PET scan, assuming that the PET and CT of a given scan were perfectly registered.
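To make the six-parameter rigid transformation concrete, the sketch below builds a 4x4 homogeneous matrix from three translations and three rotation angles and applies it to a set of coordinates. It does not reproduce the block-matching estimation or the Planet Onco implementation: the parameters are assumed to be already estimated from the CT volumes, and the rotation order (Rz·Ry·Rx) and the function names are assumptions of this sketch.

```python
import numpy as np

def rigid_matrix(tx, ty, tz, rx, ry, rz):
    """4x4 homogeneous matrix from 3 translations and 3 rotations (radians).

    The rotation order Rz @ Ry @ Rx is an assumption of this sketch.
    """
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    matrix = np.eye(4)
    matrix[:3, :3] = rot_z @ rot_y @ rot_x
    matrix[:3, 3] = [tx, ty, tz]
    return matrix

def transform_points(matrix, points):
    """Apply the rigid transform to an (N, 3) array of coordinates."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ matrix.T)[:, :3]

# The same matrix, estimated on CT2 -> CT1, would be applied to PET2.
ct_to_ct = rigid_matrix(1.0, 2.0, 3.0, 0.0, 0.0, np.pi / 2)
moved = transform_points(ct_to_ct, [[1.0, 0.0, 0.0]])
```

Reusing the CT-derived matrix for the PET volume is only valid under the assumption stated in the text, namely that the PET and CT of a given scan are perfectly registered.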

Calculation of Parametric Image of Significant Tumor Changes
The two registered PET scans, denoted PET1 and PET2, were analyzed using a factor analysis of dynamic sequences (FADS) approach (25) as implemented in the software Pixies [Apteryx, 2004]. The algorithm assumes that the two-component vector S(v, t) measured in each voxel (one value for the first scan and one value for the second scan) is a weighted sum of K basis functions. Here, the number K is constrained by the number of PET scans and is therefore equal to 2. Let S(v, t) be the signal recorded at voxel v at time t (t = 1, 2). Then,

$$S(v, t) = I_b(v)\, C_b(t) + I_e(v)\, C_e(t) + e(v, t) \qquad (1)$$

where Cb(t) and Ce(t) are two basis kinetics, Ib is the spatial distribution of the voxel component following the Cb time course, Ie is the spatial distribution of the voxel component following the Ce time course, and e(v, t) is an additive error term. Factor analysis estimates the two functions Cb(t) and Ce(t), called factors, and their associated images Ib(v) and Ie(v), called factor images. Equation 1 is solved using a principal component analysis followed by an oblique rotation under a constant function constraint representing the constant voxels of the background (Cb) and without any other constraint on Ib(v) and Ie(v). The algorithm iteratively estimates the two factors, Cb and Ce, and the associated factor images, Ib and Ie (25). Therefore, the voxels that evolved between the two scans followed the Ce factor.
A new image (the SULTAN image) is then created in which each voxel v is equal to Ie(v) if |Ie(v)| > 1 and to 0 otherwise. Hence, each non-zero voxel reflects its evolution over time, either following the Ce factor (Ie > 0) or in the opposite direction to Ce (Ie < 0).
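The thresholding step that produces the SULTAN image can be sketched in a few lines of Python, assuming that the factor image Ie has already been estimated and is available as a NumPy array. The function name `sultan_map` is hypothetical, and the default cutoff of 1 is the value given in the text.

```python
import numpy as np

def sultan_map(ie, cutoff=1.0):
    """Keep Ie(v) where |Ie(v)| > cutoff and set all other voxels to 0."""
    ie = np.asarray(ie, dtype=float)
    return np.where(np.abs(ie) > cutoff, ie, 0.0)

# Illustrative factor-image values for four voxels.
example = sultan_map([0.5, -2.0, 1.0, 3.0])
```

The comparison is strict, so a voxel with |Ie(v)| exactly equal to the cutoff is set to 0.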
Finally, each lesion was classified as responder (main factor decreasing with Ie > 0 or main factor increasing with Ie < 0) or non-responder (main factor increasing with Ie > 0 or main factor decreasing with Ie < 0). A patient was considered a responder if all lesions were responders and a non-responder otherwise. The results were then classified as TP, TN, FP, and FN by comparison with the gold standard.
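The classification rule above can be written as a short Python sketch. Several points are assumptions made for illustration: each lesion is summarized by a single dominant sign of Ie (how this sign is extracted from the map is not specified here), the Ce factor is represented by its values at the two scans, and an unchanged factor is treated as non-response. The function names are hypothetical.

```python
def lesion_is_responder(ce_first, ce_second, ie_sign):
    """Classify one lesion from the Ce factor trend and the sign of Ie.

    ce_first, ce_second: values of the Ce factor at scans 1 and 2.
    ie_sign: +1 or -1, dominant sign of Ie in the lesion (assumed given).
    """
    if ie_sign not in (1, -1):
        raise ValueError("ie_sign must be +1 or -1")
    if ce_second == ce_first:
        return False  # no change: treated as non-responder (assumption)
    decreasing = ce_second < ce_first
    # Responder: decreasing factor with Ie > 0, or increasing with Ie < 0.
    return decreasing == (ie_sign > 0)

def patient_is_responder(lesion_flags):
    """A patient is a responder only if every lesion is a responder."""
    flags = list(lesion_flags)
    if not flags:
        raise ValueError("at least one lesion is required")
    return all(flags)
```

A single non-responding lesion is therefore enough to classify the whole patient as non-responder.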

Statistical Analysis
The study was performed using a per-lesion and a per-patient analysis. For each analysis, the metrics were compared using the area under the curve (AUC) determined with receiver operating characteristic (ROC) analysis.
The optimal thresholds were derived using the Youden criterion [max (sensitivity + specificity − 1)] through the ROC analysis for the per-lesion and per-patient studies.
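A minimal sketch of the Youden criterion is given below, assuming that each evolution is described by its percentage decrease and by its gold-standard status. The candidate thresholds are taken to be the observed values and a case is called responder when its decrease is strictly greater than the threshold; both choices, as well as the function name, are assumptions of this sketch and may differ from the statistical software used in the study.

```python
import numpy as np

def youden_threshold(decreases, gold_responder):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    x = np.asarray(decreases, dtype=float)
    y = np.asarray(gold_responder, dtype=bool)
    if y.all() or not y.any():
        raise ValueError("both responders and non-responders are required")
    best_t, best_j = None, -np.inf
    for t in np.unique(x):  # candidate thresholds, in ascending order
        pred = x > t        # predicted responders at this threshold
        sens = np.sum(pred & y) / np.sum(y)
        spec = np.sum(~pred & ~y) / np.sum(~y)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j

# Illustrative data only (not taken from the study).
decreases = [60.0, 50.0, 45.0, 20.0, 10.0, -5.0]
gold = [True, True, True, False, False, False]
threshold, j_index = youden_threshold(decreases, gold)
```

When several thresholds give the same J, this sketch keeps the lowest one; other tie-breaking rules are equally defensible.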
The per-lesion analysis was performed using the percentage change of each metabolic metric, interpreted with the optimal threshold of that metric. Each lesion was then compared with the gold standard.
The per-patient analysis was performed using the PERCIST criteria (the percentage change of each metabolic metric for the most intense lesion in each PET between two scans). The percentage change was interpreted as responder or non-responder using the previously optimized thresholds and also using the PERCIST thresholds (30% for each metric, except 45% for TLG). Each pair of PET scans was then compared with the gold standard.
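The per-patient rule can be sketched as follows. The sketch assumes that the "most intense lesion" of a scan is the one with the highest value of the metric under consideration, and that the hottest lesion may differ between the two scans; these interpretations, the requirement that both scans contain at least one lesion, and the names used are assumptions made for illustration. The thresholds are the PERCIST values quoted in the text.

```python
# PERCIST thresholds quoted in the text: 30% for each metric, 45% for TLG.
PERCIST_THRESHOLD_PCT = {
    "SUVmax": 30.0, "SUVpeak": 30.0, "SUVmean": 30.0,
    "SAM": 30.0, "MV": 30.0, "TLG": 45.0,
}

def per_patient_response(metric, scan1_values, scan2_values,
                         thresholds=PERCIST_THRESHOLD_PCT):
    """Compare the most intense lesion of each scan for one metric.

    scan1_values, scan2_values: metric values of the target lesions
    in the first and second scan (both assumed non-empty).
    Returns (percentage decrease, responder flag).
    """
    if not scan1_values or not scan2_values:
        raise ValueError("each scan needs at least one lesion value")
    reference = max(scan1_values)  # hottest lesion on scan 1
    current = max(scan2_values)    # hottest lesion on scan 2 (may differ)
    if reference <= 0:
        raise ValueError("reference value must be positive")
    decrease = 100.0 * (reference - current) / reference
    return decrease, decrease > thresholds[metric]

# Illustrative values only.
decrease, responder = per_patient_response("SUVmax", [8.0, 5.0], [3.0, 4.0])
```

Passing a different `thresholds` mapping allows the same function to be used with the optimal thresholds derived from the ROC analysis.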
The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were then calculated for each index.
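For reference, these indices follow directly from the four outcome counts. The sketch below uses the standard definitions, with "responder" as the positive class; the function name and the example counts are illustrative and are not study data.

```python
def diagnostic_performance(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV, NPV and accuracy from 2x2 counts."""
    def ratio(num, den):
        # An undefined ratio (empty denominator) is reported as NaN.
        return num / den if den else float("nan")
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Illustrative counts only.
perf = diagnostic_performance(tp=8, tn=9, fp=1, fn=2)
```

Sensitivity and specificity depend only on the test, whereas PPV and NPV also depend on the proportion of responders in the sample.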
Pearson's chi-squared analysis with a type I error of 0.05 and 1 degree of freedom was performed to determine significant associations between the different quantitative metrics and the gold standard.
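A minimal sketch of Pearson's chi-squared test for a 2x2 table is shown below, using only the Python standard library. It assumes no continuity correction (the text does not state whether one was applied) and uses the closed-form p-value for 1 degree of freedom; the function name and the example counts are illustrative.

```python
import math

def chi2_2x2(tp, fp, fn, tn):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 dof).

    No continuity correction is applied (assumption of this sketch).
    """
    n = tp + fp + fn + tn
    row1, row2 = tp + fp, fn + tn
    col1, col2 = tp + fn, fp + tn
    if 0 in (row1, row2, col1, col2):
        raise ValueError("all row and column totals must be non-zero")
    chi2 = n * (tp * tn - fp * fn) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Illustrative counts only.
chi2, p_value = chi2_2x2(tp=8, fp=1, fn=2, tn=9)
significant = p_value < 0.05  # type I error of 0.05, as in the text
```

With small expected counts, Fisher's exact test would be a common alternative; it is not part of the analysis described here.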
We obtained informed consent from all patients allowing the use of their clinical data for research purposes under a protocol approved in our institution.

Results

Per-Lesion Analysis Using Quantitative Metrics
A total of 123 lesions and 210 pairs of lesion evolutions, followed on two to nine scans, were analyzed, with 111 considered as responders and 99 as non-responders according to the gold standard. Figure 1 shows the results of the ROC study for the six metrics. The AUC values (Table 1) ranged from 0.55 for MV to 0.96 for SUVmax. The AUC intercomparison study distinguished three significantly distinct groups: SUVmax/SUVpeak/SUVmean, SAM/TLG, and MV (Figure 2).

Per-Patient Analysis Using Quantitative Metrics
A total of 79 pairs of PET scans were analyzed using the PERCIST criteria (the most intense lesion in each PET between two scans) with 36 responders and 43 non-responders.
The AUC (Figure 3; Table 3) ranged from 0.61 for MV to 0.95 for SUVpeak. The AUCs of SUVpeak, SUVmax, SUVmean, TLG, and SAM were significantly different from that of MV (p < 0.05) but not from each other (Figure 4).
The percentage change of each metabolic metric was also interpreted as responder or non-responder according to the choice of the threshold (PERCIST or optimal) and then compared with the gold standard.
After applying these optimized thresholds, the four metrics (SUVpeak, SUVmax, SUVmean, and TLG) were correlated with the gold standard (Table 5). Threshold optimization did not change the specificity of SUVmax (98 vs. 95%). The sensitivity using SUVpeak was slightly improved (72 vs. 67%) with a similar NPV (81 vs. 78%). The sensitivity, NPV, and accuracy of TLG were improved (53 vs. 36%, 72 vs. 65%, and 78 vs. 71%, respectively). Figure 5 highlights the benefit of using quantitative PET-derived metrics for a patient with bone metastases. CT images failed to correctly classify the therapeutic response, with the persistence of an osteo-condensation even though there was a primary tumor response, thus highlighting the fact that bone lesions cannot be evaluated using RECIST 1.1.

Per-Lesion and Per-Patient Analysis Using SULTAN
For the per-lesion analysis, results obtained with SULTAN (longitudinal monitoring in tomography using factor analysis) were compared with those obtained using SUVmax, SUVpeak, and SUVmean. No significant difference was found between the assessment of therapeutic response by the gold standard and SULTAN (p < 0.05).
For the per-patient PET analysis, SULTAN was compared with SUVmax, SUVpeak, and SUVmean, which appeared to be the only metrics significantly correlated with the gold standard. SULTAN presented no significant difference with SUVmax, SUVpeak, and SUVmean (Table 6). Figures 6 and 7 show an example of a responder and a non-responder patient using SULTAN.

Discussion
Considering the limitations of morphological criteria and the subjectivity of visual analysis of metabolic imaging in the field of therapeutic evaluation, the use of quantitative PET-based metrics has gained interest in recent years (3, 7, 11-17, 26, 27). Depending on the disease studied, various metrics and thresholds have been established. In breast cancer, the majority of studies evaluating therapeutic response by metabolic metrics have been conducted in a neo-adjuvant setting, with histological confirmation, the true gold standard. In the adjuvant setting, the overall therapeutic response is usually assessed using morphological and metabolic imaging, and biological and clinical exams (7). The choice of a preferred biomarker differs between the neo-adjuvant and adjuvant settings. In the neo-adjuvant setting, with curative intent, the NPV is the most relevant statistic for the early detection of non-responders before a change of therapy. In the adjuvant setting for metastatic patients, a false-negative PET may lead to an unwarranted treatment change, whereas a false-positive PET may lead to reduced survival. In this situation, choosing the best sensitivity-specificity pair may be considered an acceptable compromise.
It has been reported that a decrease of SUVmax or SUVmean after one or two cycles of chemotherapy was significantly correlated with a successful therapeutic response in the neo-adjuvant setting (7, 28-32). The optimal thresholds reported in these studies for discriminating responders and non-responders in per-patient analysis varied from 26 to 58%. These differences can be partly explained by the lack of consensus on the definition of responder and non-responder status (decrease in tumor mass >50% by histology or residual microscopic lesions), the population heterogeneity between studies (presence of hormone receptors, HER2 amplification, etc.), the time of PET completion (one, two, or three cycles of chemotherapy), and the criteria used to determine the best threshold.
However, only a few studies have used PET scans for evaluating the treatment response in the context of adjuvant therapy. Couturier et al. (15) showed that a decrease of SUVmax or SUVmean was predictive of therapeutic response after three cycles of chemotherapy using the same gold standard considered in our study. They speculated that response assessment using metabolic metrics appeared to be superior to visual analysis. The SUV decrease ranged from 52 to 56% for responders and 16 to 26% for non-responders. Dose Schwarz et al. (17) found that a SUVmax reduction of 72 ± 21% after one cycle and 54 ± 16% after two cycles of chemotherapy was predictive of response to treatment. Furthermore, Specht et al. (16) and Tateishi et al. (18) concluded that a decrease of SUVmean, and to a lesser extent of TLG for bone metastases, was predictive of the duration of response to treatment. In the study of Tateishi (18), a SUVmean decrease ≥8.5% was a factor significantly related to the duration of response, whereas the TLG decrease was not. Huyge et al. (33) highlighted the significant heterogeneity of the metabolic response within the same patient when considering the types of metastases (bone or visceral). Using the change of SUVmax, according to the EORTC criteria, they highlighted a poorer therapeutic response for bone lesions. Finally, Quon and Gambhir (34) have warned that the "paradoxical metabolic flare," which corresponds to an increase of SUV in the first 10 days after commencement of hormone therapy, may be misconstrued as a sign of an early metabolic reaction.
In our study, SUVmax, SUVpeak, and SUVmean were the most efficient metrics in the per-lesion and per-patient analyses. These observations are consistent with previously published results, which suggest the use of SUVmax (EORTC) or SULpeak (PERCIST). The SUVmax measurement is susceptible to noise because it is determined from a single voxel (35). The use of SUVpeak may overcome this limitation and has been recommended as a more robust alternative because its fixed volume of 1 cm3 makes it less susceptible to noise than SUVmax. However, several definitions of SUVpeak are found in the literature, differing in the shape, size, and location of the ROIpeak (36). As outlined in the Introduction, many requirements imposed by the PERCIST criteria may be considered too restrictive and difficult to apply in routine clinical situations. This is why we evaluated a "PERCIST-like" method with SUV normalization against the body mass of the patient (SUVpeak) rather than the lean body mass normalization (SULpeak) recommended by PERCIST. The small size of the majority of measured lesions in our study, less than 2 cm, leads to a calculation of SUVpeak heavily weighted toward SUVmax, thus explaining the high similarity of the results of the two indices.
The SUVmean index gave results similar to SUVmax and SUVpeak for the per-patient analysis, also explained by the small size of the measured lesions.
The SAM index was less efficient in our study and did not demonstrate benefit in our population. This index corresponds to the total excess SUV above the tumor background, reducing the impact of partial volume effect and lesion segmentation errors. Yet, Mertens et al. (22) reported good results with no significant difference with SUVmax in patients with colorectal cancer with progression to liver metastasis. The optimal threshold for differentiating responders and non-responders was set at 94.5 vs. 25.3% for the SUVmax, which is different from our results (54 vs. 36%).
Additionally, we showed that MV and TLG failed to correctly classify patients. In this respect, MV performance was variable: the approach to this calculation differs among centers, with the use of gradients, thresholds, or adaptive methods. In the neo-adjuvant therapy evaluation of breast cancer by 18F-FDG, Hatt et al. (37) found that TLG or MV, determined by a fuzzy locally adapted Bayesian algorithm, were better predictors than SUVmax, but the lesions they considered were larger than in our study. In our study, an adaptive method based on that described by Schaefer (23) was used, but it failed to correctly delineate the lesion when the signal-to-noise ratio was poor, explaining the poor performance of volume-based metrics.
Parametric imaging was found to be relevant in assessing the therapeutic response in breast cancer, with similar performance to SUVmax or SUVpeak. SULTAN has already been successfully assessed in patients with colorectal cancer and non-small cell lung carcinoma (20). SULTAN appears to be a valuable visual tool in routine clinical practice because of the otherwise tedious nature of measuring numerous lesions. Furthermore, using a single series of images, SULTAN provides a summary of all tumor evolutions from various scans without arbitrary threshold adjustment.
Conclusion
Even if our study has limitations (heterogeneous population with patients in either first-line or advanced treatment, with varied histological and phenotypic characteristics and different treatments), the results underline the importance of the choice of metrics for PET evaluation. SUVmax, SUVpeak, and to a lesser extent SUVmean appeared to be the most relevant metrics. In addition, parametric analysis using the SULTAN approach is a reliable tool to guide visual interpretation. The poor performance of volumetric metrics underlines the need for developing and validating a robust delineation method that could be applied to small lesions with a poor signal-to-noise ratio. In the future, a comparison of metrics could be conducted in a prospective study performed in a homogeneous population.

Author Contributions
DG: data measure and paper writing; HN: statistical analysis and parametric imaging; BH: statistical analysis; CR: patient recruitment; and TC and FK-B: study conception and paper correction.

Acknowledgments
This work has been supported in part by grants from the French National Agency for Research called "Investissements d'Avenir" Labex IRON no. ANR-11-LABX-0018-01 and Equipex ArronaxPlus no. ANR-11-EQPX-0004. We also would like to acknowledge the meaningful contribution from Dr. Irène Buvat for giving very useful comments.

Figure caption: The first PET showed right axillary lymph node hypermetabolism, and the second PET, performed during therapeutic monitoring (review 2), showed the disappearance of the right axillary hypermetabolism. The factor image obtained by SULTAN was superimposed on CT scan 1 (B). The associated curves (C) represent voxels with a downward trend (green) or stable uptake (blue). The changes described by factor analysis were similar to those of SUVmax (D), with a loss of the right axillary uptake.