Radiomic Features From Diffusion-Weighted MRI of Retroperitoneal Soft-Tissue Sarcomas Are Repeatable and Exhibit Change After Radiotherapy

Background: Size-based assessments are inaccurate indicators of tumor response in soft-tissue sarcoma (STS), motivating the requirement for new response imaging biomarkers for this rare and heterogeneous disease. In this study, we assess the test–retest repeatability of radiomic features from MR diffusion-weighted imaging (DWI) and derived maps of apparent diffusion coefficient (ADC) in retroperitoneal STS and compare baseline repeatability with changes in radiomic features following radiotherapy (RT).

Materials and Methods: Thirty patients with retroperitoneal STS received an MR examination prior to treatment, of whom 23/30 were investigated in our repeatability analysis having received repeat baseline examinations and 14/30 patients were investigated in our post-treatment analysis having received an MR examination after completing pre-operative RT. One hundred and seven radiomic features were extracted from the full manually delineated tumor region using PyRadiomics. Test–retest repeatability was assessed using an intraclass correlation coefficient (baseline ICC), and post-radiotherapy variance analysis (post-RT-IMS) was used to compare the change in radiomic feature value to baseline repeatability.

Results: For the ADC maps and DWI images, 101 and 102 features demonstrated good baseline repeatability (baseline ICC > 0.85), respectively. Forty-three and 2 features demonstrated both good baseline repeatability and a high post-RT-IMS (>0.85), respectively. Pearson correlation between the baseline ICC and post-RT-IMS was weak (0.432 and 0.133, respectively).

Conclusions: The ADC-based radiomic analysis shows better test–retest repeatability compared with features derived from DWI images in STS, and some of these features are sensitive to post-treatment change. However, good repeatability at baseline does not imply sensitivity to post-treatment change.

INTRODUCTION
Soft-tissue sarcomas (STS) are rare tumors of the connective tissues and account for 1% of all cancers (1). While the radiological assessment of STS typically includes size-based criteria, such as those defined by the Response Evaluation Criteria in Solid Tumors guidelines (RECIST 1.1) (2), intralesion heterogeneity is commonly seen in the clinic, both in tumor appearance and treatment response (3). Furthermore, several studies have reported that changes in tumor size have a poor correlation with histopathological tumor response (4)(5)(6)(7)(8). This has led to guidelines being published by The European Organisation for Research and Treatment of Cancer (EORTC) Soft Tissue and Bone Sarcoma Group (9), where it is recommended that size and volume measurements should not be used to reflect histopathological response following treatment (except for myxoid liposarcomas). There is therefore an urgent need to develop robust clinical imaging biomarkers (IBs) that i) better reflect histopathological change and ii) capture intralesion heterogeneity before, during, and after treatment.
Quantitative diffusion-weighted imaging (DWI) is showing increased utility for monitoring response in STS (10). Measurements of the apparent diffusion coefficient (ADC) calculated from DWI have demonstrated an inverse correlation with tissue cellularity (11) and thus could act as a surrogate IB for the early assessment of radiotherapy treatment response (12). A key advantage of such quantitative techniques is that the derived maps are representative of tumor biology and thus may offer deeper insights into the heterogeneous patterns of tumor response. In a previous cohort study of patients with retroperitoneal STS, median ADC after radiotherapy demonstrated a significant increase compared to baseline, and 4/14 patients showed an increase in median ADC outside the 95% repeatability limits of agreement (10). However, the assessment of total tumor ADC failed to capture the spatial heterogeneity within these lesions, obscuring the interpretation of changes following treatment (13).
Radiomic analysis extracts a set of mathematical features describing the relationships and patterns between pixels that quantify image characteristics such as texture, intensity, and shape. Radiomic features are thought to reflect the heterogeneity of underlying biological features within the tumor such as necrosis, vascularity, and histological variation (14,15). In a recent study investigating ADC-based radiomic features in STS by Corino et al., differences were observed in radiomic feature values between intermediate- and high-grade lesions (16). Lee et al. observed that ADC-based radiomic features may quantify tumor heterogeneity, although they did not find an improvement in diagnosing benign and malignant STS compared to ADC alone (17).
Recently, changes between pre- and post-treatment radiomic features (delta-radiomics) have been associated with tumor response; Gao et al. demonstrated that ADC-based delta-radiomics improved response prediction in STS following preoperative radiotherapy treatment using a support vector machine (SVM) model (18). While these data are encouraging, they are limited without evaluation of the repeatability of ADC-based radiomic features, as outlined in recent consensus recommendations for the clinical translation of response IBs (19). To the best of our knowledge, there exists no study assessing the test-retest repeatability of ADC-based radiomics in STS.
This study has two main aims. Firstly, we aim to assess the test-retest stability of radiomic features derived from ADC measurements (quantitative imaging) and compare this with the corresponding features derived from low b-value DW images (qualitative imaging) in a cohort of patients with retroperitoneal STS. Secondly, as baseline repeatability is not necessarily related to sensitivity to response (20), we introduce a novel metric that compares the baseline repeatability of radiomic features with their ability to demonstrate change following treatment. We then use this methodology to identify a set of radiomic features that are both highly repeatable and sensitive to post-treatment change for use in prospective clinical STS studies.

MATERIALS AND METHODS
This study was reviewed and approved by the Royal Marsden Hospital Committee for Clinical Research and received approval from a national Research Ethics Committee (East of England-Cambridge East Research Ethics Committee).

Patient Population, Imaging, and Radiotherapy Schedule
Thirty patients with retroperitoneal STS received MR examinations on a 1.5-T MR scanner before treatment (MAGNETOM Aera, Siemens Healthcare, Erlangen, Germany). Imaging included axial DWI with b-values of 50, 600, and 900 s/mm². Full details of the study, patient protocol, and imaging protocol have been reported previously by Winfield et al. (10). Twenty-seven of 30 patients were repositioned and then received a repeat baseline DWI acquisition in the same scan session. Four of these patients were excluded from the repeatability analysis due to a change in image acquisition parameters during the second baseline acquisition [for two patients, the imaging field of view (FoV) was reduced for the second scan for patient comfort, and two patients had a change in image intensity between repeat scans]. Fourteen of 30 patients were treated with radiotherapy; these patients received at least one baseline MR scan and another after completing radiotherapy treatment, prior to surgery. Figure 1 shows the study organization of patients included in each of the repeatability and delta-radiomics sections of this analysis. Supplementary Material A presents the breakdown of STS subtypes studied and whether they received a second baseline and/or post-radiotherapy scan. For those treated with radiotherapy, 28 daily fractions were administered over 5.5 weeks delivering a median dose of 50.4 Gy.

Image Processing and Radiomic Feature Extraction
Regions-of-interest (ROIs) were delineated on every slice in which the tumor appeared on axial T2-weighted images, using in-house software, by a soft-tissue sarcoma radiologist (CM) with over 10 years of experience, and were then transferred to all imaging series.
ADC maps were created using a least-squares monoexponential fit (21). ROIs were transferred onto the calculated ADC maps and subsequently converted into binary mask segmentations. To allow for direct comparison between quantitative and non-quantitative imaging, the ROIs were also transferred onto the b = 50 s/mm² diffusion-weighted images (hereafter referred to as b50); b50 was chosen as it had the highest signal-to-noise ratio compared with the b = 600 and 900 s/mm² images. As some patients required different imaging FoVs to the standard protocol, images and segmentations were resampled to have matching voxel sizes (2.375 × 2.375 × 5.0 mm) across all patients (ADC and b50 images were resampled using linear interpolation, and masks were resampled using nearest neighbor interpolation) and stacked to create 3D volumes (22,23). To generate additional image sets, histogram equalization, which spreads out pixel intensity levels resulting in heightened image contrast and texture, was applied to the ADC maps and b50 images within the delineated tumor regions. The histogram-equalized images were scaled by 300 to match the gray level range of the ADC images [units of 10⁻⁵ mm² s⁻¹ were used to match previously published work (24)]. The original b50 images were not rescaled as the gray level range was already within the same order of magnitude.
Radiomic feature extraction was performed on the four different sets of images: i) ADC maps, ii) b50 images, iii) histogram-equalized ADC maps, and iv) histogram-equalized b50 images. The open-source package PyRadiomics (25) (v3.0.1) was used to extract radiomic features from all four image contrasts in 3D: 18 first-order, 75 second-order (glcm, gldm, glrlm, glszm, ngtdm), and 14 shape features. The following settings were used across all image contrasts: bin width = 10 and force2D = True (due to the anisotropic voxel dimensions). No wavelet or other filtering operations were performed.
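To illustrate the ADC calculation mentioned above, the following is a minimal sketch of a monoexponential fit, S(b) = S0 exp(-b ADC), solved by linear least squares on the log-transformed signal. The log-linear formulation is an assumption made here for simplicity (the text states only that a least-squares monoexponential fit was used), and the function name `fit_adc` is hypothetical. The sketch returns ADC in mm²/s; rescaling to units of 10⁻⁵ mm² s⁻¹ would be a separate step.

```python
import numpy as np


def fit_adc(signals, b_values):
    """Log-linear least-squares fit of S(b) = S0 * exp(-b * ADC).

    signals:  array of shape (..., n_b) with strictly positive intensities.
    b_values: array of shape (n_b,) in s/mm^2.
    Returns (adc, s0), each with shape signals.shape[:-1]; adc is in mm^2/s.
    """
    b = np.asarray(b_values, dtype=float)
    s = np.asarray(signals, dtype=float)
    # One column per voxel: ln S = ln S0 - b * ADC
    log_s = np.log(s).reshape(-1, b.size).T
    # Design matrix columns correspond to the unknowns (ADC, ln S0)
    design = np.column_stack([-b, np.ones_like(b)])
    coef, *_ = np.linalg.lstsq(design, log_s, rcond=None)
    adc = coef[0].reshape(s.shape[:-1])
    s0 = np.exp(coef[1]).reshape(s.shape[:-1])
    return adc, s0


if __name__ == "__main__":
    b = np.array([50.0, 600.0, 900.0])       # b-values used in this study
    voxel = 1000.0 * np.exp(-b * 1.2e-3)      # synthetic, noise-free signal
    adc, s0 = fit_adc(voxel, b)
    print(float(adc), float(s0))
```

With noisy data, a weighted or nonlinear fit may behave differently at high b-values, so this should be read as an illustration rather than a reproduction of the fitting used in the study.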

Calculation/Theory
We denote $x_{ikl}$ as the $i$-th radiomic feature, for the $k$-th patient and $l$-th baseline measurement. From the natural logarithm of these values, $y_{ikl} = \ln(x_{ikl})$, the following repeatability statistics were derived.
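The baseline within-subject standard deviation for $N$ patients is defined as $s_w^{\mathrm{bs}} = \sqrt{\frac{1}{2N}\sum_{k=1}^{N}\left(y_{ik1} - y_{ik2}\right)^2}$. The between-subject mean squares (BMS) and within-subject mean squares (WMS) are calculated, respectively, as $\mathrm{BMS} = \frac{2}{N-1}\sum_{k=1}^{N}\left(\bar{y}_{ik} - \bar{y}_{i}\right)^2$ and $\mathrm{WMS} = \frac{1}{N}\sum_{k=1}^{N}\sum_{l=1}^{2}\left(y_{ikl} - \bar{y}_{ik}\right)^2$, where $\bar{y}_{ik}$ is the mean of the two baseline measurements for patient $k$ and $\bar{y}_{i}$ is the mean over all patients and measurements. Two of the radiomic features (Skewness and glcmClusterShade) returned both positive and negative values and thus could not be analyzed using the natural logarithm of their values. For these two features, the analysis was performed on the raw data, and the limits of agreement (LoA) were calculated from the untransformed values.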
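As an illustration of the repeatability statistics for a single feature, the sketch below computes the within-subject standard deviation, BMS, WMS, and an ICC from two baseline measurements per patient. It assumes the standard one-way ANOVA definitions for two repeats and the one-way random-effects form ICC = (BMS - WMS)/(BMS + WMS); the exact ICC variant used in the study is not stated in this section, and the function name `baseline_repeatability` is hypothetical.

```python
import numpy as np


def baseline_repeatability(x):
    """Repeatability statistics for one feature.

    x: array of shape (N, 2) holding strictly positive feature values for
       N patients and two baseline measurements.
    Returns (s_w, bms, wms, icc), computed on the log-transformed values.
    """
    y = np.log(np.asarray(x, dtype=float))
    n = y.shape[0]
    diff = y[:, 0] - y[:, 1]
    # Within-subject standard deviation for paired measurements
    s_w = np.sqrt(np.sum(diff ** 2) / (2 * n))
    subject_mean = y.mean(axis=1)
    grand_mean = y.mean()
    # One-way ANOVA mean squares with two repeats per subject
    bms = 2.0 * np.sum((subject_mean - grand_mean) ** 2) / (n - 1)
    wms = np.sum((y - subject_mean[:, None]) ** 2) / n
    # Assumed ICC form for two measurements (one-way random effects)
    icc = (bms - wms) / (bms + wms)
    return s_w, bms, wms, icc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_value = np.exp(rng.normal(3.0, 0.5, size=23))   # 23 patients, as in the repeatability cohort
    noise = rng.normal(0.0, 0.05, size=(23, 2))
    measurements = true_value[:, None] * np.exp(noise)
    print(baseline_repeatability(measurements))
```

For features that take both positive and negative values (such as Skewness and glcmClusterShade), the log transform in this sketch would not apply and the raw values would need to be used instead.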

Statistical Analysis
For all image contrasts, the number of features that satisfied two different criteria was identified:
(i) Good repeatability: the radiomic feature had a baseline ICC greater than 0.85 (28).
(ii) Substantial change after treatment: the radiomic feature had a post-RT-IMS greater than 0.85.
To determine whether the baseline ICC is indicative of the post-RT-IMS, the Pearson correlation coefficient between both measurements across all features (PCC) was calculated for each image contrast.
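The counting and correlation steps described above can be sketched as follows, assuming the baseline ICC and post-RT-IMS have already been computed for each feature and stored in two equal-length arrays. The function name `summarize_criteria` and the example values are hypothetical and are not data from this study.

```python
import numpy as np


def summarize_criteria(icc, ims, threshold=0.85):
    """Count features meeting criteria (i) and (ii) and correlate the two metrics.

    icc, ims: 1D arrays with one baseline ICC and one post-RT-IMS per feature.
    Returns a dict with the three counts and the Pearson correlation (PCC).
    """
    icc = np.asarray(icc, dtype=float)
    ims = np.asarray(ims, dtype=float)
    repeatable = icc > threshold        # criterion (i)
    changed = ims > threshold           # criterion (ii)
    return {
        "n_repeatable": int(repeatable.sum()),
        "n_changed": int(changed.sum()),
        "n_both": int((repeatable & changed).sum()),
        "pcc": float(np.corrcoef(icc, ims)[0, 1]),
    }


if __name__ == "__main__":
    # Illustrative values only
    example_icc = [0.90, 0.95, 0.80, 0.99]
    example_ims = [0.90, 0.50, 0.90, 0.86]
    print(summarize_criteria(example_icc, example_ims))
```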

Post-RT Fractional Changes
To quantify post-RT changes in radiomic feature $i$ for patient $k$, we define the fractional change as $\Delta_{ik} = \frac{x_{ik3} - x_{ik1}}{x_{ik1}}$.

Independent Subset
The values of the radiomic features from the first baseline scan were used to form a matrix of Pearson product-moment correlation coefficients, $r_{ij} = \frac{C_{ij}}{\sqrt{C_{ii} C_{jj}}}$, where $C_{ij}$ is the covariance between features $i$ and $j$. Hierarchical agglomerative clustering was performed to obtain cluster groups of strongly correlated features (hereafter referred to as correlation groups) using seaborn (v0.9.0) and scipy (v1.3.1) (29-31) with the following settings: distance metric = $1 - r_{ij}^2$, cluster method = average, and cluster distance cutoff = 0.5.
We assume features from different correlation groups to be independent. The correlation groups for the non-histogram-equalized ADC-based radiomic features are in Supplementary Material C.
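The grouping of correlated features can be sketched with scipy alone, as below. This is an illustration under stated assumptions rather than the pipeline used in the study: it calls `scipy.cluster.hierarchy` directly instead of going through seaborn, it assumes no feature is constant across patients (which would give undefined correlations), and the function name `correlation_groups` is hypothetical. The distance (1 - r²), average linkage, and 0.5 cutoff follow the settings listed above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def correlation_groups(features, cutoff=0.5):
    """Group features whose pairwise correlation is strong.

    features: array of shape (n_patients, n_features) from one baseline scan.
    Returns an integer group label for each feature.
    """
    r = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    dist = 1.0 - r ** 2                       # distance metric 1 - r^2
    # Remove floating-point asymmetry and tiny negative values
    dist = np.clip((dist + dist.T) / 2.0, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=cutoff, criterion="distance")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=50)
    b = 2.0 * a + 0.01 * rng.normal(size=50)   # almost perfectly correlated with a
    c = rng.normal(size=50)                     # unrelated to a and b
    print(correlation_groups(np.column_stack([a, b, c])))
```

Because the distance uses r², strongly negatively correlated features fall into the same group as strongly positively correlated ones.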
To explore a small subset of the features in more detail, a subset of independent features that demonstrated change post-treatment (the independent delta-radiomics subset) was identified by selecting, within each correlation group, the feature with the highest post-RT-IMS that satisfied criteria (i) and (ii).

RESULTS
For non-histogram-equalized ADC maps, 101/107 features demonstrated a high baseline ICC, and 43 features demonstrated both a high baseline ICC and a high post-RT-IMS (Table 1). The correlation between baseline ICC and post-RT-IMS was 0.432 (Figure 2), reflecting the limited overlap between the features that satisfied criterion (i) and those that satisfied criterion (ii). Agglomerative clustering revealed 18 groups of pairwise correlated features, indicating a maximum of 18 independent feature groups. Eight of these correlation groups included at least one feature demonstrating both a high baseline ICC and a high post-RT-IMS (the ADC-independent delta-radiomics subset), indicating a maximum of eight independent features that demonstrate change post-treatment.
For non-histogram-equalized b50 images, 102/107 features demonstrated a high baseline ICC, while only three demonstrated a high post-RT-IMS. The number of features that satisfied both criteria was further reduced to two. The correlation between baseline ICC and post-RT-IMS was 0.133. Agglomerative clustering revealed 13 correlation groups; two of these correlation groups included one feature with both a high baseline ICC and a high post-RT-IMS (the b50 independent delta-radiomics subset), indicating a maximum of two independent features that demonstrate change post-treatment.
Histogram equalization had little effect on the number of features that demonstrated high baseline ICC (103 and 102 for the ADC maps and b50 images, respectively). However, histogram equalization reduced the number of features that satisfy both criteria (i) and (ii) for the ADC images to 14 (removing nearly all second-order features except glcmMaximumProbability) but increased the number for the b50 images to 11. The correlation between baseline ICC and post-RT-IMS for histogram-equalized ADC maps and b50 images was 0.288 and 0.247, respectively (Figure 2). Agglomerative clustering revealed 17 and 18 correlation groups for the histogram-equalized ADC maps and b50 images, respectively, of which six and five groups included at least one feature that satisfied both criteria. The baseline ICC, post-RT-IMS, $s_w^{\mathrm{bs}}$, $s_b^{\mathrm{bs}}$, and $s_w^{\mathrm{rt}}$ are shown for the independent delta-radiomics subset for each image contrast in Supplementary Material D.
As the non-histogram-equalized ADC maps returned the highest number of features that satisfied both criteria, these features are explored in more detail. Bland-Altman plots are shown for the ADC-independent delta-radiomics subset in Figure 3. The vertical axis shows the difference between the radiomic feature values across measurement 1 and measurement 2, and the horizontal axis shows the mean value between the two measurements. The plots show no clear evidence of bias and suggest that these features show good repeatability. The fractional changes in these ADCderived radiomic features after radiotherapy are presented in Figure 4. Some of the radiomic features demonstrated similar treatment changes in all patients; 90percentile and TotalEnergy tended to increase after radiotherapy and glcmJointEnergy tended to decrease, while other features demonstrated a more even distribution over increase and decrease.

DISCUSSION
Nearly all radiomic features demonstrated good test-retest repeatability across both ADC maps and b50 images [criterion (i)]; however, the number of features that demonstrated change post-treatment was markedly lower [criterion (ii)]. Our repeatability results are in line with a previous finding by Bologna et al., who identified 59/69 ADC-based radiomic features in STS as stable to geometrical transformations of the ROI (32). Our test-retest study uniquely assesses the stability of features in the context of a repeat baseline investigation, which includes additional clinical sources of variability such as patient positioning (33). The high repeatability found in our study may be due in part to the use of data from a single scanner at a single site and to the fact that sarcomas tend to be large in volume and immobile compared to tumors at many other sites. Peerlings et al. found that 25%-29% of ADC-based radiomic features presented test-retest stability in other cancers across a variety of tissues, MR systems, and vendors for a standardized protocol (34).
ADC maps and histogram-equalized b50 images returned a higher number of features that demonstrate change post-treatment compared to the original low b-value images. MR signal is relative and can suffer from inhomogeneities which may affect radiomics analysis (33). We demonstrate that quantification via ADC fitting and/or histogram equalization can increase the number of features that demonstrate change post-treatment relative to the original b50 images.
Although showing good repeatability at baseline, none of the shape features demonstrated a high post-RT-IMS, suggesting that they do not change greatly after treatment in retroperitoneal sarcomas. This is consistent with the findings of previous studies and the recommendation by the EORTC (9). When comparing baseline repeatability alone, our analysis showed that ADC maps returned a high number of stable features [criterion (i)], which noticeably dropped when the condition of an expected significant change after treatment [criterion (ii)] was included. A similar drop in the number of stable features derived from b50 images was also observed when comparing features that satisfy both criteria, highlighting the important finding that good baseline repeatability is not necessarily indicative of sensitivity to post-treatment change. These results are further supported by the finding that the correlations between baseline ICC and post-RT-IMS in our patient population are low (Figure 2). Similarly, Gudmundsson et al. demonstrated that high stability does not necessarily imply predictive power in classification models (20).
Although 107 features were calculated, only a maximum of 18 linearly independent groups were identified for the features, suggesting that many of the radiomic features are highly correlated, consistent with the results found in the literature (15,35). For the features that demonstrate change post-treatment, only a maximum of eight linearly independent features were found.
When forming an independent delta-radiomics subset, we chose to explore the features with the highest post-RT-IMS from each correlation group. In this set, several of the features (90percentile, TotalEnergy, and glcmJointEnergy) show similar changes post-treatment (Figure 4), which could be showing a treatment effect. Other features demonstrate both increases and decreases for different patients, and this may be representative of the heterogeneous response typical of these tumor types or may be due to histological differences. Although many of the larger changes following treatment are shown by liposarcoma, synovial sarcoma, and pleomorphic sarcoma n.o.s., the sample size is too small to draw correlations with histology.
There are limitations to our study. Radiomic features have been shown to have varying stabilities with different image preprocessing and different settings used for feature extraction, and sensitivity to intrascanner variation (15,33,35-39). We kept a fixed bin width throughout our analysis and did not use any further image pre-processing; it is possible that applying different pre-processing techniques will identify different radiomic features for treatment response evaluation that are stable and sensitive. Furthermore, our study utilized data from a single center with a single rater, and future multicenter studies could further elucidate the reproducibility of radiomic features at different imaging centers and investigate the effect of rater intraobserver variability.
In conclusion, our data suggest that although nearly all DWI-based radiomic features demonstrate good baseline test-retest repeatability in STS, only a subset of features demonstrate significant change after radiotherapy. By introducing a new measure of radiomic feature stability (the post-RT-IMS), we show that good baseline repeatability does not necessarily imply a good ability to measure change post-treatment. Furthermore, we identify a range of ADC-based radiomic features that demonstrate change post-treatment, encouraging further investigation into their suitability as response markers.

DATA AVAILABILITY STATEMENT
The data from the present study are available in the ICR's XNAT repository. Access requests will be granted depending on appropriate regulatory and institutional approvals upon contacting the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Royal Marsden Hospital Committee for Clinical Research and received approval from a national Research Ethics Committee (East of England-Cambridge East Research Ethics Committee). The patients/participants provided their written informed consent to participate in this study.