Repeatability of Cardiac Magnetic Resonance Radiomics: A Multi-Centre Multi-Vendor Test-Retest Study

Aims: To evaluate the repeatability of cardiac magnetic resonance (CMR) radiomics features on test-retest scanning using a multi-centre multi-vendor dataset with a varied case-mix. Methods and Results: The sample included 54 test-retest studies from the VOLUMES resource (thevolumesresource.com). Images were segmented according to a pre-defined protocol to select three regions of interest (ROI) in end-diastole and end-systole: right ventricle, left ventricle (LV), and LV myocardium. We extracted radiomics shape features from all three ROIs and, additionally, first-order and texture features from the LV myocardium. Overall, 280 features were derived per study. For each feature, we calculated intra-class correlation coefficient (ICC), within-subject coefficient of variation, and mean relative difference. We ranked robustness of features according to mean ICC stratified by feature category, ROI, and cardiac phase, demonstrating a wide range of repeatability. There were features with good and excellent repeatability (ICC ≥ 0.75) within all feature categories and ROIs. A high proportion of first-order and texture features had excellent repeatability (ICC ≥ 0.90); however, these categories also contained the features with the poorest repeatability (ICC < 0.50). Conclusion: CMR radiomic features have a wide range of repeatability. This paper is intended as a reference for future researchers to guide selection of the most robust features for clinical CMR radiomics models. Further work in larger and richer datasets is needed to better define the technical performance and clinical utility of CMR radiomics.


INTRODUCTION
Radiomics is an image analysis technique whereby a large number of advanced quantitative features are extracted from voxel level data of routine-care medical images (1). Radiomics data are structured in a minable format and can be used to develop models which link image features with biological phenotypes. The over-arching aim of radiomics analysis is to develop models for faster and more accurate disease diagnosis and risk prediction.
Radiomics features comprise (1) shape and (2) signal intensity-based features (Graphical abstract). Shape features include geometric quantifiers of the rendered volume, such as total volume, surface area, and descriptors of overall shape, such as sphericity, elongation, and compactness. Intensity-based radiomics features describe the global distribution (first-order features) and pattern (texture features) of voxel signal intensities. First-order features describe the distribution of signal intensities of individual voxels, without consideration of spatial relationships. They are derived from histogram-based methods and summarize the intensity levels in the defined region of interest (ROI) into single quantifiers such as mean, median, maximum, randomness (entropy), skewness (asymmetry), and kurtosis (flatness). Texture features are statistical descriptors of the relationships between neighboring voxels of similar (or different) signal intensities. They are calculated using various matrix analysis methods according to standardized mathematical definitions.
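As an illustration of the first-order quantifiers named above, the following minimal sketch computes a handful of them from a flat list of voxel intensities. It is not the extraction software used in this study: the function name `first_order_features`, the histogram bin width, and the use of population moments for skewness and kurtosis are assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, median


def first_order_features(intensities, bin_width=25):
    """Summarise ROI voxel intensities, ignoring their spatial arrangement."""
    n = len(intensities)
    mu = mean(intensities)
    # Central moments of the intensity distribution (population form).
    m2 = sum((x - mu) ** 2 for x in intensities) / n
    m3 = sum((x - mu) ** 3 for x in intensities) / n
    m4 = sum((x - mu) ** 4 for x in intensities) / n
    # Histogram with fixed-width bins; bin_width is an illustrative choice.
    counts = Counter(math.floor(x / bin_width) for x in intensities)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "mean": mu,
        "median": median(intensities),
        "maximum": max(intensities),
        "entropy": entropy,  # randomness of the binned intensities
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,  # asymmetry
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,  # peakedness/flatness
    }


# Hypothetical toy ROI of four voxels.
features = first_order_features([10, 20, 30, 40])
```

Because every quantity is computed from the pooled intensities alone, shuffling the voxels within the ROI leaves all of these values unchanged; this is the property that separates first-order from texture features.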
The clinical utility of radiomics models for diagnosis, surveillance, and prognostication has been repeatedly demonstrated within the context of oncology (2)(3)(4)(5)(6)(7). Application of radiomics analysis to cardiac magnetic resonance (CMR) images is in its early developmental stages (1). Proof-of-concept studies have demonstrated incremental value of CMR radiomics models in distinguishing important disease entities such as hypertensive heart disease and hypertrophic cardiomyopathy (8), identification of myocardial infarction from non-contrast images (9)(10)(11), and prediction of life-threatening arrhythmias (12). Thus, CMR radiomics features may have potential as important novel quantitative imaging biomarkers (QIBs).
Translation of CMR radiomics to clinical practice requires external validity of proposed models. A key determinant of model performance in clinical and pre-clinical settings is repeatability, that is, the ability to repeatedly measure the same feature under identical or near-identical conditions on the same measurement unit (subject/phantom). CMR radiomics features are subject to technical (image acquisition, artifact, image processing) and population-related variations. However, their repeatability performance has not been adequately assessed in existing work. Such analysis is an essential step in assessing the clinical utility of this methodology, both for the underpinning research and the eventual clinical implementation.
We present, to the best of our knowledge, the first evaluation of the repeatability of CMR radiomics features on test-retest scanning using a multi-centre multi-vendor dataset with a varied case-mix. This paper is intended as a reference for future researchers to guide selection of the most robust features for inclusion in CMR radiomics models.
The design, terminology, and statistical methods reflect recommendations from the Quantitative Imaging Biomarker Alliance (QIBA) (13,14). QIBA is a group of the Radiological Society of North America established to guide standardization of the development and validation of QIBs. Reporting of methods is in line with relevant aspects of the Radiomics Quality Score (RQS) (15). The RQS provides guidance to improve quality and transparency of reporting in radiomics studies.

Setting and Study Population
We analyzed a subset of studies from the VOLUMES resource (16), comprising test-retest studies from five centres across the United Kingdom (Barts Heart Centre, University Hospitals Bristol, Leeds Teaching Hospitals, University College London Hospital, University Hospitals Birmingham NHS Trusts). The sample included a varied mix of disease and healthy cases. Exclusion criteria included age < 18 years, implantable cardiac devices, significant arrhythmia, claustrophobia, and poor breath-holding. Further information about the resource, acquisition protocols, and study population is detailed in a dedicated publication and online resource (16, 17).

FIGURE 1 | Definition of the LV/RV blood pool and the LV myocardium for radiomics analysis. From left to right: 2D short axis mid-ventricular slice; segmentation of the three regions of interest shown overlaid on the image: LV myocardium (blue), LV blood pool (light blue), and RV blood pool (green); 3D reconstructions of the segmented ROIs. Please note that radiomics analysis was performed in 3D; 2D slices are provided for visualization purposes only. CMR: cardiac magnetic resonance; LV: left ventricle; ROI: region of interest; RV: right ventricle.

Scanning Protocol
Two vendors (Philips, Siemens), three models (Achieva, Avanto, Aera), and two magnet strengths (1.5 Tesla, 3 Tesla) were used. Scanning protocols across all contributing centres were in accordance with international recommendations (18). Complete short axis stacks covering the left and right ventricles (LV, RV) were acquired using balanced steady state free precession sequences. Details of acquisition parameters are summarized in Supplementary Table 1. Test-retest studies were performed under repeatability conditions with the same patient, location, scanner, acquisition protocol, and operating conditions. The time interval between test and retest was between 0 and 7 days. Given this very short test-retest interval, it is highly unlikely that any change in radiomics features could be due to alterations in the underlying cardiovascular health. Individuals having both scans on the same day were repositioned prior to retest with repeat isocentre positioning.

Image Segmentation
Image segmentation was performed blind to details of image acquisition, patient information, diagnosis, or scan pairings. LV endocardial and epicardial and RV endocardial contours were drawn in end-diastole and end-systole on short-axis stack images to select three ROIs for radiomics analysis: RV blood pool, LV blood pool, and LV myocardium. The blood pool ROIs reflect LV and RV cavities in end-diastole and end-systole. Segmentation was performed according to a pre-defined standard operating procedure (SOP) (19). Papillary muscles were considered part of the LV blood pool; the basal LV slice was included if there was >50% myocardium circumferentially, and for the RV, volumes below the pulmonary valve were included with position judged by review of cine images and orthogonal cuts. Contours were drawn using a machine learning approach with expert edits using Circle® cardiovascular imaging version 5.11.0 (Circle cardiovascular imaging Inc., Calgary, Canada). Initial checks and adjustments were made by Z.R.E., trainee cardiologist with 2-years' experience in CMR and dedicated training in the SOP, and cross-checked by S.E.P., consultant cardiologist with over 15-years' experience with CMR.

Radiomics Feature Extraction
Radiomics feature extraction was performed blind to details of image acquisition, patient information, diagnosis, or scan pairings. Contours from the image segmentation were used to create 3D image masks for the three ROIs in end-diastole and end-systole (Figure 1). To this end, voxels belonging to the three ROIs were marked as foreground voxels using a unique label per ROI, whilst all other voxels were defined as background. In-house software implemented in Python was used to convert the contours into binary masks. In brief, the image contour was parsed into an XML file containing the coordinates of all contour points. Subsequently, a polygon was built by joining the points in the coordinate space to form the mask. Lastly, the area bounded by the contour in every slice was filled with ones using the OpenCV function fillPoly, resulting in the binary ROI. The process was repeated for all delineated contours. The image masks and the corresponding CMR DICOM® (Digital Imaging and Communications in Medicine) images were converted to NIfTI (Neuroimaging Informatics Technology Initiative) format for subsequent processing. Radiomics features were extracted from the 3D CMR images and the corresponding 3D mask (i.e., the full 3D […] although currently considered deprecated, were largely used in the past. Overall, 16 shape, 19 first-order, and 73 texture features were available; we applied all feature categories to the LV myocardium, and shape features only to the LV and RV blood pool ROIs. For gray value discretisation, we used a fixed bin width of 25 intensity values. The texture features were extracted using five different matrices: gray-level co-occurrence matrix (GLCM, 23 features), gray-level run-length matrix (GLRLM, 16 features), gray-level size-zone matrix (GLSZM, 15 features), neighboring gray tone difference matrix (NGTDM, 5 features), and gray-level dependence matrix (GLDM, 14 features).
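The contour-to-mask step described above can be sketched as follows. This is not the in-house software: to keep the example dependency-free, a pure-Python scanline fill is swapped in for OpenCV's fillPoly, and the function name `polygon_to_mask` and the toy contour are hypothetical. Unlike fillPoly, this stand-in labels a pixel as foreground only when its centre lies inside the polygon, so boundary pixels may be treated differently.

```python
def polygon_to_mask(points, height, width):
    """Rasterise one closed contour into a binary mask (1 inside, 0 outside).

    points: list of (x, y) contour coordinates in pixel units.
    A pixel is foreground when its centre (col + 0.5, row + 0.5)
    falls inside the polygon (even-odd rule).
    """
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for row in range(height):
        yc = row + 0.5
        # x-coordinates where polygon edges cross this scanline.
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
            if (y0 <= yc) != (y1 <= yc):  # edge straddles the scanline
                t = (yc - y0) / (y1 - y0)
                crossings.append(x0 + t * (x1 - x0))
        crossings.sort()
        # Fill between successive pairs of crossings.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask


# Hypothetical square contour on a 6 x 6 slice.
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = polygon_to_mask(square, height=6, width=6)
foreground_pixels = sum(map(sum, mask))
```

Repeating this per slice and per contour, and writing a distinct integer instead of 1 for each ROI, would give a labelled 3D mask of the kind described in the text.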
In total, 280 features across the three ROIs, two phases, and three radiomics categories (shape, first-order, texture) were calculated per study.
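The total of 280 can be checked directly from the counts given above: shape features are computed for three ROIs, intensity-based features for the LV myocardium only, and everything is computed in two cardiac phases. The variable names below are introduced for the example only.

```python
# Features available per category, as stated in the text.
features_per_category = {"shape": 16, "first_order": 19, "texture": 73}

# Texture features per matrix type, as stated in the text.
texture_per_matrix = {"GLCM": 23, "GLRLM": 16, "GLSZM": 15, "NGTDM": 5, "GLDM": 14}

# Number of ROIs each category was applied to:
# shape -> RV blood pool, LV blood pool, LV myocardium; others -> LV myocardium only.
rois_per_category = {"shape": 3, "first_order": 1, "texture": 1}

phases = 2  # end-diastole and end-systole

per_category_total = {
    name: count * rois_per_category[name] * phases
    for name, count in features_per_category.items()
}
total_features = sum(per_category_total.values())
```

This gives 96 shape, 38 first-order, and 146 texture features per study, which are the denominators used when reporting proportions in the results.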

Statistical Analysis
We considered intra-class correlation coefficient (ICC) as a valid aggregate summary of repeatability performance in this setting. For calculation of ICC, we used a one-way random effects model for absolute agreement based on a single measure; as the two time points (test, retest) can be considered interchangeable, the one-way model is valid and appropriate for our analysis (20). For each radiomics feature, we calculated the ICC and corresponding 95% confidence interval using the variance components from a one-way ANOVA (analysis of variance). We assigned descriptive terms to ICC values in line with published guidance on ICC interpretation (20): <0.5 poor, 0.5-0.75 moderate, 0.75-0.9 good, ≥0.9 excellent. We ranked robustness of features according to the mean ICC stratified by feature category, ROI, and cardiac phase. In addition, for each feature, we report within-subject variability expressed through within-subject coefficient of variation (CV) and mean relative difference. We present Bland-Altman plots for a selection of exemplar features from different levels of repeatability.
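The repeatability statistics described above can be sketched for a single feature measured at test and retest. This is a minimal illustration, not the analysis code used in the study. The ICC follows the standard one-way random effects, single-measure form computed from the ANOVA mean squares. The exact formulas used for the within-subject CV and the mean relative difference are not given in the text, so the root-mean-square CV and the absolute difference relative to the pair mean used here are assumptions, as are all function names. Confidence intervals are omitted.

```python
import math
from statistics import mean, stdev


def icc_one_way(test, retest):
    """ICC(1,1): one-way random effects, single measure, k = 2 repeats."""
    n, k = len(test), 2
    subject_means = [(a + b) / 2 for a, b in zip(test, retest)]
    grand_mean = mean(subject_means)
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(test, retest, subject_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def icc_category(icc):
    """Descriptive terms as used in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"


def within_subject_cv(test, retest):
    """Root-mean-square within-subject CV in percent (assumed definition).

    Assumes non-zero pair means.
    """
    ratios = [((a - b) ** 2 / 2) / ((a + b) / 2) ** 2 for a, b in zip(test, retest)]
    return 100 * math.sqrt(mean(ratios))


def mean_relative_difference(test, retest):
    """Mean absolute difference relative to the pair mean, in percent (assumed)."""
    return 100 * mean(abs(a - b) / ((a + b) / 2) for a, b in zip(test, retest))


def bland_altman_limits(test, retest):
    """Bias and 95% limits of agreement for the test-retest differences."""
    diffs = [a - b for a, b in zip(test, retest)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical feature values for four subjects.
test_values = [1.0, 2.0, 3.0, 4.0]
retest_values = [2.0, 1.0, 4.0, 3.0]
icc = icc_one_way(test_values, retest_values)
label = icc_category(icc)
```

For the toy data the ICC is 13/19 (about 0.68), which falls in the "moderate" band. The one-way form treats test and retest as interchangeable, so swapping the two columns for any subject leaves the result unchanged.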

Population Characteristics
The sample included 54 paired test-retest CMR scans of 40 men and 14 women with mean (standard deviation) age of 51.9 (±16.8) years. Nine subjects were healthy volunteers. The remainder had a range of ischaemic and non-ischaemic cardiovascular conditions (Table 1). The majority of scans were performed on 1.5 Tesla Siemens scanners (Aera, Avanto). Three cases were performed on 3 Tesla Philips Achieva scanners. The interval between test and retest was no more than 7 days and for the majority, both scans were performed on the same day (85%, n = 46).

Repeatability of Conventional CMR Indices
We first studied the repeatability of conventional CMR indices to assess possible loss of robustness associated with the segmentation process. We calculated ICC, CV, and mean relative difference for LV end-diastolic volume, LV end-systolic volume, LV ejection fraction, LV mass, RV end-diastolic volume, RV end-systolic volume, and RV ejection fraction (Supplementary Table 1). There was excellent repeatability for LV end-diastolic volume (ICC 0.97, 95% CI 0.96-0.99), LV end-systolic volume (ICC 0.96, 95% CI 0.93-0.98), and LV mass (ICC 0.95, 95% CI 0.91-0.97). As expected, repeatability of the RV indices was slightly lower than that of the LV. Thus, we confirmed good quality contouring, with repeatability of conventional CMR indices overall exceeding that of previous reports (19).

Repeatability of LV Blood Pool Shape Features
Repeatability of LV blood pool shape features varied from moderate to excellent with mean ICC ranging from 0.511 to 0.974 [median (IQR): 0.871 (0.175)] (Figure 2). Overall, there was better repeatability in end-systole than in end-diastole (Figure 3A). The most robust features were "volume" in both end-systole and end-diastole, "least axis length" in end-diastole, and "surface area" in end-systole. In both end-diastole and end-systole, the least robust features were "spherical disproportion," "sphericity," "compactness," and "compactness 2."

Repeatability of RV Blood Pool Shape Features
Repeatability of RV blood pool shape features varied from moderate to excellent with mean ICC ranging from 0.556 to 0.941 (Figure 4). Overall, there was better repeatability in end-diastole than in end-systole (Figure 3B). The most robust RV shape features were "volume" in end-diastole, "minor axis length" in end-systole, and "surface area" in both phases. As for the LV blood pool, "spherical disproportion," "sphericity," "compactness 2," and "compactness" had the poorest repeatability across both cardiac phases.

Repeatability of LV Myocardium Shape Features
[…] (Figure 5). As with the LV blood pool shape features, there was better repeatability of myocardial shape features in end-systole than in end-diastole (Figure 3C). The most robust features in both end-diastole and end-systole were "minor axis length," "least axis length," "surface area," and "volume." The least robust features were "flatness" and "maximum 3D diameter" in both cardiac phases.

Shape Feature Trends Across Regions of Interest
Across all three regions of interest and the two phases, "volume" and "surface area" followed by measures of the heart short axis, i.e., "least axis length" and "minor axis length," showed the highest average repeatability (Supplementary Figure 1). The correlated sphericity-measuring features, i.e., "spherical disproportion," "sphericity," "compactness 1," and "compactness 2," produced the lowest average repeatability and the greatest variance in repeatability across all regions (Supplementary Figure 1).

Repeatability of LV Myocardium First-Order Features
Repeatability of LV myocardium first-order features varied from poor to excellent with mean ICC ranging from 0.333 to 0.964 [median (IQR): 0.932 (0.140)] (Table 5, Supplementary Table 5, Figure 6). The proportion of features demonstrating excellent repeatability (28/38, 74%) was substantially higher than that seen for the shape features. This was alongside a small number (4/38, 11%) of particularly poorly performing features. Overall, repeatability was high in both end-diastole and end-systole, with marginally better overall performance in the former (Figure 7A). For both cardiac phases, the best performing first-order features were "entropy," "percentile 90," "root mean squared," "median," and "mean." The following features had the worst performance in both end-diastole and end-systole: "kurtosis," "minimum," "skewness," and "variance."

Repeatability of LV Myocardium Texture Features
Repeatability of LV myocardium texture features varied from poor to excellent with mean ICC ranging from −0.[…] repeatability (125/146, 86%). A small minority of features had poor repeatability (7/146, 4.8%). There was slightly better repeatability in end-diastole than in end-systole (Figure 7B). We present the ten best and worst performing texture features and their corresponding ICCs in end-diastole (Table 6) and end-systole (Supplementary Table 8). Across both end-diastole and end-systole, "cluster shade" and "cluster prominence" were poorly performing features. In end-systole, "strength," "inverse difference normalized," and "inverse difference moment normalized" also demonstrated poor repeatability. We also evaluated differences in the repeatability of features by texture class, i.e., GLCM, GLRLM, GLSZM, NGTDM, and GLDM (Supplementary Figure 2). The most striking difference between texture classes was the variation in the range of ICC values. The GLCM class had the widest ICC range, with very low ICC values calculated for some of the features in this class. Indeed, six of the seven texture features with the poorest repeatability belong to the GLCM class. Broadly, however, all texture classes had similar mean repeatability: GLRLM had a significantly greater average repeatability than NGTDM, but no other pair of classes showed a significant difference in mean ICC.
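To make the two poorly performing GLCM features concrete, the sketch below builds a co-occurrence matrix for a tiny 2D image and evaluates "cluster shade" and "cluster prominence" using their usual definitions as the third and fourth moments of (i + j) about its mean. It is a simplified illustration under stated assumptions: a single pixel offset, a symmetric matrix, and a 2D image, whereas the study extracted features in 3D. The function names are introduced for the example.

```python
from collections import Counter


def glcm_probabilities(image, offset=(0, 1)):
    """Symmetric, normalised GLCM of a 2D gray-level image for one pixel offset."""
    d_row, d_col = offset
    counts = Counter()
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            rr, cc = r + d_row, c + d_col
            if 0 <= rr < len(image) and 0 <= cc < len(image[rr]):
                neighbour = image[rr][cc]
                counts[(value, neighbour)] += 1
                counts[(neighbour, value)] += 1  # count the pair in both directions
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def cluster_moment(p, order):
    """Sum of (i + j - mu_x - mu_y)**order * p(i, j) over the GLCM.

    order=3 gives cluster shade, order=4 gives cluster prominence.
    """
    mu_x = sum(i * prob for (i, _), prob in p.items())
    mu_y = sum(j * prob for (_, j), prob in p.items())
    return sum((i + j - mu_x - mu_y) ** order * prob for (i, j), prob in p.items())


# Hypothetical one-row image: mostly gray level 1 with a single bright pixel.
p = glcm_probabilities([[1, 1, 1, 3]])
cluster_shade = cluster_moment(p, order=3)
cluster_prominence = cluster_moment(p, order=4)
```

Because the deviations are raised to the third and fourth power, a single bright pixel dominates both sums, which is one plausible reason why these features respond strongly to small differences between test and retest.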

Summary of Findings
In this heterogeneous case mix of test-retest studies, we demonstrated wide variation in the repeatability of CMR radiomics features by ROI, feature category, and cardiac phase. There were features with good and excellent repeatability within all feature categories and ROIs. The signal intensity-based features (first-order, texture) demonstrated the greatest variation in repeatability, comprising a large proportion of highly reproducible features alongside the features with the poorest repeatability. We present details of repeatability performance for a comprehensive range of radiomics features, which is intended to guide selection of the most robust features for clinical modeling by future researchers. Therefore, this work is an important step in characterizing the technical performance of CMR radiomics and enhancing future efforts to evaluate its clinical utility.

Comparison With Existing Literature
There have been recent efforts to define the repeatability of radiomics features relating to oncological imaging with test-retest studies (21-23) and using phantom (24), image translation (25), and image perturbation (26) experiments. These studies demonstrate variation in feature repeatability and emphasize the need to actively seek and select robust features for modeling purposes. However, these findings have limited transferability to CMR radiomics, due to the modalities studied (mostly CT) and because the ROIs selected for oncological tumor analysis are not comparable to those typically selected for CMR analysis. Nevertheless, our findings of variation in repeatability by feature category (first-order > shape > textural) are in close agreement with previous work regarding cancer radiomics. Jang et al. (27) present the only other study to consider repeatability of CMR radiomics. They examined LV texture features only (rather than texture, first-order, and shape features as in our analysis) in 51 patients with clinical indication for CMR, scanned twice in the same session with a 3 Tesla Siemens scanner. A subset of the study participants had abnormal CMR findings ("normal" n = 14, non-ischaemic cardiomyopathy n = 16, ischaemic cardiomyopathy n = 5, hypertrophic cardiomyopathy n = 2, other n = 14). The authors report variation in repeatability between classes of texture features and, similar to our findings, demonstrate that only a subset has high repeatability. Overall, when comparing equivalent measures of intra-observer variability for LV texture features, we had better repeatability indices than those reported by Jang et al. (27). This may reflect differences in contouring SOP between the two approaches. Our contouring methodology is designed to avoid blood pool or pericardial fat in myocardial contours, as inclusion of these in analysis can strongly distort texture feature values; it is not clear whether this was a key part of the SOP used by Jang et al. (27).
Whilst we include both 1.5 and 3 Tesla scanners in the sample, the majority of our cases were scanned with a 1.5 Tesla scanner. 3 Tesla sequences are more prone to artifacts, especially dark/bright lines across images, and this too may have contributed to the poorer repeatability observed by Jang et al. (27). Studies in larger samples are warranted to further explore potential explanations for these differences and to perform subgroup analyses.
Our study is the first to report repeatability of LV and RV CMR radiomics shape features. Radiomics shape features are calculated from 3D image masks derived from image contours; as such, their repeatability is a direct reflection of segmentation robustness. For instance, we demonstrate better repeatability of features quantifying the heart short axis, e.g., "least axis length," "minor axis length," and "maximal 2D diameter," than those quantifying the long axis, e.g., "major axis length" and "maximum 3D diameter." The reduced repeatability of features along the cardiac long axis likely reflects segmentation robustness, which tends to suffer more at the apex and base of the heart than in the middle slices. This is consistent with our observation of low repeatability of all features quantifying ventricular sphericity.
Signal intensity-based features (first-order, texture) applied to the LV myocardium reflect both segmentation and signal intensities within the defined ROI. These features are therefore sensitive to variations in image acquisition which affect intensity levels within the whole image. Furthermore, there is potential to introduce extreme outlier values in the segmentation process. For instance, an LV endocardial contour that is not perfectly apposed to the endocardium would introduce a series of high value voxels from the blood pool into what will be defined as "myocardium" for radiomics analysis (Supplementary Figure 3). Our findings support these theoretical suppositions. The most reproducible first-order features within the LV myocardium ("entropy," "root mean squared," "median," "mean") are measures of the average voxel SI levels, whilst the least reproducible first-order features ("kurtosis," "minimum," "skewness," "variance") are measures of their spread. Consistent with this, the least reproducible texture features, "cluster shade" and "cluster prominence," also represent measures of skewness. These measures of spread are, of course, more susceptible to small variations in extreme signal intensity values. Notably, repeatability of conventional CMR indices in our study exceeded that of published reports. In particular, LV mass, the conventional metric most dependent on definition of the LV myocardium, had excellent repeatability with ICC of 0.95 (0.91, 0.97). Therefore, as would be expected, radiomics features have, in general, much higher sensitivity to small variations in segmentation, which appear inconsequential to conventional metrics. Texture radiomics are affected not only by segmentation but are additionally sensitive to image acquisition settings and pre-processing.
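The sensitivity of spread-based measures to a few misplaced voxels can be illustrated with synthetic numbers. In the sketch below, a hypothetical "myocardium" of 90 voxels with intensities distributed symmetrically around 100 is contaminated with three bright "blood pool" voxels at 300. All values are invented for the illustration and are not taken from the study data.

```python
from statistics import mean, median


def central_moment(values, order):
    mu = mean(values)
    return sum((v - mu) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# 90 synthetic myocardial voxels: intensities 80, 85, ..., 120, ten voxels each.
clean = [v for v in range(80, 121, 5) for _ in range(10)]
# The same ROI with three bright voxels leaking in from the blood pool.
contaminated = clean + [300] * 3

summary = {
    name: (fn(clean), fn(contaminated))
    for name, fn in [("median", median), ("mean", mean),
                     ("skewness", skewness), ("kurtosis", kurtosis)]
}
```

In this toy case the median is unchanged and the mean shifts by under 7%, whereas skewness moves from zero to above 4 and kurtosis increases more than tenfold. This mirrors the pattern in our results, where measures of central intensity were far more repeatable than measures of spread.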
Variation in image signal intensities due to technical factors (scanner specifications, sequence acquisition parameters) may be reduced through pre-processing intensity normalization techniques, which may improve the repeatability of signal intensity-based radiomics by "smoothing" variations in intensity levels.

Study Limitations and Directions for Future Research
This study presents an important first step in evaluating the technical performance of CMR radiomics first-order, texture, and shape features. The present dataset does not permit consideration of the wide range of technical and population-related factors that may modify the repeatability performance of radiomics features. Studies considering the impact of factors such as scanner vendor/model, magnet strength, acquisition parameters, and disease are warranted. To guide building of radiomics models that would truly translate to clinical practice, we should consider robustness of features not only under repeatability conditions, but also under reproducibility conditions, where real-life variations in scanner, operator, and image acquisition are not strictly controlled. Furthermore, different technical approaches to feature extraction and image normalization may improve robustness of radiomics features, in particular for intensity-based features. For example, different approaches to gray level discretisation have been shown to affect feature robustness (28), and future research on optimizing bin width or bin number may improve radiomics robustness. Lastly, we have focused on radiomics computed on original (untransformed) images. Whilst this covers the vast majority of features in common use, there are additional features that are beyond the scope of this study, such as features extracted from mathematical transformations of the original images. There is also a need for study of normalization techniques which may improve repeatability performance of radiomics features; this is a broad topic with a large number of normalization options (e.g., histogram matching, generative adversarial networks) that should be considered systematically in dedicated studies.
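The two discretisation strategies mentioned above, fixed bin width and fixed bin number, can be contrasted in a short sketch. The formulas follow one common convention for each approach and are an assumption, not a description of the software used in this study; the function names and the toy intensities are hypothetical.

```python
import math


def discretise_fixed_bin_width(intensities, bin_width=25):
    """Assign gray levels using bins of constant width (first bin = 1)."""
    lowest = min(intensities)
    return [
        math.floor(x / bin_width) - math.floor(lowest / bin_width) + 1
        for x in intensities
    ]


def discretise_fixed_bin_number(intensities, n_bins=32):
    """Assign gray levels by splitting the ROI intensity range into n_bins bins."""
    lowest, highest = min(intensities), max(intensities)
    if highest == lowest:
        return [1] * len(intensities)
    span = highest - lowest
    # The maximum intensity is clamped into the last bin.
    return [
        min(math.floor(n_bins * (x - lowest) / span) + 1, n_bins)
        for x in intensities
    ]


# Hypothetical ROI intensities, and the same ROI with all intensities doubled
# (e.g., a global scaling difference between two acquisitions).
roi = [0, 24, 25, 99, 100]
roi_scaled = [2 * x for x in roi]

width_levels = discretise_fixed_bin_width(roi, bin_width=25)
width_levels_scaled = discretise_fixed_bin_width(roi_scaled, bin_width=25)
number_levels = discretise_fixed_bin_number(roi, n_bins=4)
number_levels_scaled = discretise_fixed_bin_number(roi_scaled, n_bins=4)
```

In this toy case the fixed bin number result is identical before and after rescaling, while the fixed bin width result changes. This is one reason the choice of discretisation may interact with intensity normalization when optimizing the robustness of texture features.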

CONCLUSIONS
There is variation in the repeatability of CMR radiomics features, which is likely to be clinically relevant. In this paper we present repeatability performance of a comprehensive range of commonly used CMR radiomics features. The work is intended to guide future researchers to select the most robust radiomics features for clinical modeling. Further work in larger and richer datasets and experimentation with different technical approaches is needed to further define the repeatability and reproducibility of CMR radiomics and to ascertain the optimal technical approach for radiomics analysis for maintaining feature robustness.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://thevolumesresource.com.

AUTHOR CONTRIBUTIONS
ZR-E, SEP, KL, NCH, and PBM conceived the study. ZR-E and PG wrote the manuscript. ZR-E and SEP analyzed the CMR scans. JC supervised and advised on the statistical analysis. PG extracted radiomics features and conducted the statistical analysis. AJ contributed to manuscript editing and statistical analysis. JA, ANB, RHD, CHM, and JCM collated the studies in the VOLUMES resource. All authors provided critical review of the manuscript.