Exploring the Utility of Radiomic Feature Extraction to Improve the Diagnostic Accuracy of Cardiac Sarcoidosis Using FDG PET

Background: This study aimed to explore radiomic features from PET images for the detection of active cardiac sarcoidosis (CS).
Methods: Forty sarcoid patients and twenty-nine controls were scanned using FDG PET-CMR. Five feature classes were compared between the groups. Two different segmentations were drawn from the PET images alone. For segmentation A, a region of interest (ROI) was manually delineated around myocardial hot regions with a standardized uptake value (SUV) higher than 2.5 in patients and around the normal myocardium in controls. For segmentation B, a second ROI was drawn over the entire left ventricular myocardium in both study groups. The conventional metrics and radiomic features were then extracted for each ROI. The Mann-Whitney U-test and a logistic regression classifier were used to compare the individual features between the study groups.
Results: For segmentation A, the SUVmin had the highest area under the curve (AUC) and greatest accuracy among the conventional metrics. However, for both segmentations, the AUC and accuracy of the TBRmax were relatively high, >0.85. Twenty-two (from segmentation A) and thirty-five (from segmentation B) of 75 radiomic features fulfilled the criteria: P-value < 0.00061 (after Bonferroni correction), AUC >0.5, and accuracy >0.7. Principal component analysis (PCA) was conducted, with five components giving a cumulative variance higher than 90%. Ten machine learning classifiers were then trained and tested. Most of them had AUCs and accuracies ≥0.8. For segmentation A, the AUCs and accuracies of all classifiers were >0.9, with the k-neighbors and neural network classifiers the highest (both equal to 1). For segmentation B, four classifiers had AUCs and accuracies ≥0.8; the Gaussian process classifier had the highest AUC and accuracy (0.9 and 0.8, respectively).
Conclusions: Radiomic analysis of these specific PET data was not shown to be necessary for the detection of CS. However, building an automated procedure would help to accelerate the analysis and could lead to more reproducible findings across different scanners and imaging centers, thereby improving the standardization procedures that are important for clinical trials and for the development of more robust diagnostic protocols.


INTRODUCTION
Sarcoidosis is a multisystem, granulomatous inflammatory disease of unknown etiology, characterized by the presence of non-caseating granulomas in the involved organs (1,2). Sarcoidosis primarily affects the lungs; pulmonary involvement has been identified in more than 90% of reported cases (3,4). However, it can also affect extrapulmonary organs, including the heart (5). Clinically, cardiac involvement is uncommon, manifesting in only ∼5% of sarcoid patients, but it can occur without apparent symptoms, i.e., as a "clinically silent" disease. This is reflected in the high rate of cardiac involvement in autopsy studies, where at least 25% of patients with sarcoidosis are found to have cardiac involvement (6-8).
The challenge in diagnosing cardiac sarcoidosis (CS) arises because sarcoidosis can involve any organ, which leads to variability in clinical presentation (9). In addition, the lack of reliable biomarkers or diagnostic tests makes CS difficult to diagnose. Advanced imaging modalities such as cardiovascular magnetic resonance imaging (CMR) with late gadolinium enhancement (LGE) and [18F]fluorodeoxyglucose positron emission tomography ([18F]FDG PET) have been shown in the literature to improve the identification and treatment of patients with CS. Currently, these imaging tools are critical for early diagnosis, prediction of disease progression, and monitoring of therapeutic response.
To increase the diagnostic performance of [18F]FDG PET, it is important to suppress glucose uptake by normal cardiomyocytes, as this improves specificity. Several approaches have been proposed, including a ketogenic diet (high fat and low carbohydrate), prolonged fasting, intravenous heparin, and, usually, a combination of these methods (10). However, these strategies fail to suppress physiological myocardial uptake in up to 25% of patients, which can result in false-positive findings (11). A semi-quantitative analysis can be used to diagnose CS. A common metric, the maximum standardized uptake value (SUVmax), identifies the highest uptake value within the region of interest (ROI). This can differentiate positive (CS+) and negative (CS−) results; however, in the presence of high physiological uptake, this metric fails to detect sarcoidosis within the region (12). The maximum target-to-background ratio (TBRmax) is more robust than SUVmax because it effectively normalizes for blood uptake (12,13), which makes it more reliable for comparing data across patients and institutions. Radiomic features, which rely on the spatial correlations of image values or derived image-based metrics, have the potential to provide metrics that are robust to background physiological uptake. The purpose of this study is to explore radiomic features from PET images in order to identify candidate radiomic metrics. Specifically, this study characterizes radiomic features that separate active CS from controls.

METHODS

Ethical Approval
This study was conducted with the approval of the Institutional Review Board at Mount Sinai (GCO # 01-1032), and all subjects gave written informed consent.

Subject Selection
Subjects with clinical suspicion of CS, based on demonstrated clinical manifestations of extracardiac lesions and/or disease, were recruited at Mount Sinai Hospital in New York to undergo a PET-CMR examination. All subjects were treatment-naïve and were required to avoid carbohydrates for 24 h before the scan and to fast for the final 12 h. The preparation for imaging followed the recent recommendations by Ishida et al. (14). After the acquisition, an expert cardiologist assessed the results for indications of CS; none of the included scans showed signs of failed suppression of FDG uptake. Subjects were divided into patients and controls based on their results. Subjects with patchy FDG uptake were designated as CS+ and assigned to the patient group (15), and those without either FDG or CMR findings were designated as control subjects. The control population had a normal cardiac appearance and normal echocardiography. Forty patients and twenty-nine controls met these criteria. Exclusion criteria included insulin-dependent diabetes mellitus, pretest blood glucose >200 mg/dl, menopausal phobia, pregnancy/lactation, the presence of a cardiac pacemaker or automatic implantable cardioverter-defibrillator, and renal dysfunction.

Imaging Protocol
Simultaneous CMR with LGE and [18F]FDG PET were performed on an integrated PET-CMR system (Biograph™ mMR, Siemens Healthcare, Erlangen, Germany). Patients were injected intravenously with 5 MBq/kg of [18F]FDG and then waited for 10 min. Thoracic PET acquisition (one bed position centered on the heart) took about 90 min, but only a late time window (the last 60 min) was used for this study. PET images were reconstructed using iterative ordinary Poisson ordered subset expectation maximization (OP-OSEM) with three iterations and 21 subsets on a 344 × 344 × 129 image matrix with an isotropic voxel size of 2 mm, followed by isotropic 4 mm Gaussian post-filtering. The PET data were not respiratory-gated or ECG-gated and were not corrected for any potential motion artifacts. A 3D breath-hold Dixon-based MR image was used for attenuation correction. Simultaneously with PET imaging, CMR was performed with electrocardiographic triggering; the scan included short-axis T2 mapping and cine images. Approximately 15 min after injection of 0.2 mmol/kg gadolinium, inversion-recovery fast gradient-echo LGE sequences were acquired.

Segmentations
3D Slicer software (Version 4.11.2; https://www.slicer.org) was used for the segmentation (16,17). Segmentations were performed by study personnel according to methods used in a previous study (12).

Segmentation A
From the PET images of the patient group (with CMR used for anatomical localization and, when possible, to aid identification of focal lesions), an ROI was manually drawn in the hot region of the myocardium with an SUV higher than 2.5, a cut-off value previously used to differentiate between benign (normal in the case of CS) and malignant (abnormal in the case of CS) lesions (18,19). For patients with more than one focal lesion, the largest and most active lesion was selected. Because the disease is focal, applying a threshold helped ensure that features were extracted only from voxels with abnormal uptake. For the control group, an ROI was drawn manually in the normal myocardium. Once the SUVmax and the background SUVmean (measured in the blood pool of the right atrium) were extracted, the TBRmax was calculated using the following equation:

$$\mathrm{TBR}_{\max} = \frac{\mathrm{SUV}_{\max}}{\mathrm{SUV}_{\text{mean, background}}}$$
Thirty-five of the forty subjects had a TBRmax within the range of 1 to 3 together with patchy uptake and were labeled as patients. The remaining five subjects, who had a TBRmax > 3, were excluded from segmentation A because failed suppression could not be completely discounted in these cases (12), even though their FDG uptake was patchy. These five subjects had initially been included in the study cohort and were retained in the cohort for segmentation B.
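The TBRmax calculation and the 1 to 3 inclusion range described above can be illustrated with a minimal Python sketch. The function names and the numeric inputs are hypothetical, and treating the range as inclusive at both ends is an assumption of the sketch rather than something stated in the text.

```python
def tbr_max(suv_max, suv_mean_background):
    """Maximum target-to-background ratio: lesion SUVmax divided by the
    blood-pool (right atrium) SUVmean."""
    if suv_mean_background <= 0:
        raise ValueError("background SUVmean must be positive")
    return suv_max / suv_mean_background


def include_in_segmentation_a(tbr, low=1.0, high=3.0):
    """Return True when TBRmax lies in the accepted range.

    Values above `high` are excluded because failed myocardial suppression
    cannot be ruled out. Inclusive bounds are an assumption of this sketch.
    """
    return low <= tbr <= high


# Hypothetical values for two subjects with patchy uptake.
subject_1 = tbr_max(suv_max=4.2, suv_mean_background=1.5)  # about 2.8
subject_2 = tbr_max(suv_max=6.4, suv_mean_background=1.6)  # about 4.0

print(include_in_segmentation_a(subject_1))  # True: kept as a patient
print(include_in_segmentation_a(subject_2))  # False: excluded from segmentation A
```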

Segmentation B
Because segmentation A took into account both intensity and pattern, it was useful to investigate a different approach that was independent of these. From the PET images, an ROI was drawn over the entire left ventricular myocardium for all forty patients and twenty-nine controls, regardless of TBRmax findings and SUV thresholds, to compare the reliability of features between segmentation approaches. Radiomic features and conventional metrics were then extracted.

Feature Extraction
PyRadiomics (Version 3.0.1) was used to extract five feature classes (75 features in total) from the PET image ROIs of the patients and controls (20), in addition to the conventional metrics (7 metrics). PyRadiomics adheres to the feature definitions of the Image Biomarker Standardization Initiative (IBSI). A bin width of 0.05 was applied, and all other parameters were left at their defaults. Harmonization was not required for these datasets because they originated from a single scanner. A list of all radiomic features and conventional metrics is shown in Supplementary Material 1.
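To make the role of the bin width concrete, the following sketch reproduces fixed-bin-width gray-level discretization in plain NumPy, following the formula given in the PyRadiomics documentation (bin index = floor(x / W) − floor(min(x) / W) + 1). It is an illustration only and not the extraction code used in this study; the function name and the ROI values are hypothetical, and interpreting the bin width of 0.05 as being in SUV units is an assumption.

```python
import numpy as np


def discretize_fixed_bin_width(values, bin_width=0.05):
    """Map ROI intensities to integer gray levels using a fixed bin width.

    Bin edges are aligned to multiples of `bin_width`, and the lowest
    occupied bin is assigned gray level 1.
    """
    values = np.asarray(values, dtype=float)
    lowest_bin = np.floor(values.min() / bin_width)
    return (np.floor(values / bin_width) - lowest_bin + 1).astype(int)


# Hypothetical SUVs from a small ROI.
roi_suv = [2.51, 2.56, 2.63, 3.02]
print(discretize_fixed_bin_width(roi_suv).tolist())  # -> [1, 2, 3, 11]
```

Texture matrices such as GLSZM and GLDM are built from these discretized gray levels, so a smaller bin width produces more gray levels for the same intensity range.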

Statistical Analysis
Statistical analyses were undertaken using Scikit-learn (Version 0.23.2) (21). The Mann-Whitney U-test was used to compare the radiomic features of the study groups. The P-value threshold was adjusted for multiple testing using a Bonferroni correction [0.05 divided by the number of features (82, i.e., 75 radiomic features plus 7 conventional metrics)], and a corrected P-value of < 0.00061 was considered statistically significant. Logistic regression classifiers were then trained on individual features. Stratified 5-fold cross-validation was used to determine the mean area under the curve (AUC), mean accuracy, and 95% confidence intervals (CIs). Features with a P-value < 0.00061, AUC >0.5, and accuracy >0.7 were retained. In addition, principal component analysis (PCA) was used to account for highly correlated features and reduce feature redundancy. PCA reduces a large number of features to a small number of principal components (PCs). Components that together explained 90% of the cumulative variance were retained. Lastly, to find the best machine learning (ML) algorithm, the PCs were used as input to train and test the following ten classifiers with stratified 5-fold cross-validation: Random Forest, Logistic Regression, Support Vector Machine, Decision Tree, Gaussian Process Classifier, Stochastic Gradient Descent, Perceptron Classifier, Passive Aggressive Classifier, Neural Network Classifier, and K-neighbors Classifier.
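The single-feature screening described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's analysis code: the feature values, random seeds, default logistic regression settings, and shuffled folds are all assumptions, and the Mann-Whitney U-test is taken from SciPy here.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
n_patients, n_controls = 40, 29

# Synthetic values of one feature for each group (hypothetical numbers).
patients = rng.normal(loc=2.0, scale=0.5, size=n_patients)
controls = rng.normal(loc=1.0, scale=0.5, size=n_controls)

# Bonferroni-corrected threshold: 0.05 / 82 features, about 0.00061.
alpha_corrected = 0.05 / 82

# Two-sided Mann-Whitney U-test between the groups.
_, p_value = mannwhitneyu(patients, controls, alternative="two-sided")

# Logistic regression on the single feature, stratified 5-fold CV.
X = np.concatenate([patients, controls]).reshape(-1, 1)
y = np.concatenate([np.ones(n_patients, dtype=int),
                    np.zeros(n_controls, dtype=int)])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LogisticRegression(), X, y, cv=cv,
                        scoring=["roc_auc", "accuracy"])
mean_auc = scores["test_roc_auc"].mean()
mean_accuracy = scores["test_accuracy"].mean()

# Retention rule from the text: P < 0.00061, AUC > 0.5, accuracy > 0.7.
retained = bool(p_value < alpha_corrected
                and mean_auc > 0.5
                and mean_accuracy > 0.7)
print(f"p={p_value:.2e}, AUC={mean_auc:.2f}, "
      f"accuracy={mean_accuracy:.2f}, retained={retained}")
```

In practice the same loop would be repeated over every feature column, keeping only those for which `retained` is true.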

RESULTS

Conventional Metrics Diagnostic Utility
Applying the Mann-Whitney U-test to the conventional metrics gave somewhat different results for the two segmentations. Predictably, for segmentation A, the SUVmin had the highest AUC and greatest accuracy, because an SUV >2.5 was specified as the minimum value for the patient group, while for segmentation B the highest performance was for TBRmax (see Figure 1). However, the AUC and accuracy of the TBRmax were relatively high for both segmentations and were similar regardless of the segmentation approach (AUC 0.96; accuracy 0.88 and 0.89 for segmentations A and B, respectively). The slight difference in TBRmax results between the segmentations arose from the difference in the number of participants in the patient group who met the criteria for each segmentation.

Individual Radiomic Features Diagnostic Utility
In the Mann-Whitney U-tests, 40 of the 75 radiomic features for segmentation A and 61 of the 75 for segmentation B showed statistically significant differences between patients and controls, with a P-value < 0.00061. The five best radiomic features based on P-values for both segmentations are shown in Table 1. After applying a logistic regression classifier, only 22 radiomic features for segmentation A and 35 radiomic features for segmentation B fulfilled all of the following criteria: P-value < 0.00061, AUC >0.5, and accuracy >0.7. The AUC and accuracy (with 95% CIs) from stratified 5-fold cross-validation of the five best-performing radiomic features, ranked by AUC, are shown in Figure 2. All values of radiomic features and conventional metrics for both segmentations are provided in Supplementary Material 2.
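The text does not state how the 95% CIs were derived from the five cross-validation folds. As one common possibility, the sketch below computes a Student's t interval around the mean of the fold scores; the choice of interval, the clipping to the range 0 to 1, the function name, and the fold values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five folds minus one).
T_CRIT_DF4 = 2.776


def mean_and_ci95(fold_scores):
    """Return (mean, lower, upper) for exactly five fold scores.

    The interval is mean +/- t * s / sqrt(n), clipped to [0, 1] because
    AUC and accuracy cannot leave that range.
    """
    if len(fold_scores) != 5:
        raise ValueError("this sketch hard-codes the t value for five folds")
    m = mean(fold_scores)
    half_width = T_CRIT_DF4 * stdev(fold_scores) / sqrt(len(fold_scores))
    return m, max(0.0, m - half_width), min(1.0, m + half_width)


# Hypothetical per-fold AUCs for one radiomic feature.
fold_aucs = [0.90, 0.85, 0.95, 0.80, 0.88]
m, lower, upper = mean_and_ci95(fold_aucs)
print(f"AUC {m:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

With only five folds the interval is wide even for moderate fold-to-fold variation, which is consistent with the large error bars reported for individual radiomic features.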

Principal Component Analysis and Machine Learning
Because the SUV-related metrics tend to outperform the other features, and in order to study the performance of non-first-order features, the SUV-related metrics were excluded from the PCA. Applying PCA, five PCs were retained, which together explained 90% of the variance. These PCs were used to train and test the ML classifiers, most of which had AUCs and accuracies ≥0.8.
For segmentation A, all classifiers showed high performance in terms of AUC (95% CI 0.88-1.00) and accuracy (95% CI 0.87-1.00), with values >0.9. The k-neighbors and neural network classifiers showed the highest AUC and accuracy, with values equal to 1.00, as shown in Figure 3.
For segmentation B, four classifiers had AUCs and accuracies ≥0.8 (Figure 3). However, the Gaussian process classifier showed the highest AUC and accuracy (0.9 and 0.8, respectively).
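The PCA and classifier comparison can be sketched with scikit-learn as below. The data are synthetic, only three of the ten classifiers are shown, and several choices are assumptions of the sketch rather than details given in the text: standardizing features before PCA, fitting the scaler and PCA inside each cross-validation fold (to avoid information leaking from the test fold), shuffled folds, and default hyperparameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients, n_controls, n_features = 40, 29, 20
y = np.array([1] * n_patients + [0] * n_controls)

# Synthetic, correlated features driven by three latent factors that
# differ between the groups (hypothetical stand-in for radiomic features).
latent = rng.normal(size=(len(y), 3)) + 1.5 * y[:, None]
mixing = rng.normal(size=(3, n_features))
X = latent @ mixing + 0.3 * rng.normal(size=(len(y), n_features))

# Number of components needed to explain at least 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
pca.fit(StandardScaler().fit_transform(X))
n_components_kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()

classifiers = {
    "logistic_regression": LogisticRegression(),
    "k_neighbors": KNeighborsClassifier(),
    "gaussian_process": GaussianProcessClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, clf in classifiers.items():
    # Scaling and PCA are refitted on the training part of every fold.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=0.90, svd_solver="full"),
                          clf)
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["roc_auc", "accuracy"])
    results[name] = (scores["test_roc_auc"].mean(),
                     scores["test_accuracy"].mean())

print(f"{n_components_kept} PCs explain {explained:.1%} of the variance")
for name, (auc, acc) in results.items():
    print(f"{name}: AUC={auc:.2f}, accuracy={acc:.2f}")
```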

DISCUSSION
This study aimed to explore the diagnostic utility of radiomic features compared with conventional metrics for distinguishing between the study groups, and to find the best-performing ML classifier for an automated model. For segmentation A, some conventional metrics such as SUVmin showed high performance individually. These results were predictable because they are affected by the distribution of voxel intensities within the ROI, which was one of the criteria for including the patients in the first place. In addition, these features cannot be relied upon because they are greatly affected by the success of glucose suppression in normal cardiomyocytes. TBRmax was the most reliable of the conventional metrics in both segmentations. Although TBRmax is sensitive to noise and is not necessarily easy to harmonize across different scanners, imaging centers, types of data, and parameters, this was not an issue in this study because the datasets originated from a single scanner and institution. Therefore, when TBRmax is compared with the five best-performing radiomic features, its superiority can be clearly seen. This outcome supports previous studies that utilized TBRmax. For segmentation A, comparison of the diagnostic utility of individual radiomic features showed that GLSZM-Large Area High Gray Level Emphasis had the best performance in terms of AUC and accuracy. This feature measures the proportion in the image of the joint distribution of larger size zones with higher gray level values, which indicates a difference in gray level zones between patients and controls. However, it cannot be considered reliable because the criteria of this segmentation approach are based on the SUV threshold and TBRmax. On the other hand, for segmentation B, the best-performing radiomic feature was GLDM_Dependence Non-Uniformity, with an AUC of 0.87 and an accuracy of 0.83. This feature measures the heterogeneity in the ROIs. Its values are higher in sarcoid patients than in controls, which indicates more heterogeneous regions in the patient group. In addition, many other features that measure heterogeneity had high AUCs and accuracies. These features reflect spatial relationships rather than the voxel values themselves. However, they had large error bars, unlike the TBRmax, which had very small error bars regardless of the segmentation approach.
Several studies of different diseases have advocated the importance of radiomic analysis for predicting outcomes (22,23). However, the findings of these studies have not been replicated; instead, they conflict. Technical factors may explain the differences in results among studies, such as ROI size, scanner resolution, reconstruction, and segmentation algorithms, or other unidentified factors. High scanner resolution and a large number of voxels can affect some radiomic features by increasing their values (24). In terms of segmentation algorithms, numerous studies have indicated that different segmentation methods give similar results in survival analyses (23,25). In addition, Cheng et al. (23) argued that, unlike SUVmax and SUVmean, radiomic features do not differ significantly when different segmentation methods are used. They also reported that the effect of different attenuation correction methods on radiomic features was not significant. In contrast, Yip et al. (26) found that some features were affected by the attenuation correction method. In this study, however, there was a clear difference between radiomic features when different segmentation approaches were used. This may be due to the different ROI sizes and the voxel intensities included in each segmentation. Segmentation A can provide good differentiation between study groups based on conventional metrics such as SUVmin and TBRmax. However, this approach can be influenced by observer experience, especially for cases with very small hotspots. Conversely, the segmentation B approach is more robust and efficient.
This study is subject to some limitations. First, the sample size is relatively small, and larger studies are needed to confirm these results. This is important for preventing overfitting and type I errors. Applying a Bonferroni correction and dimensionality reduction techniques reduced the effect of this issue. Other limitations are the lack of an automated segmentation, the lack of a reference segmentation for comparison, and the unavailability of an independent clinical gold standard to validate the performance of the model, which was trained on the initial input data. In addition, the selection of only one focal lesion per patient in segmentation A is a limitation of this approach. Furthermore, the models proposed in this study should be validated in normal controls showing nonspecific physiological uptake. This study showed uncertain results for radiomic features, and it needs to be expanded to test the reproducibility of the results. The new knowledge gained from this study is that radiomic analysis does not provide any additional information related to disease activity in these patients. However, building an automated model that does not depend on the strategies used for glucose suppression and/or on observer experience may prove helpful in further studies. Finally, the MRI acquisitions were not utilized in this study, except for providing anatomical information; the main goal was to study radiomic features on PET, the designated tool for CS.

CONCLUSION
Radiomic analysis of PET data may not be a useful approach to detect CS. Several radiomic features that were not related to first-order tracer uptake showed high AUC and accuracy with P-value < 0.00061. However, the large error bars on these AUCs and accuracies weaken the results. TBRmax was superior to all other conventional metrics and radiomic features in both segmentation approaches. This methodology needs to be validated further in normal control subjects showing non-specific physiological uptake.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
This study was conducted with the approval of the Institutional Review Board at Mount Sinai. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
NM segmented all the datasets using 3D Slicer software, analyzed the data using PyRadiomics, performed the statistical analysis, analyzed the results, and wrote the manuscript. GS shared datasets, reviewed segmentations, and helped in modifying code as well as in the guidance of the project. LD wrote Python code, helped to modify it, and provided essential guidance on how to optimize the radiomic analysis and the machine learning approaches. MT facilitated the availability of data. MT, ZF, and PR contributed to reviewing the manuscript and to the overall guidance of the project and data. ZF is the PI of the NIH grant. CT supervised the specific study and helped in restructuring and reviewing the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
NM is fully funded by Taif University, Saudi Arabia. GS and PR are supported by NIH grant R01HL071021. LD is fully funded by the EPSRC Centre for Doctoral Training in Tissue Engineering and Regenerative Medicine: Innovation in Medical and Biological Engineering (grant number EP/L014823/1).