Preoperatively Estimating the Malignant Potential of Mediastinal Lymph Nodes: A Pilot Study Toward Establishing a Robust Radiomics Model Based on Contrast-Enhanced CT Imaging

Purpose To establish and validate a radiomics model to estimate the malignancy of mediastinal lymph nodes (LNs) based on contrast-enhanced CT imaging. Method In total, 201 pathologically confirmed mediastinal LNs from 129 patients were enrolled and assigned to training and test sets. Radiomics features were extracted from the region of interest (ROI) delineated on venous-phase CT imaging of LN. Feature selection was performed with least absolute shrinkage and selection operator (LASSO) binary logistic regression. Multivariate logistic regression was performed with the backward stepwise elimination. A model was fitted to associate mediastinal LN malignancy with selected features. The performance of the model was assessed and compared to that of five other machine learning algorithms (support vector machine, naive Bayes, random forest, decision tree, K-nearest neighbor) using receiver operating characteristic (ROC) curves. Calibration curves and Hosmer-Lemeshow tests were used to assess the calibration degree. Decision curve analysis (DCA) was used to assess the clinical usefulness of the logistic regression model in both the training and test sets. Stratified analysis was performed for different scanners and slice thicknesses. Result Among the six machine learning methods, the logistic regression model with the eight strongest features showed a significant association with mediastinal LN status and the satisfactory diagnostic performance for distinguishing malignant LNs from benign LNs. The accuracy, sensitivity, specificity and area under the ROC curve (AUC) were 0.850/0.803, 0.821/0.806, 0.893/0.800, and 0.922/0.850 in the training/test sets, respectively. The Hosmer-Lemeshow test showed that the P value was > 0.05, indicating good calibration, and the calibration curves showed good agreement between the classifications and actual observations. DCA showed that the model would obtain more benefit when the threshold probability was between 30% and 90% in the test set. Stratified analysis showed that the performance was not affected by different scanners or slice thicknesses. There was no significant difference (DeLong test, P > 0.05) between any two subgroups, which showed the generalization of the radiomics score across different factors. Conclusion The model we built could help assist the preoperative estimation of mediastinal LN malignancy based on contrast-enhanced CT imaging, with stability for different scanners and slice thicknesses.


INTRODUCTION
Mediastinal lymph nodes (LNs) are the most common metastatic targets of lung cancer, esophageal cancer and other malignant tumors. LN metastasis has been suggested to be very important for tumor staging, treatment plan selection and prognosis prediction (1)(2)(3)(4). Preoperative staging of the mediastinum is an essential task (5). Preoperative staging is helpful for clinicians to fully understand the patient's condition and to make better diagnosis and treatment decisions. However, it is difficult to estimate mediastinal LN metastasis before surgery in the clinic. Metastasis does not necessarily cause lymphadenopathy. Some LNs less than 1 cm are finally confirmed as metastasis, while some with obvious enlargement are confirmed as chronic lymphadenitis or reactive hyperplasia.
Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is the most effective diagnostic method for estimating the malignancy of mediastinal LNs (6)(7)(8)(9). However, EBUS-TBNA is not popular enough in developing countries. It is also an invasive procedure with the possibility of complications (10)(11)(12) and false negative results (13)(14)(15)(16). It is meaningful to find a method to assist clinicians in determining the status of mediastinal LNs preoperatively.
It is worth noting that CT scan can be performed before surgery. Some researchers point out that radiomics is very promising in preoperative staging of the mediastinum (5). "Radiomics" refers to an approach that can extract innumerable quantitative features using advanced feature analysis with high throughput from digital medical images, including computed tomography (CT), magnetic resonance (MR), and positron emission tomography (PET) images (17). Radiomics has been indicated to be effective for the diagnosis, treatment evaluation and prognosis prediction of tumors (18)(19)(20), as well as tumor phenotype decoding (18,21). Recent evidence suggests that radiomics features have the capability to predict the distant metastasis of lung adenocarcinoma (22) and LN metastasis of colorectal cancer (23). Several researchers have attempted to estimate the malignancy of primary lung lesions with radiomics (22,24) or mediastinal LNs with texture (25,26). The results were promising. A recent research confirms that the CT radiomics models show good performances for the diagnosis of LN metastasis in non-small cell lung cancer patients (27). However, with different equipment and scanning protocols in different studies, the repeatability and applicability of the model is relatively low.
In our study, a robust model with validation by stratified analysis (28) was established for malignant estimation based on the radiomics analysis of mediastinal LNs on contrast-enhanced CT. We hypothesized that the robust model could be effective in distinguishing the malignancy of LNs, and it could be generally applied on different scanners or slice thicknesses with stable performance.

Patients
The retrospective study was approved by the institutional review board of the First Affiliated Hospital of China Medical University, and the requirement for informed consent was waived. A review of medical records was performed in accordance with the guidelines of the ethical review committee of the institution. One hundred and twenty-nine patients (73 men, 56 women, age: 18-93, mean age: 56, median: 57) with both EBUS-TBNA and contrast-enhanced chest CT from July 2014 to July 2018 were included. The inclusion criteria were as follows: (i) patients with pathological results from EBUS-TBNA; and (ii) contrast-enhanced chest CT performed within two weeks before EBUS-TBNA. The exclusion criteria were as follows: (i) patients with primary malignancy but negative EBUS-TBNA results; (ii) lesions difficult to delineate; and (iii) lesions difficult to match with the biopsy on the image. Most of the biopsied LNs were at 4R, 4L and 7 (4R, right lower paratracheal; 4L, left lower paratracheal; 7, subcarina). LNs were classified as benign or malignant with pathology as the gold standard.
The flowchart of the enrollment of patients and data assignment is shown in Figure 1.

CT Image Acquisition
CT scans were performed on Siemens (SOMATOM Force, 192section dual-source CT), GE (Discovery CT 750 HD, 64 multidetector CT system), Toshiba (Aquilion One, 320 slice) and Philips (Brilliance iCT, 256 slice) systems. The patients were maintained in the supine position, their arms were raised, and they held breath after deep inhalation for acquisition. The scan parameters were as follows: 120 kVp, 100-200 mA, 350 mm×350 mm field of view, and matrix of 512 × 512. Venous-phase contrast-enhanced CT was performed after a 30−40 s delay following the intravenous administration of 70−90 mL of nonionic contrast-enhanced medium (Iproamine) and 35-45 mL of normal saline at a rate of 2.5−3.0 mL/s with a power injector. Images were retrieved separately from the Picture Archiving and Communication System (PACS) (IMPAX, AGFA, Belgium).

Feature Extraction
The study workflow is shown in Figure 2. CT images were extracted from PACS and imported into ITK-SNAP software (version 3.8.0; http://www.itksnap.org). A thoracic radiologist (reader 1) with 10 years' experience, who was blinded to the pathological results, reviewed the CT images on mediastinal window settings (W: 350, L: 50) and drew the region of interest (ROI) on the venous-phase contrast-enhanced images by using ITK-SNAP. The ROI was manually drawn along the boundaries of the LN on the maximal 2D axial slice.
Six months after the initial segmentation by reader 1, 50 LNs were randomly selected and resegmented by reader 2 with 7 years' experience, who was blinded to the pathological results, in order to assess the interreader agreement in the extracted radiomics features. The intraclass correlation coefficient (ICC) was calculated.
Both the raw digital imaging and communications in medicine (DICOM) images and the corresponding ROIs were imported into A.K. software (Artificial Intelligence Kit, GE Healthcare, China) for the automatic quantitation of radiomics features compliant with the Image Biomarker Standardization Initiative (IBSI) (32).

Radiomics Model Building
To reduce overfitting, least absolute shrinkage and selection operator (LASSO) regression analysis with 10-fold crossvalidation was conducted and repeated 100 times before modeling to identify related features with nonzero coefficients to differentiate malignant and benign LNs in the training set. Multivariate logistic regression with the backward stepwise elimination method was used to select the strongest features associated with the probability of malignancy of LNs.
The radiomics score was calculated for each LN via a linear combination of the selected strongest features that were weighted by their respective coefficients. To compare different machine learning algorithms in distinguishing malignant LNs from benign LNs, five other machine learning models, including support vector

Radiomics Model Evaluation
The diagnostic performance of different machine learning models for LN metastasis was evaluated by using receiver operating characteristic (ROC) curves and calculating the diagnostic accuracy in both the training and test sets. The performance of the established radiomics score was evaluated with discrimination, calibration and clinical application in both the training and test sets.
1. Discrimination. The discrimination degree (which measures the ability of model to distinguish malignant and benign LNs) of the model was assessed by the area under the ROC curve (AUC). The optimal cutoff of the radiomics score calculated from the training set was applied in the test set. 2. Calibration. The calibration degree (which measures the agreement between the observed outcome frequencies and classified probabilities) of the model was evaluated by plotting the calibration curve. The Hosmer-Lemeshow test was used to determine the goodness of fit of the models, and P values of more than 0.05 were considered well calibrated. 3. Clinical application. To evaluate the potential clinical application of the estimation model, decision curve analysis (DCA) was conducted by quantifying the net benefits at different threshold probabilities.
We used the same machine learning process as above to analyze the non-enhanced images. The progress included drawing ROI, features extraction, modeling and model evaluation.

Statistical Analysis
All statistical analyses were performed using R studio software (version 1.2.1335). LASSO regression was performed using the "glmnet" package. ROC curves were plotted using the "pROC" package. DCA was performed using the "DecisionCurve" function. P values of less than 0.05 (two-sided) were considered statistically significant. The differences in clinical factors and malignancy-related features between the malignant group and benign group in both the training and test sets were assessed by independent t tests according to the distribution type of the data. The chi-squared test was used to compare the significance of the differences between categorical variables. The DeLong test was used to compare the differences in the AUC values of different models.

Pathological Results and Clinical Information
The pathological results and clinical information of 201 mediastinal LNs in 129 patients are shown in Table 1 and Table 2. There was no significant difference (P > 0.05) in malignant and benign LNs between the training set and test set. Within the training set, there were no significant differences (P > 0.05) in gender or age between the benign and malignant groups. Within the test set, the malignant group was older (P = 0.001), and the proportion of males was higher (P < 0.001).

Radiomics Model Building
To build a robust radiomics model, only the features that showed high stability remained for the next step. Two hundred and seventy-four features with an ICC greater than 0.75 remained ( Figure 3).
For feature selection, 26 radiomics features with a nonzero coefficient after the LASSO regression method remained (shown in Figure 4). The eight strongest features among the 26 features were identified after further stepwise elimination in multivariate logistic regression. Then, the radiomics score was established based on the eight features (the formula and interpretation are shown in Supplementary A1), and the calculation was as follows: We additionally used two other methods to select radiomics features. They were Spearman correlation analysis with univariate logistic regression, and Spearman correlation analysis with Gradient Boost Decision Tree (GBDT) and multivariate logistic regression. The classification performances of the two methods were similar to those of LASSO regression. Details are shown in Supplementary A2.

Radiomics Model Evaluation
For all six machine learning algorithms, the AUCs were greater than 0.65 in the test set ( Table 3). We used DeLong test to compare the classification performances of the six machine learning algorithms (please see Supplementary Tables S4, S5).    A B D C the training and test sets (Figures 7A, B). Nonsignificant statistics were achieved in the Hosmer-Lemeshow test in the training set (P = 0.6898) and test set (P = 0.1762). 3. Clinical application. DCA showed that the radiomics model would add more benefit in distinguishing malignant and benign LNs when the threshold probability was between 10% to 100% in the training set or between 30% to 90% in the test set ( Figures 7C, D).
The classification performances of the model of nonenhanced images are reported in Supplementary A3. The classification performances of the model of venous-phase images were better than those of model of non-enhanced images.

DISCUSSION
In the present study, we established a robust radiomics model for distinguishing malignant from benign mediastinal LNs based on contrast-enhanced CT imaging. Stratified analysis demonstrated the stable performance of the radiomics model, which was not affected by different scanners or slice thicknesses. Calibration curve and DCA suggested good fitness and the potential prospect of clinical application.
The accurate evaluation of mediastinal LNs before surgery plays a pivotal role in staging and treatment planning for lung cancer. Ultrasound, CT and PET/CT are the main methods, with size, shape, internal structure and contrast uptake as the main criteria for evaluation. The accuracy for mediastinal LN evaluation is not satisfactory at present. The reasons are as follows. (1) Size is not the gold standard, as smaller LNs might be early metastases. internal structure is limited. (4) The enhancement of metastatic LNs varies. Heterogeneity is an important feature of malignancy, presumably reflecting infiltration, angiogenesis and necrosis (33)(34)(35). Studies of perfusion CT suggested that the blood flow might be more heterogeneous in metastatic nodes than in nonmetastatic nodes (36). However, it was difficult to tell subtle differences by the naked eye, even for experienced radiologists. For further evaluation of the distinct information of the image, including the heterogeneity of the contrast inside the nodes, contrast-enhanced CT was included, and radiomics features were extracted in our study.
Eight potential features remained in our model: a first-order histogram feature (Percentile20), a Haralick feature (Sum Average), GLCM features (GLCMEnergy, Cluster Shade, Inertia), GLRLM features [HighGrayLevelRunEmphasis (HGLRE), LongRunHighGrayLevelEmphasis (LRHGLE)] and a morphological feature (Surface Area). Surface Area was the only selected morphological feature with a relatively high coefficient of 2.504. Surface Area is an approximation of the surface of the ROI in mm 2 , which suggests that the larger the maximum crosssectional area of the LN, the more likely it is to be malignant. The other features are calculated based solely on pixel or pixel pairs and provide meaningful information that is invisible to the naked eye, which reflects the characteristics of heterogeneity in addition to morphological features. Percentile20 showed the strongest negative correlation, with a coefficient of -10.941. It was one of the first-order histogram features used in statistics, indicating that the image intensity value below 20% of observations in a group of observations falls. The GLCM features were the most important radiomics category for the malignancy of LNs, with 3 features included in the model. They represent the joint probability of certain sets of pixels having certain gray-level values. GLCMEnergy was the strongest positive feature with a coefficient of 2.592. Energy measures the homogeneous patterns in the image. A greater energy implies that there are more instances of intensity value pairs in the image that neighbor each other at higher frequencies. Cluster Shade measures the skewness and uniformity of the GLCM, and a higher cluster shade implies greater asymmetry about the mean. Inertia reflects the clarity of the image and the depth of the texture groove, the contrast is proportional to the texture groove, and high values of the groove produce more clarity; in contrast, small values of the groove would result in small contrast and fuzzy images. In our study, inertia showed a negative relation with malignancy.
To stabilize the performance of the radiomics model, several steps were performed in our study. (1) Strict inclusion and exclusion criteria were followed to avoid potential bias. Nine patients with primary malignancy but negative EBUS-TBNA were excluded considering the probability of false negative results of the EBUS-TBNA. Occasionally, small LNs could be biopsied during EBUS, but they are difficult to match or delineate on chest CT. Fifteen patients were excluded for this reason. (2) A.K. software, which has been used and verified in many studies, was applied for radiomics feature extraction. (3) The ICC was calculated to verify the reproducibility of feature extraction between two readers. (4) Multiple evaluations of the efficiency and stability of the model were performed. The AUC, classification probability and DCA were calculated, representing the discrimination, calibration and clinical application, respectively. (5) For a robust radiomics model fit for different scanners or slice thicknesses, the stratified analysis was conducted in the study. The radiomics model was applied to each subset divided by scanners and slice thicknesses and shown to be slightly affected by different scanners or slice thicknesses, suggesting robustness.
Our study had several limitations. First, it was a retrospective single-center study, and the sample size was relatively small, although the images were derived from various scanners, and the stratified analysis was performed among different scanners and different slice thicknesses. The robustness of the classification model needs to be tested in a multicenter dataset. Second, for reproducibility, we manually drew ROIs on one 2D slice instead of 3D slices. It is easy to process and analyze, but it may not represent all the features and characteristics of LNs. Although the ICC suggested reliable interreader agreement for the extracted radiomics features, further study based on 3D images should be performed. Third, no clinical factors were integrated into the model, which might further improve the performance. Fourth, we didn't use deep learning to investigate our data. In future research, we will try to use deep learning to investigate our data. Fifth, we had no external validation set. In order to verify the clinical value of the classification performances, we need external validation set. Sixth, in this work, the distribution of age and gender in the training set and the test set was inconsistent. But there was no statistical difference in the distribution between the training set and the test set. Additionally, within the training set, there were no significant differences in gender and age between the benign and malignant groups. It largely avoided that the trained model was a model for diagnosing gender or age, although there were statistical differences in gender and age within the test set.
In conclusion, we established a robust radiomics model for distinguishing malignant from benign mediastinal LNs based on contrast-enhanced CT imaging, with stability for different scanners and slice thicknesses. The radiomics model has the potential for clinical application.

DATA AVAILABILITY STATEMENT
Derived data supporting the findings of this study are available from the corresponding author LZ on request. Requests to access the datasets should be directed to LZ, zhanglnda@163.com.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the institutional review board of the First Affiliated Hospital of China Medical University. This is a retrospective study, and the requirement for informed consent was waived.

AUTHOR CONTRIBUTIONS
MD and LZ mainly wrote the paper. GH, SL and NL were in charge of designing the method and data acquisition. KX and LZ designed the study. The manuscript was revised by all authors and the final version of the manuscript was approved by all authors. All authors contributed to the article and approved the submitted version.