Diagnostic Performance of 2D and 3D T2WI-Based Radiomics Features With Machine Learning Algorithms to Distinguish Solid Solitary Pulmonary Lesion

Objective To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify solitary pulmonary lesions (SPLs) based on magnetic resonance (MR) T2-weighted imaging (T2WI).
Material and Methods A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), precision-recall plot, and Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches.
Results The 3D features were significantly superior to 2D features, yielding many more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed results similar to those of 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively.
Conclusions After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because more machine learning combinations perform well with them. In the current study, the feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP were more likely to demonstrate better diagnostic performance for 3D features.


INTRODUCTION
A solitary pulmonary lesion (SPL) is one of the most common findings on chest radiographs and computed tomography (CT). An increasing number of pulmonary nodules are detected by CT with the improvement in lung cancer screening. However, most of these positive detections are not cancerous (1). The high false-positive rate can lead to wasted medical resources, additional radiation exposure, and unnecessary patient anxiety. Recent advances in magnetic resonance imaging (MRI) techniques make it possible to use lung MRI in routine clinical practice. Published evidence showed that lung MRI could be a potentially effective screening tool because its performance was comparable with that of low-dose CT (2), even with a lower false-positive rate for nodule detection (3). A conventional sequence, such as T2-weighted imaging (T2WI), has the potential to detect pulmonary nodules no less than 6 mm in diameter (4), which is essential for screening. However, as a morphological sequence, it may have limited value in distinguishing malignant from benign SPLs.
Radiomics is an emerging field that extracts a large number of quantitative features from medical imaging and quantifies tumor heterogeneity related to cellularity, necrosis, and angiogenesis in the tumor microenvironment (5). Therefore, radiomics provides the possibility for early and accurate diagnosis of SPLs. Radiomics can increase the diagnostic accuracy of baseline CT (6). In addition, studies have shown the potential of radiomics based on CT and MRI in distinguishing pulmonary lesions (7,8). However, MR radiomics research focusing on differentiating SPLs has not yet been reported. Also, the performance of 2D and 3D CT features in pulmonary lesions has been shown to be controversial in different studies (9,10). However, the performance of 2D and 3D MR features as well as their corresponding optimal machine learning methods in distinguishing SPLs has not been discussed. These issues are crucial for further generalization of MR radiomics in clinical research and application in the lung.
The present study aimed to develop and validate a T2WI-based radiomics classifier to differentiate between malignant and benign SPLs. In addition, different machine learning methods were evaluated to achieve the best performance. Furthermore, 2D and 3D features and their combination with clinical features were compared.

Data Cohort
This retrospective study was approved by the local ethics committee of the hospital, which waived the need for patients' informed consent. Preoperative MRI data of 231 patients with chest lesions from November 2015 to April 2018 were analyzed. The inclusion criteria were as follows: (a) lesions were measurable on previous CT scan or T2-weighted imaging; (b) no contraindication for MR examinations; and (c) patients received no therapies or anti-inflammatory therapies at least 2 weeks before the MRI scan, and lesions showed no shrinkage. The exclusion criteria were as follows: (a) operations or biopsies were not available (n = 27); (b) multiple lesions were reported (n = 42); and (c) mediastinal or pleural neoplasms were found (n = 30).
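The cohort numbers can be checked with a few lines of arithmetic. The sketch below uses the counts reported in the text and the 92/40 split stated in the abstract; the helper `random_split`, its seed, and the use of integer patient ids are illustrative assumptions, and the text does not state whether the actual split was stratified.

```python
import random

# Counts reported in the text.
screened = 231
excluded = {
    "no operation or biopsy": 27,
    "multiple lesions": 42,
    "mediastinal or pleural neoplasm": 30,
}
included = screened - sum(excluded.values())  # patients left for analysis


def random_split(ids, n_train, seed=0):
    """Shuffle patient ids reproducibly and cut them into training and test lists.

    Hypothetical helper: the seed and the unstratified shuffle are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = random_split(range(included), n_train=92)
print(included, len(train_ids), len(test_ids))
```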

Lesion Segmentation
Mass segmentation was performed to select the entire tumor using open-source software (ITK-SNAP v. 3.6.0, http://www.itksnap.org). Regions of interest (ROIs) of lesions were segmented manually by the consensus of two radiologists with 3 and 8 years of experience (Figure 1). The ROIs included the whole tumor and excluded visible air regions.

Extraction of Features
The radiomics features were extracted by the Philips Radiomics tool (Philips Medical Systems, Shanghai, China) based on PyRadiomics (11). The hyperparameters were set to the PyRadiomics defaults. The details are described on the website: https://pyradiomics.readthedocs.io/en/latest/features.html. For each ROI, a total of 1692 3D and 1231 2D radiomics features, including direct features, indirect features, wavelet transform features, and Laplacian of Gaussian filtered features, were extracted as described in a previous study (12). The 2D features were generated using the slice with the maximum area in the 3D ROI. The basic clinical data, sex and age, were included as clinical features. The flow chart for the data processing is displayed in Figure 2.
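Choosing the slice with the maximum area in the 3D ROI can be sketched as follows. This is a minimal illustration, not the extraction tool's code: it assumes the ROI is a binary mask stored as a list of 2D slices, that area is measured as the voxel count per slice (constant in-plane spacing), and that ties go to the first slice. The function name `max_area_slice` is hypothetical.

```python
def max_area_slice(mask):
    """Return (index, area) of the slice containing the most ROI voxels.

    `mask` is a 3D binary ROI given as a list of 2D slices,
    each slice being a list of rows of 0/1 values.
    """
    areas = [sum(sum(row) for row in sl) for sl in mask]
    best = max(range(len(areas)), key=areas.__getitem__)  # first maximum wins
    return best, areas[best]


toy_mask = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # area 1
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],  # area 5
    [[0, 0, 0], [1, 1, 0], [0, 0, 0]],  # area 2
]
index, area = max_area_slice(toy_mask)
print(index, area)
```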

Radiomics Feature Selection and Classifier Building
Building one machine learning model usually consisted of the following steps: (1) normalizing each feature to avoid the effect of the scale; (2) reducing the dimension of the feature space to remove redundant information; (3) selecting features from the remaining features according to the label; and (4) training a classifier to map the selected features onto the diagnosis. In each step, different methods could be selected to make a machine learning pipeline for the final diagnosis.
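Step (2) can be illustrated with a correlation-based filter, one of the two dimension reduction options used in this study. The sketch below is an assumption about how such a filter may work, not the software's actual rule: features are visited in order, and a feature is kept only if its absolute Pearson correlation with every already kept feature is below the threshold. The default of 0.86 is the value noted in the limitations; the function names and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return cov / sqrt(vx * vy)


def drop_correlated(features, threshold=0.86):
    """Greedy filter: keep a feature only if |r| with all kept features is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of f1, so r = 1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with f1
}
kept = drop_correlated(toy)
print(kept)
```

In this toy case the redundant feature f2 is removed while f1 and f3 survive. PCA, the other option, instead replaces the features with uncorrelated linear combinations.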
Three normalization methods [Min-max Normalization (norm-unit), Z-Score Normalization (norm0center), Mean Normalization (norm0centerunit)], two dimension reduction algorithms [principal component analysis (PCA) and Pearson correlation coefficient (PCC)], three feature selection methods [analysis of variance (ANOVA), relief, recursive feature elimination (RFE)], and 10 different classifiers [support vector machine (SVM), auto-encoder (AE), linear discriminant analysis (LDA), random forest (RF), logistic regression (LR), LR-least absolute shrinkage and selection operator (Lasso), Adaboost (AB), decision tree (DT), Gaussian process (GP), naive Bayes (NB)] with 7 different feature numbers (confined to 3–9) were used to build 1260 models for the diagnosis. These methods were chosen due to their popularity in the literature. More details of the combination of the pipelines are illustrated in Figure 3. The number of features was constrained to less than 10% of the training sample size to build a robust machine learning model (13). Ten-fold cross-validation on the training dataset was applied to choose the candidate machine learning pipeline. The pipeline was selected as the optimal final model when the difference in the area under the receiver operating characteristic (ROC) curve (AUC) between the training and validation sets on cross-validation was less than 0.1, and the AUC in the test set was the highest. Grid search was used for hyperparameter tuning. The search grid of each classifier is documented in Supplementary Table 1.
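The model count and the selection rule can be written out as a short sketch. The method lists come from the text; the function `pick_final`, the dictionary layout, and the AUC values in `toy_results` are invented purely for illustration and are not results of this study.

```python
from itertools import product

normalizers = ["min-max", "z-score", "mean"]
reducers = ["PCA", "PCC"]
selectors = ["ANOVA", "Relief", "RFE"]
classifiers = ["SVM", "AE", "LDA", "RF", "LR", "LR-Lasso", "AB", "DT", "GP", "NB"]
feature_counts = range(3, 10)  # 3 to 9 features, below 10% of the 92 training cases

# Every combination of one method per step and one feature number.
pipelines = list(product(normalizers, reducers, selectors, classifiers, feature_counts))


def pick_final(results, max_gap=0.1):
    """Among pipelines whose training/validation AUC gap is below max_gap,
    return the one with the highest test AUC (None if no pipeline qualifies)."""
    stable = [r for r in results if abs(r["auc_train"] - r["auc_val"]) < max_gap]
    return max(stable, key=lambda r: r["auc_test"]) if stable else None


# Invented toy numbers, only to exercise the rule.
toy_results = [
    {"pipeline": pipelines[0], "auc_train": 1.00, "auc_val": 0.70, "auc_test": 0.90},  # gap too large
    {"pipeline": pipelines[1], "auc_train": 0.85, "auc_val": 0.80, "auc_test": 0.82},
    {"pipeline": pipelines[2], "auc_train": 0.80, "auc_val": 0.78, "auc_test": 0.75},
]
best = pick_final(toy_results)
print(len(pipelines), best["auc_test"])
```

The first toy entry has the highest test AUC but is rejected because its training AUC exceeds its validation AUC by more than 0.1, which is the overfitting pattern the rule is meant to screen out.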

Statistical Analysis
The clinical characteristics of the training and test sets were compared using the Student t test for continuous variables and the chi-squared (χ²) test for categorical variables. If a categorical variable had an expected count <10, Fisher's exact test was used instead. The aforementioned processes were executed in R software version 3.0.2 (The R Project for Statistical Computing, Vienna, Austria; http://www.rproject.org). A P value of <0.05 indicated statistical significance. The performance of models was evaluated using the ROC curve, precision-recall (PR) plot, and Matthews Correlation Coefficient (MCC). The AUC-ROC and AUC-PR were calculated for quantification. The accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were also calculated at the cutoff value that maximized the Youden index. In addition, the estimation was bootstrapped 1000 times to give the 95% confidence intervals (CIs) of AUC-ROC. The model evaluation was implemented with FeAture Explorer (14) (FAE, v0.2.5, https://github.com/salan668/FAE) on Python (3.6.8, https://www.python.org/), which is open-source software based on scikit-learn (v0.22) (15).
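The evaluation metrics named above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the FAE code: the AUC is computed as the probability that a positive case outscores a negative one, the confidence interval is assumed to be a percentile bootstrap over resampled cases, and the labels and scores are invented toy values.

```python
import random
from math import sqrt


def auc_roc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(labels, scores, cutoff):
    """Return (tp, fp, tn, fn) when score >= cutoff is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    return tp, fp, tn, fn


def youden_cutoff(labels, scores):
    """Observed score that maximizes sensitivity + specificity - 1."""
    def youden(c):
        tp, fp, tn, fn = confusion(labels, scores, c)
        return tp / (tp + fn) + tn / (tn + fp) - 1
    return max(sorted(set(scores)), key=youden)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_auc_ci(labels, scores, n_boot=1000, seed=0):
    """Percentile 95% CI of the AUC from resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(auc_roc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot) - 1]


labels = [1, 1, 1, 1, 0, 0, 0, 0]                  # 1 = malignant, 0 = benign (toy data)
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.2, 0.1]  # invented model outputs
auc = auc_roc(labels, scores)
cutoff = youden_cutoff(labels, scores)
matthews = mcc(*confusion(labels, scores, cutoff))
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(auc, cutoff, round(matthews, 3), ci_low, ci_high)
```

Sensitivity, specificity, PPV, and NPV follow directly from the four counts returned by `confusion` at the chosen cutoff.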

Clinical Characteristics
Overall, statistically significant differences were found in age (57.54 ± 10.24 vs. 49.23 ± 14.90, P < 0.001) and sex (male/female ratio 69:24 vs. 20:19, P = 0.01) between the malignant and benign groups. The clinical features of the training and test cohorts are summarized in Table 1. No significant differences in lesion diameter were found between the training and test sets. Malignant tumors were more common in upper lobes, men, and elderly people in the training set (P = 0.039, 0.04, and 0.001, respectively) and showed a similar but nonsignificant trend in the test set (P = 0.667, 0.121, and 0.13, respectively).

Comparison of Different Machine Learning Classification Models
Figure 4 shows the AUC heat map of 2D and 3D features with different machine learning methods. The number of models with AUC greater than 0.7 in both the validation and test groups was used as an evaluation index. A total of 140 models based on 3D features (n = 129) and 2D features (n = 11) showed AUC greater than 0.7 in both groups. For dimension reduction algorithms, PCA (n = 111) showed higher performance than PCC (n = 29). In terms of normalization, min-max, Z-score, and mean normalization had similar performance in 3D features, while models using only the Z-score (n = 11) showed AUC >0.7 for 2D features. For feature selection, ANOVA (n = 80) performed the best followed by RFE (n = 60), while relief had poor performance (n = 0) in the dataset. As for classifiers, SVM, LDA, LR, GP, and NB performed better for 3D features, while RF, AB, and GP performed better for 2D features (Table 2).

Model Performance
The AUC-ROC and AUC-PR of different models are shown in Table 3. The hyperparameter settings of each classifier used in the final models are provided in Supplementary Table 2.

DISCUSSION
Identifying the optimal machine learning methods is essential for stable clinical application (16). The present study provided a comprehensive and detailed assessment of machine learning approaches and explored the diagnostic value of multiple models, including 2D radiomics models, 3D radiomics models, and combinations of clinical and radiomics models based on MR T2WI, for noninvasively differentiating SPLs. Our results demonstrated that the T2WI-based radiomics model showed potential in differentiating malignant from benign SPLs. The 3D radiomics features were better than the 2D features in differentiating SPLs. The optimal machine learning methods were not consistent in different scenarios or with different features included. Both 2D and 3D features have been employed in previous radiomics research. A previous CT study suggested that 2D features were superior to 3D features in predicting the prognosis of non-small cell lung cancer (10). However, in some other studies, 3D features demonstrated better predictive performance (17,18). In the present study, fewer 2D than 3D radiomics features were extracted because 2D features were based on a single slice, thus losing the spatial information within lesions. Therefore, some features reflecting the spatial distribution of voxels became unavailable. The study found that the radiomics signature derived from 3D features outperformed the signature from 2D features, indicating that the 3D volumetric ROI contained more comprehensive information than the 2D ROI and therefore had a better diagnostic performance. Although joint 2D and 3D features showed a higher AUC in a previous study (9), they failed to show superiority over 3D features in the present study. Instead, the performance of joint features was similar to that of 3D features.
This finding suggested that the joint features contained information on both 2D and 3D features, while 2D features could not provide new information to 3D features in the present cohort. Accordingly, the classification performance of the joint model failed to show improvement. In this study, the dimensionality reduction method of PCA was better than PCC, probably because the features were linearly uncorrelated after PCA. Therefore, information could be expressed with fewer features, and thus the performance was stable. However, PCA made variables less interpretable. Therefore, PCA was not suitable for cases where feature "interpretability" was emphasized. In addition, the three normalization methods (min-max, Z-score, and mean normalization) showed little difference in 3D features, indicating that all three can be used as effective normalization methods.
Our results showed that the feature selection methods ANOVA and RFE and the classifiers LR, LDA, SVM, and GP yielded relatively better diagnostic performance for 3D features compared with other methods. This is consistent with the optimal machine learning approaches reported in some previous studies (14,19). Song et al. (14) reported that ANOVA feature selection and an LDA classifier yielded the highest AUC in classifying clinically significant prostate cancer (CS PCa) and non-CS PCa. Wang et al. (19) reported that RFE combined with SVM performed the best in distinguishing benign and malignant pulmonary lesions. These reports also suggested that the optimal machine learning methods were not consistent in various scenarios. Besides, we found that the optimal machine learning strategy was nonunique and differed among 2D, 3D, and joint features, indicating that the optimal method might vary depending on the features included.
The lesion diameter did not differ between benign and malignant groups. This was probably due to the inclusion of inflammatory lesions in the benign group, which could be patchy and thus have a large diameter. Besides, the results showed that elderly patients and men were predisposed to lung malignancies, which was consistent with the findings of a previous study (20). Integrating these clinical data into the radiomics model further increased the accuracy of the model, indicating that clinical and radiomics features contained complementary information needed for differential diagnosis. However, the improvement after adding clinical data was not significant in the present cohort. This might be because the difference in sex and age between malignant and benign groups was not significant in the test dataset. The radiomics models (especially with 3D features) still performed well under such circumstances, which suggested that the radiomics models had the potential to differentiate SPLs in patients with an atypical clinical history (e.g., lung cancer in young patients). The present study found that some models had a very high AUC in the training group but a relatively lower AUC in the validation or test group (e.g., the classifiers RF and AB achieved an AUC of 1.0 in the training group). This could be attributed to overfitting and should be avoided. This finding confirmed that it was necessary to have independent datasets to test the real performance of the developed radiomics model. Based on experience, it was preferred that the difference in AUC between the training and validation or test groups was less than 0.1, indicating that the model was more stable. In the present study, the final models had good discrimination efficiency in the training cohort and showed similar performance in the validation and test groups, suggesting that the models developed in this study were robust.
Previous studies conducted on CT demonstrated that the radiomics model was helpful in distinguishing pulmonary lesions (8,21–23). Chen et al. (23) found that the accuracy of the radiomics signature in benign or malignant classification was 84% with a sensitivity of 92.85% and a specificity of 72.73%. Choi et al. (22) developed a radiomics model with an accuracy of 84.6%, which was 12.4% higher than that of lung-RADS. The performance of the proposed model in the present study was relatively satisfactory and promising, and was similar to that reported in CT radiomics studies. MRI showed great potential in lung application with the advantages of being radiation-free and multiparametric. Recently, pulmonary nodule characterization using MR has been recommended for clinical use (24). Nonetheless, the application of MR radiomics in assessing lung diseases is in the initial stage. Therefore, the potential clinical significance of our research is twofold. First, it provides a noninvasive method that may help to increase the accuracy of routine MR sequences in the differentiation of SPLs. Second, it lays a theoretical basis for more clinical applications of MR radiomics in the lung in the future.
This study had several limitations. First, the retrospective study design was subject to potential selection bias. Second, a PCC threshold of 0.9 has been used in previous studies to screen radiomics features (25,26). In this study, we used a threshold value lower than 0.9 to filter out redundant features more rigorously. However, the seemingly arbitrary value of 0.86 was used because, in the software we used, the PCC threshold can only be set to its default of 0.86. Third, the study included only T2WI to reduce the impact of parametric changes because its scanning parameters remained consistent in all patients while those of other sequences did not. However, other sequences such as contrast-enhanced T1-weighted imaging, diffusion-weighted imaging, and ultrashort TE MRI (27) could provide valuable information and should be included in future studies. Fourth, lesion segmentation was not automatic in the present study and thus could be susceptible to potential human error. Finally, the sample size was relatively small. Therefore, a study with a larger sample size and external validation from another institute is needed.
In conclusion, the present study developed and validated a radiomics model based on T2WI that might serve as a promising tool for noninvasive discrimination of SPLs. The 3D features were better than 2D features in differentiating SPLs and performed well in populations with different clinical characteristics. Therefore, 3D segmentations are recommended for further MR radiomics research. Combining radiomics features with clinical data could further improve model performance. Nonetheless, the optimal machine learning method might not be consistent in different scenarios or with different features involved.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the First Affiliated Hospital of Guangzhou Medical University. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements. Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
QW and XL conceived the project. QW and JZ completed the article writing. XX, JH, PW, and YP completed data collection and statistical analysis. QW, TZ, and JS completed the Radiomics analysis and chart making. YS and GY provided technique support. All authors contributed to the article and approved the submitted version.