Advancing NSCLC pathological subtype prediction with interpretable machine learning: a comprehensive radiomics-based approach

Objective: This research aims to develop and assess the performance of interpretable machine learning models for diagnosing three histological subtypes of non-small cell lung cancer (NSCLC) from CT imaging data.
Methods: A retrospective cohort of 317 patients diagnosed with NSCLC was included in the study. These patients were randomly divided into a training set of 222 patients and a validation set of 95 patients, following a 7:3 ratio. A total of 1,834 radiomic features were extracted. For feature selection, the Mann–Whitney U test, Spearman’s rank correlation, and one-way (univariate) logistic regression were employed. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was used. Three distinct models were designed to predict adenocarcinoma (ADC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC). Six classifiers, namely Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, eXtreme Gradient Boosting (XGB), and LightGBM, were used for model training. Model performance was assessed through accuracy metrics and the area under the receiver operating characteristic (ROC) curve (AUC). To interpret the diagnostic process, the Shapley Additive Explanations (SHAP) approach was applied.
Results: For the ADC, SCC, and LCC groups, 9, 12, and 8 key radiomic features were selected, respectively. The XGB model performed best in predicting SCC and LCC, with AUC values of 0.848 and 0.789, respectively. For ADC prediction, the Random Forest model performed best, with an AUC of 0.748.
Conclusion: The constructed machine learning models, based on CT imaging, exhibited robust predictive capabilities for the SCC, LCC, and ADC subtypes of NSCLC. These interpretable models provide substantial support for clinical decision-making.


Introduction
Lung cancer ranks among the leading causes of cancer-related death globally (1). The two major types of lung cancer are small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). The WHO divides NSCLC, which accounts for 85% of patients, into three primary categories: adenocarcinoma (40%), squamous cell carcinoma (25-30%), and large cell carcinoma (5-10%) (2, 3). Treatment and prognosis of NSCLC vary with histological subtype (4). For example, Li et al. (5) found that immune checkpoint inhibitor (ICI) monotherapy led to a noticeably greater survival rate in squamous NSCLC (SQ-NSCLC) than in non-SQ-NSCLC. In addition, Baine et al. (6) showed that SCC is independently linked to an increased risk of death, and that patients with SCC treated with stereotactic body radiotherapy (SBRT) are at elevated risk of both local and distant failure. In summary, early and accurate diagnosis of NSCLC histological subtype is essential for planning subtype-specific clinical treatment.
To date, CT-guided biopsy and postoperative pathological tissue sections remain the gold standard for diagnosing the pathological subtypes of NSCLC. These methods have certain drawbacks, however. First, they are invasive procedures that carry a risk of complications, including bleeding, air embolism, and pneumothorax (7). In a population-level retrospective study, 25.8% of patients developed complications within 3 days of transthoracic needle biopsy (TTNB); the three most common were pneumothorax (23.3%), hemorrhage, and air embolism (8). Additionally, patients who experience complications from TTNB are more likely to develop respiratory failure than those without complications and, consistent with prior research findings, tend to have longer hospital stays on average. Smokers and older patients are also more prone to postoperative complications (9). Finally, these procedures can cost patients additional money and time (10).
Compared with invasive and complex biopsies, a more convenient test is needed to help clinicians make an initial determination of pathological subtype in patients with NSCLC. The term "radiomics" refers to automated or semi-automated postprocessing techniques that analyze quantitative features extracted from imaging exams and reveal correlations between these features and clinical histology or biomarkers (11). In recent years, numerous studies have validated the great potential of machine learning combined with radiomics for accurate recognition of histological and molecular subtypes and for clinical outcome prediction (12-14). Similarly, radiomics has demonstrated comparable success in identifying lung cancer pathological subtypes in previous studies (15, 16). The great majority of previous studies predicting the pathological subtypes of NSCLC, however, have concentrated on differentiating between ADC and SCC, or among ADC, SCC, and SCLC. Few studies have addressed the recognition of ADC, SCC, and LCC, and existing work remains limited in predicting NSCLC pathological subtypes at this finer level. As medical imaging technology advances, radiomics analysis can now be performed on CT, MRI, and PET data, and the radiomics workflow remains largely consistent across these imaging modalities (17). Yet for the respiratory system, and for lung cancer in particular, CT is the most common imaging modality. With less radiation exposure than PET/CT and lower time and cost than MRI, CT retains an irreplaceable place in imaging.
Machine learning (ML) has the potential to greatly enhance the accuracy and efficiency of diagnosis across a broad spectrum of diseases, owing to its capacity for processing extensive datasets. ML algorithms are instrumental in refining radiological diagnostic procedures. By analyzing medical images and supplementary data, models that identify disease markers and patterns help clinicians achieve greater diagnostic accuracy (18). Even so, several prior studies that employed machine learning to differentiate NSCLC histological subtypes did not address data imbalance. This oversight may have skewed the final models toward the categories with more data, which could affect their accuracy (19, 20). In addition, previous studies have paid little attention to the interpretability of radiomics models, which hinders efforts to open the "black box" of machine learning.
This study aims to build machine learning models based on CT images to noninvasively and accurately predict NSCLC histological subtypes (ADC, SCC, and LCC). In doing so, we addressed data imbalance, a common challenge in medical imaging analysis, to obtain more reliable and robust model performance. Furthermore, we enhanced the transparency of our models by employing the SHAP (Shapley Additive Explanations) method. This approach not only provided a detailed interpretation of each model's predictive behavior but also showed the significance of individual features in the decision-making process, thereby contributing to a more informed and trustworthy clinical decision-making framework.

Region of interest interception
In our study, the DICOM Radiation Therapy Structure Set (RTSTRUCT) and DICOM Segmentation (SEG) files were integral for delineating the regions of interest (ROI) within the CT images. These files encapsulate the manual segmentation conducted by experienced radiation oncologists, who meticulously outlined the ROIs pertinent to the NSCLC histological subtypes.
To ensure the accuracy and reliability of the segmentation, we undertook a comprehensive verification process for each annotated ROI. This involved a rigorous review by two physicians with more than 3 years of experience in radiology (B. Kuang and M. Zhang), who cross-examined the segmented areas against the corresponding histological findings. The process was iterative, with each ROI scrutinized for precision in demarcating the tumor boundaries and for conformity to the recognized pathological characteristics of the NSCLC subtypes.

Radiomic features

Radiomic features extraction
Feature extraction was performed in Python 3.7 using the PyRadiomics software. Radiomic features extracted from CT images are categorized into geometric, intensity, and texture features, each capturing different aspects of the tumor's characteristics within the ROI. Geometric features describe the 3D shape of the ROI, including the volume, surface area, and overall spatial configuration of the tumor. They help in understanding the physical dimensions and shape irregularities of the tumor mass. Intensity features, on the other hand, describe the first-order statistical distribution of voxel intensities inside the ROI, providing insight into the density and uniformity of the tumor tissue through metrics such as mean intensity, standard deviation, skewness, and kurtosis.
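As an illustration of the intensity features named above, the following is a minimal, dependency-free sketch of how the mean, standard deviation, skewness, and kurtosis of the voxel intensities in an ROI can be computed. It is not the PyRadiomics implementation: the function name `first_order_features`, the population (divide-by-n) moments, the non-excess kurtosis convention, and the example intensity values are all assumptions made for illustration.

```python
import math


def first_order_features(voxels):
    """Mean, standard deviation, skewness and kurtosis of ROI voxel intensities."""
    n = len(voxels)
    mean = sum(voxels) / n
    # Central moments in population form (dividing by n); an assumption of this sketch.
    m2 = sum((v - mean) ** 2 for v in voxels) / n
    m3 = sum((v - mean) ** 3 for v in voxels) / n
    m4 = sum((v - mean) ** 4 for v in voxels) / n
    sd = math.sqrt(m2)
    return {
        "mean": mean,
        "sd": sd,
        # Skewness: asymmetry of the intensity histogram.
        "skewness": m3 / sd ** 3 if sd > 0 else 0.0,
        # Kurtosis in non-excess form (a normal distribution would give 3).
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
    }


# Hypothetical intensity values (in HU) for a tiny ROI.
roi = [-30, -10, 0, 10, 30]
features = first_order_features(roi)
```

For this symmetric toy ROI the skewness is zero; a real ROI contains thousands of voxels and would first be masked by the segmentation.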
Texture features go a step further by describing the second-order and higher-order spatial distribution of voxel intensities, reflecting the heterogeneity within the tumor. These features are extracted using methods such as the gray level co-occurrence matrix (GLCM), which evaluates how often pairs of pixels with specific values and in a specified spatial relationship occur in an image, and the gray level run length matrix (GLRLM), which considers the length of contiguous runs of pixels having the same gray level value. Other methods include the gray level size zone matrix (GLSZM), which assesses the distribution of different-sized zones of similar gray level values, and the neighborhood gray-tone difference matrix (NGTDM), which quantifies the difference in gray-level values between a pixel and its surrounding neighbors. Together, these texture features provide a comprehensive view of the textural patterns and complexity of the tumor, offering crucial insights into its pathological and physiological state.
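The GLCM idea described above can be made concrete with a small sketch. The code below counts how often two gray levels occur next to each other at a fixed offset in a 2D image and derives one texture feature (contrast) from the counts. It is a simplified illustration under stated assumptions: a single 2D slice, a single horizontal offset, already-discretized gray levels, and hypothetical function names `glcm` and `glcm_contrast`. Radiomics toolkits typically aggregate over several 3D directions.

```python
def glcm(image, levels, offset=(0, 1), symmetric=True):
    """Count co-occurring gray-level pairs at the given (row, col) offset."""
    rows, cols = len(image), len(image[0])
    d_row, d_col = offset
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                matrix[i][j] += 1
                if symmetric:
                    # Also count the pair in the opposite direction.
                    matrix[j][i] += 1
    return matrix


def glcm_contrast(matrix):
    """Contrast: squared gray-level difference weighted by pair probability."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Hypothetical 3x3 image with gray levels 0, 1 and 2.
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
counts = glcm(image, levels=3)
contrast = glcm_contrast(counts)
```

A homogeneous region concentrates counts on the diagonal of the matrix and yields a low contrast, whereas a heterogeneous region spreads counts away from the diagonal.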

Radiomic features selection
For radiomic feature selection, we first subjected all imaging features to the Mann-Whitney U test to identify significant differences between groups. Only features that demonstrated a statistically significant difference, with a p-value less than 0.05, were retained for further analysis. Following this initial filtration, we applied the Spearman rank correlation coefficient to evaluate the interrelationships among the features. This step identified and eliminated highly correlated features, with a threshold set at a coefficient value above 0.9, to reduce redundancy and potential collinearity in the dataset. To refine the feature set further, we conducted one-way (univariate) logistic regression on the remaining features, again selecting only those with a p-value less than 0.05. This method ensured that the final set of features had both statistical significance and predictive relevance.
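The first two stages of this selection pipeline can be sketched as follows. This is an illustrative, dependency-free approximation rather than the study's code: the Mann-Whitney p-value uses the normal approximation without tie or continuity correction, the rule of keeping the first feature of each highly correlated pair is an assumption (the text does not say which member of a pair was dropped), the logistic regression stage is omitted, and all names and toy values are hypothetical. In practice, library routines such as `scipy.stats.mannwhitneyu` and `scipy.stats.spearmanr` would normally be used.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_p(group_a, group_b):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return math.erfc(abs(u - mu) / sigma / math.sqrt(2))


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def select_features(features, labels, alpha=0.05, rho_max=0.9):
    """Stage 1: keep significant features. Stage 2: drop redundant ones."""
    significant = []
    for name, values in features.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        if mann_whitney_p(pos, neg) < alpha:
            significant.append(name)
    kept = []
    for name in significant:
        # Assumption: the earlier feature of a highly correlated pair is kept.
        if all(abs(spearman(features[name], features[k])) <= rho_max for k in kept):
            kept.append(name)
    return kept


# Toy data: 6 negatives followed by 6 positives (hypothetical values).
labels = [0] * 6 + [1] * 6
features = {
    "f1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "f2": [v * v for v in range(1, 13)],            # monotone copy of f1
    "f3": [5, 1, 4, 2, 6, 3, 2, 6, 1, 5, 3, 4],     # no group difference
    "f4": [3, 1, 2, 6, 5, 4, 12, 10, 11, 7, 9, 8],  # informative, less correlated
}
selected = select_features(features, labels)
```

In this toy example `f3` fails the significance filter and `f2` is removed as redundant with `f1`, leaving `f1` and `f4`.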

Machine learning models
In this study, we developed three distinct machine learning models tailored to the pathological subtypes of NSCLC: ADC, SCC, and LCC. Each model was specifically constructed to predict the likelihood of one of these subtypes based on the radiomic features extracted from the CT images.
To ensure the robustness and accuracy of our predictions, we employed six different classifiers for training each model. These classifiers included Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), and LightGBM (LGBM). Each of these classifiers brings distinct strengths to the modeling process, such as LR's linear decision boundaries, SVM's effectiveness in high-dimensional spaces, DT's clear decision rules, RF's ensemble learning for reducing overfitting, XGB's optimized gradient boosting, and LGBM's efficiency in handling large datasets. The training process included 5-fold cross-validation on the training set to optimize the model parameters. Following this, an independent test on the validation set was conducted to evaluate the models' performance, ensuring that our predictive models were both robust and generalizable. The average area under the curve (AUC) was used to assess the accuracy of the predictive models.
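The cross-validation and AUC averaging described above can be sketched as below. This is a minimal illustration, not the study's pipeline: stratifying the folds by class is an assumption (the text only says 5-fold), the AUC is computed from its rank interpretation (the probability that a positive case scores higher than a negative one), and `fit_centroid` is a deliberately trivial, hypothetical stand-in for a real classifier such as XGB or RF.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


def cross_validated_auc(X, y, fit, k=5, seed=0):
    """Mean AUC over k folds; fit(X_train, y_train) must return a scoring function."""
    aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        score = fit([X[i] for i in train], [y[i] for i in train])
        aucs.append(
            roc_auc([y[i] for i in held_out], [score(X[i]) for i in held_out])
        )
    return sum(aucs) / k


def fit_centroid(X_train, y_train):
    """Toy classifier: score is the first feature minus the midpoint of the class means."""
    pos = [x[0] for x, label in zip(X_train, y_train) if label == 1]
    neg = [x[0] for x, label in zip(X_train, y_train) if label == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x[0] - midpoint


# Hypothetical, perfectly separable data: 10 negatives then 10 positives.
X = [[v] for v in range(20)]
y = [0] * 10 + [1] * 10
folds = stratified_folds(y)
mean_auc = cross_validated_auc(X, y, fit_centroid)
```

When tuning hyperparameters, the candidate setting with the highest mean cross-validated AUC would be retained before the model is evaluated once on the held-out validation set.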

Statistical analysis
For statistical analysis, we used R software version 4.2.3. The Shapiro-Wilk test was used to evaluate the normality of the measurement data. Normally distributed data were expressed as mean ± standard deviation (SD), non-normally distributed data as medians with first and third quartiles (Q1, Q3), and categorical data as counts and percentages (n, %). For comparing measurement data, we used the independent samples t-test for normally distributed data and the Mann-Whitney U test for non-normally distributed data. The chi-squared (χ²) test was applied for comparing count data. Statistical significance was established at a p-value of less than 0.05. The predictive efficacy of the models was assessed using receiver operating characteristic (ROC) curves and accuracy measurements.
The entire dataset of 317 samples was randomly divided into training and validation sets in a 7:3 ratio. To mitigate class imbalance, we implemented the Synthetic Minority Over-sampling Technique (SMOTE). This method augments the minority class by synthesizing new samples, thereby balancing the class distribution. The balanced dataset was then used to enhance the performance of our classification models (22). The prediction model with the highest AUC was selected as the optimal choice. ROC curves were used to evaluate the performance of the various models in predicting NSCLC pathological subtypes. In addition, we visualized the model results through SHAP analysis, in which the importance of each radiomic feature was ranked.
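The split and the oversampling step can be sketched as follows. This is a simplified illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the implementation used in the study, for which a library such as imbalanced-learn would be the usual choice. The function names, the random seeds, and the toy minority points are hypothetical, and applying SMOTE to the training set only is an assumption, since the text does not state at which stage it was applied.

```python
import random


def train_validation_split(n_samples, train_fraction=0.7, seed=0):
    """Randomly partition sample indices into training and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# 317 patients split 7:3, as in the study.
train_idx, val_idx = train_validation_split(317)

# Hypothetical minority-class feature vectors taken from the training set only.
minority_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority_train, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling after the split keeps information from validation patients out of the training data.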

Patient baseline characteristics
In our study, we included 317 patients diagnosed with non-small cell lung cancer (NSCLC), with a mean age of 68.5 years. The distribution of histological subtypes among these patients was as follows: 51 were diagnosed with ADC, 152 with SCC, and 114 with LCC. The baseline characteristics of these patients, including demographic details, clinical stage, and other relevant clinical parameters, are listed in Table 1. These data describe the patient demographics and disease specifics and serve as a reference point for interpreting the predictive efficacy of the developed machine learning models.

Radiomic feature extraction and selection results
The radiomic analysis of the CT images from the 317 NSCLC patients resulted in the extraction of 1,834 distinct features. These features encompassed a broad spectrum of geometric, intensity, and texture characteristics, providing a comprehensive dataset for subsequent analysis.
To refine this extensive set of features and identify the most predictive ones for each NSCLC subtype, we applied a series of statistical methods. Initially, the Mann-Whitney U test was used to retain features that showed significant differences between the subtypes, ensuring that the retained features had potential diagnostic value. This was followed by Spearman's rank correlation analysis, which identified and eliminated features that were highly correlated with others, thereby reducing redundancy and focusing on features that offered unique information. For the final stage of feature selection, one-way (univariate) logistic regression was performed, further narrowing the feature set to those with significant predictive power, as indicated by p-values less than 0.05. As a result, the feature selection process identified 9 key radiomic features for the ADC group, 12 for the SCC group, and 8 for the LCC group, as shown in Table 2. These selected features represent the most relevant and informative characteristics for predicting the histological subtypes of NSCLC. To visualize the correlations among the selected features, Spearman correlation heatmaps were created for each group (Figures 1A-C).

Distinguish the LCC subtype from all other subtypes
In the construction of models to predict LCC, six algorithms were employed, resulting in varied AUC values in the training set: LR achieved an AUC of 0.702, SVM 0.805, RF 0.841, DT 0.837, XGB 0.908, and LGBM 0.834, with 95% confidence intervals ranging from 0.646 to 0.940 across the different methods. In the testing set, the algorithms showed a different pattern of AUC values, indicating varying levels of model generalization: LR presented an AUC of 0.532, SVM 0.501, RF 0.576, DT 0.748, XGB 0.789, and LGBM 0.544. The XGB model emerged as the most effective in differentiating LCC from ADC and SCC in the testing set. The performance indicators and ROC curves of the models are shown in Supplementary material 1 and Figure 3.

Distinguish the ADC subtype from all other subtypes
To identify the ADC subtype, models were developed using six algorithms, yielding AUC values in the training set as follows: LR 0.782, SVM 0.790, RF 0.855, DT 0.814, XGB 0.733, and LGBM 0.762, with confidence intervals ranging from 0.670 to 0.901. In the testing set, AUC values were 0.551 for LR, 0.658 for SVM, 0.748 for RF, 0.597 for DT, 0.616 for XGB, and 0.620 for LGBM, indicating varying levels of predictive performance. The RF model stood out for its ability to differentiate ADC, achieving an AUC of 0.748 in the testing set, thus proving to be the most effective among the algorithms tested for this subtype (Supplementary material 1 and Figure 4).
FIGURE 1 The feature correlation matrix obtained by performing Spearman's rank correlation analysis in the SCC (A), LCC (B), and ADC (C) groups.

Model explanation
The XGB model, which gave the best prediction in differentiating SCC from the other two subtypes, was used to demonstrate the SHAP plots. Figure 5A depicts the importance ranking of the 12 features. The two most important features were log sigma 1.0 mm 3D firstorder 90Percentile and lbp 3D m2 glszm LowGrayLevelZoneEmphasis. The decision-making process of the XGB model for two patients is depicted using SHAP force plots (Figures 5B, C). The score calculation begins with the base value E[f(x)] and then sums the SHAP values, with yellow representing an increased probability and purple a decreased probability of squamous cell carcinoma, ending with the individual prediction. Similarly, the explanatory display for the LCC group is shown in Figures 5D-F. (Note that because the RF algorithm could not be displayed with the same SHAP and force plots as above, the ADC group is not shown.)
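The additive structure described above (a base value E[f(x)] plus per-feature SHAP values giving the individual prediction) can be checked on a toy example. The sketch below computes exact Shapley values by enumerating all feature subsets, which is feasible only for a handful of features; it is not how the SHAP library treats tree models, and the model, the patient vector `x`, and the background samples are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values; unknown features take background values and are averaged."""
    n = len(x)

    def value(subset):
        # Expected model output when only the features in `subset` are known.
        total = 0.0
        for ref in background:
            mixed = [x[i] if i in subset else ref[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return value(set()), phi


def toy_model(f):
    """Hypothetical model: one additive feature and one interaction term."""
    return 2 * f[0] + f[1] * f[2]


background = [[0, 0, 0], [1, 1, 1], [2, 0, 1]]  # hypothetical reference samples
x = [3, 2, 1]                                   # hypothetical patient
base_value, phi = shapley_values(toy_model, x, background)
prediction = base_value + sum(phi)
```

The final line reproduces the reading of a force plot: starting from the base value, each feature pushes the output up or down by its SHAP value until the individual prediction `toy_model(x)` is reached.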

Discussion
Accurate identification of the histological subtype of patients with NSCLC has important implications for both clinical therapeutic approaches and patient prognosis. In our study, we developed CT image-based ML models for distinguishing the pathological subtypes of NSCLC and compared the performance of each ML model. Ultimately, our models can assist clinicians in diagnosis and point them to the imaging features that are important in distinguishing NSCLC pathological subtypes. The features selected for model building were first-order and texture features, as were the two most important features for distinguishing SCC from the other two subtypes in the XGB model, which is consistent with previous studies (20). CT images have good spatial resolution and can represent different tissue structures in grayscale, while the texture of CT images may be related to the heterogeneity of the tumor and may predict its biological behavior (23). The XGB model performed best in differentiating both SCC and LCC, reaching AUCs of 0.848 and 0.789, respectively. The RF model outperformed the other five models in discriminating between ADC and non-ADC, with an AUC of 0.748. There may be several reasons for this difference; one is that no single algorithm performs best on all datasets (24). The above evidence suggests that our models are reliable.
Many imaging-based studies now aim to differentiate histological types of lung cancer, and numerous studies have shown that imaging features can be used to identify the histological type of NSCLC patients. Our study built three prediction models for ADC, SCC, and LCC with AUCs of 0.748, 0.848, and 0.789, respectively, and further demonstrated the feature importance ranking of the models as well as their decision-making process. To date, published studies have focused on distinguishing ADC from SCC or SCLC. A study by Zhao et al. (30) established a Mobilenet v2 model for discriminating ADC and SCC, yielding an AUC of 0.767, slightly lower than ours. In this study, we constructed three models to predict ADC, SCC, and LCC, respectively, whereas very few prediction models involving the LCC subtype have previously been developed. Consequently, our study contributes to a more refined and precise prediction of pathological subtypes in NSCLC. Moreover, few previous studies have dealt with data imbalance. The study by Lin et al. (19), which did not address data imbalance, built a model with an AUC of only 0.700, considerably lower than that of our models. Our approach therefore offers a basis for developing more accurate models in the future.
In addition, owing to the black-box nature of machine learning (31), the machine learning models in previous studies lacked interpretability. Linning et al. (32) extracted features from CT images and built models to distinguish SCLC, ADC, and SCC, with AUC values of 0.822 and 0.665, respectively, but they did not further explain the importance ranking of the features or the machine learning decision-making process. We implemented the SHAP method. This technique illuminated the decision-making process of our models by providing a clear ranking of feature importance and showing how each feature influences predictions. Such transparency is crucial in a clinical context, as it builds trust and helps clinicians understand the basis for model predictions, thus facilitating informed decision-making. By revealing the contributions of individual radiomic features to the classification of NSCLC subtypes, our approach not only enhances the trustworthiness and validation of our models but also deepens the understanding of the link between radiomic characteristics and cancer pathology, paving the way for more tailored and effective treatment approaches. There are also some shortcomings in this study. Compared with the results of previous studies, we did not significantly improve the effectiveness of machine learning models based on CT features in predicting NSCLC histological subtypes. In the future, more advanced algorithms and models, such as deep learning and end-to-end modeling, could be used to predict NSCLC pathological subtypes.
The limitations of this investigation lie chiefly in the following areas. First, the data, sourced from public databases, had a relatively modest sample size and an insufficiently detailed baseline profile, which adversely affected the models' predictive capability. Improved efficacy would require a larger research cohort. Second, the data originated from a single center and the prediction models were validated on internal data only, so the findings remain preliminary until corroborated by multicenter prospective studies. Lastly, model development involved imaging features exclusively, omitting clinical details such as age and gender, which resulted in a comparatively uniform dataset. This limitation may compromise the models' generalizability in real-world clinical diagnosis.

Conclusion
This study successfully developed interpretable machine learning models using CT images to diagnose histological subtypes of NSCLC, with the XGB and RF models showing superior performance. The use of SHAP for interpretability further strengthens the clinical relevance of our models, providing insights into the decision-making process and contributing to more informed and transparent diagnostic pathways. Looking ahead, there is an opportunity to build upon this foundation by creating advanced predictive models that integrate data from multiple centers and encompass multi-omics associations.

FIGURE 2 ROC curves and corresponding AUC values in the training (A) and testing (B) sets of the SCC group.

FIGURE 3 ROC curves and corresponding AUC values in the training (A) and testing (B) sets of the LCC group.
Guo et al. (20) developed two models, the ProNet model and com_radNet, based on CT scans. The AUCs were 0.840 and 0.789, with corresponding accuracies of 71.6 and 74.7%. Zhou et al. (25) used nine machine learning classifiers to construct 45 prediction models after extracting radiomic features from PET and CT scans. AUC (0.897) for
doi: 10.3389/fmed.2024.1413990

FIGURE 4 ROC curves and corresponding AUC values in the training (A) and testing (B) sets of the ADC group.

FIGURE 5 Model interpretability display. (A, D) SHAP plots for the SCC and LCC groups, respectively; (B, C) radiomics feature force plots for two random patients in the SCC group (predicted as SCC and non-SCC, respectively); (E, F) radiomics feature force plots for two random patients in the LCC group (predicted as LCC and non-LCC, respectively).

TABLE 1 The patients' clinical baseline data and statistical results.

TABLE 2 The key radiomic features for the SCC, ADC, and LCC groups.
Spearman correlation heatmaps were created for each group. These heatmaps (Figures 1A-C) provide a graphical illustration of the feature correlations, with color intensity indicating the strength and direction of each correlation.