Application value of the automated machine learning model based on modified CT index combined with serological indices in the early prediction of lung cancer

Background and objective: Accurately predicting the extent of lung tumor infiltration is crucial for improving patient survival and cure rates. This study evaluates the value of an improved CT index, obtained from an artificial intelligence recognition system that analyzes the CT features of pulmonary nodules, combined with serum biomarkers for the early prediction of lung cancer infiltration using machine learning models.
Patients and methods: A retrospective analysis was conducted on clinical data of 803 patients hospitalized for lung cancer treatment from January 2020 to December 2023 at two hospitals: Hospital 1 (Affiliated Changshu Hospital of Soochow University) and Hospital 2 (Nantong Eighth People's Hospital). Data from Hospital 1 were used for internal training, while data from Hospital 2 were used for external validation. Traditional logistic regression (LR) and five machine learning algorithms (generalized linear model [GLM], random forest [RF], gradient boosting machine [GBM], deep neural network [DL], and naive Bayes [NB]) were used to construct and compare models predicting early lung cancer infiltration. The models were evaluated using receiver operating characteristic (ROC) curve analysis with the area under the curve (AUC), calibration curves, and decision curve analysis (DCA) for the LR-based model, as well as global and individual interpretability analyses using variable importance and SHapley Additive exPlanations (SHAP) plots.
Results: A total of 560 patients were used for model development in the training dataset, and 243 patients were used for external validation. The GBM model performed best among the five machine learning algorithms, with AUCs of 0.931 and 0.99 and accuracies of 0.857 and 0.955 in the validation and test sets, respectively. Nodule diameter and average CT value were the most important features for predicting lung cancer infiltration.
Conclusion: The GBM model established in this study can effectively predict the risk of infiltration in early-stage lung cancer patients. It may improve the accuracy of lung cancer screening and help clinicians intervene promptly in patients with infiltrative lung cancer, supporting early diagnosis and treatment and ultimately reducing lung cancer-related mortality.


Introduction
Lung cancer is globally recognized as one of the malignancies with the highest incidence and mortality rates. According to the 2022 global cancer statistics survey, an average of approximately 350 individuals die from lung cancer every day, surpassing the combined total of breast, prostate, and pancreatic cancers. In China, lung cancer deaths account for 23.8% of the total cancer-related deaths, with the incidence and mortality rates ranking highest globally (1). Due to factors such as existing medical conditions and awareness of check-ups, many patients are diagnosed with late-stage lung cancer at their initial medical visit. Effective treatment options for late-stage lung cancer are limited, with a 5-year cumulative survival rate of only 19% (2). Early screening significantly improves the prognosis and survival of lung cancer patients (3), so early screening and diagnosis are key to reducing lung cancer mortality and improving survival.
Currently, there is a lack of effective early screening methods, with emphasis placed on low-dose spiral computed tomography (LDCT) scans, biological tumor markers, and tumor autoantibody screening (4). However, these methods suffer from drawbacks such as high false positive rates, inadequate sensitivity, and suboptimal accuracy. Therefore, we attempt to accurately predict tumor malignancy and infiltration depth using an improved CT index obtained through artificial intelligence recognition technology combined with serum biomarkers consisting of lung cancer autoantibodies and tumor markers. This approach aims to assist clinicians in making more informed treatment decisions and improving patient survival benefits.
Machine learning, as a subset of artificial intelligence, has shown remarkable prospects in various fields such as economics, finance, business management, and bioinformatics. In the healthcare sector, it demonstrates outstanding applications in analyzing disease-related factors, predicting risks, and computer-aided diagnosis (5)(6)(7). Automated machine learning (AutoML) automates the application of machine learning to data by iteratively transforming data, selecting machine learning algorithms, and optimizing hyperparameters to choose the best model.
The aim of this study is to evaluate the predictive value of an improved CT index combined with serum biomarkers using a GBM model for early diagnosis of lung cancer. Clinical data from lung cancer patients from two hospitals were collected, and training, validation, and testing were conducted using the H2O AutoML platform. The performance of the GBM model was compared with traditional logistic regression (LR) to assess its efficacy.

Inclusion and exclusion criteria
We retrospectively collected and analyzed data from patients who underwent lung cancer surgery at the Affiliated Changshu Hospital of Soochow University and Nantong Eighth People's Hospital from January 2020 to December 2023. Patients collected from January 2020 to December 2023 at the Affiliated Changshu Hospital of Soochow University were used as the training set, while patients collected from October 2022 to December 2023 at Nantong Eighth People's Hospital were used as the testing set.
The diagnostic criteria for lung cancer were referenced from the 2021 Fifth Edition of the WHO Classification of Thoracic Tumors (8). Inclusion required meeting the following criteria: (1) confirmation of lung nodules by chest CT without any clinical or drug intervention; (2) definitive pathological results confirming benign or malignant nodules after chest CT; (3) age ≥ 18 years; (4) preoperative testing for 7 lung cancer autoantibodies and tumor markers; (5) absence of significant dysfunction in other major organs; (6) absence of other primary malignant tumors; and (7) lung nodule diameter ≤ 3 cm. Exclusion criteria included: absence of pathological examination despite confirmed lung nodules on chest CT; failure to undergo testing for the 7 lung cancer autoantibodies and tumor markers; clinical or drug intervention prior to blood sampling; presence of rheumatic immunological diseases; lung metastasis from other tumors; and lung nodule diameter > 3 cm. This study was approved by the hospital ethics committee.

Data collection
Demographic features, clinical information, and comorbidities were extracted from electronic medical records. Chest plain scans were performed using a 64-slice spiral CT scanner to obtain conventional CT imaging features, including air bronchogram sign, spiculated sign, lobulation sign, vascular penetration, pleural retraction, bronchial inflation sign, nodule diameter, and solid proportion. Each patient's CT data were then imported into the DeepRay medical image AI recognition system, which extracts quantitative features from medical images in high throughput and uses convolutional neural networks trained on nodule size, density, and solid proportion to produce the improved CT indices: the malignancy probability of the pulmonary nodule and its average CT value.
Serum biomarkers primarily included 7 tumor-associated autoantibodies (TAABs) and commonly used tumor markers recommended by the American Clinical Biochemistry Committee and the European Tumor Marker Expert Group. For TAAB detection, fasting peripheral venous blood was drawn from patients preoperatively (9-12). After centrifugation to separate serum, the levels of 7 lung cancer autoantibodies were measured using enzyme-linked immunosorbent assay (ELISA) (13), including tumor suppressor gene P53 (normal reference range: P53 < 13.09 U/mL), protein gene product PGP9.5 (normal reference range: PGP9.5 < 11.1 U/mL), SRY-box containing gene 2 (normal reference range: SOX2 < 10.26 U/mL), G antigen 7 (GAGE7) (normal reference range: GAGE7 < 14.36 U/mL), RNA helicase autoantibody 4-5 (GBU4-5) (normal reference range: GBU4-5 < 6.99 U/mL), melanoma antigen A1 (MAGEA1) (normal reference range: MAGEA1 < 11.92 U/mL), and tumor-associated gene CAGE (normal reference range: CAGE < 7.23 U/mL). TAAB detection results were considered positive if any of the indicators exceeded the normal reference range. Tumor markers were collected from blood tests and included primary lung cancer markers such as vascular endothelial growth factor (VEGF), carcinoembryonic antigen (CEA), neuron-specific enolase (NSE), cytokeratin fragment 19 (CYFRA21-1), pro-gastrin-releasing peptide (ProGRP), and squamous cell carcinoma antigen (SCC) (14).
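The positivity rule for the 7-autoantibody panel can be sketched in a few lines. The cutoffs below are the reference limits quoted in the text; the function name, the dictionary layout, and the patient values are hypothetical, and treating a value exactly equal to the cutoff as positive is an assumption, since the text only gives the normal range as "< cutoff".

```python
# Upper limits of the normal reference ranges (U/mL), as listed in the text.
TAAB_CUTOFFS = {
    "P53": 13.09,
    "PGP9.5": 11.1,
    "SOX2": 10.26,
    "GAGE7": 14.36,
    "GBU4-5": 6.99,
    "MAGEA1": 11.92,
    "CAGE": 7.23,
}


def taab_positive(levels):
    """Return True if any autoantibody level falls outside its normal range."""
    missing = sorted(set(TAAB_CUTOFFS) - set(levels))
    if missing:
        raise ValueError(f"missing autoantibody levels: {missing}")
    # Assumption: a value equal to the cutoff is outside the "< cutoff" range.
    return any(levels[name] >= cutoff for name, cutoff in TAAB_CUTOFFS.items())


# Hypothetical patient values, for illustration only.
patient = {"P53": 2.1, "PGP9.5": 0.8, "SOX2": 12.4, "GAGE7": 3.0,
           "GBU4-5": 1.2, "MAGEA1": 0.5, "CAGE": 4.4}
print(taab_positive(patient))  # SOX2 is above its cutoff, so the panel is positive
```

A single elevated marker is enough to flag the panel, which is why the combined test is more sensitive than any one autoantibody but also more prone to false positives.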

Automated machine learning
Through the AI platform, the H2O package is installed in the R language to implement AutoML analysis. Autonomy and automation are achieved through three aspects: feature selection, model construction, and hyperparameter optimization. The integrated algorithms include Generalized Linear Models (GLM), Random Forests (RF), Gradient Boosting Machines (GBM), Deep Neural Networks (DL), and Naive Bayes (NB), among others. The training set is split into development and validation sets in a 6:4 ratio, and blind verification is conducted with the testing set to evaluate the average accuracy and stability of the models. A confusion matrix consisting of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) is established (15). Performance metrics including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR−), accuracy, area under the receiver operating characteristic curve (AUC), and the F1-Measure are calculated. The formulas are as follows: Sensitivity = TP/(TP + FN); Specificity = TN/(TN + FP); Accuracy = (TP + TN)/(TP + FP + FN + TN); PPV = TP/(TP + FP); NPV = TN/(TN + FN); LR+ = Sensitivity/(1 − Specificity); LR− = (1 − Sensitivity)/Specificity; F1-Measure = (2 × precision × recall)/(precision + recall), where precision is the PPV and recall is the sensitivity. Through SHAP analysis (SHapley Additive exPlanations), an additive explanatory model is constructed to determine significant factors influencing model predictions and their contributions to model performance.
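As a minimal sketch of how the confusion-matrix metrics listed above relate to one another, the following Python function derives each of them from the four cell counts. The class and function names and the example counts are hypothetical and are not taken from the study data; precision is taken to be the PPV and recall the sensitivity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def _safe_div(num, den):
    """Divide, returning inf (or nan for 0/0) instead of raising on a zero denominator."""
    if den == 0:
        return math.nan if num == 0 else math.inf
    return num / den


def diagnostic_metrics(cm):
    """Compute the performance metrics defined in the text from a confusion matrix."""
    sensitivity = _safe_div(cm.tp, cm.tp + cm.fn)  # recall
    specificity = _safe_div(cm.tn, cm.tn + cm.fp)
    ppv = _safe_div(cm.tp, cm.tp + cm.fp)          # precision
    npv = _safe_div(cm.tn, cm.tn + cm.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": _safe_div(cm.tp + cm.tn, cm.tp + cm.tn + cm.fp + cm.fn),
        "lr_pos": _safe_div(sensitivity, 1 - specificity),
        "lr_neg": _safe_div(1 - sensitivity, specificity),
        "f1": _safe_div(2 * ppv * sensitivity, ppv + sensitivity),
    }


# Hypothetical counts, for illustration only.
for name, value in diagnostic_metrics(ConfusionMatrix(tp=90, tn=80, fp=20, fn=10)).items():
    print(f"{name}: {value:.3f}")
```

Returning infinity rather than raising keeps LR+ defined for a classifier with perfect specificity, which is a design choice of this sketch rather than something specified in the text.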

Statistical analysis
For continuous data, the Shapiro-Wilk test and a homogeneity of variance test were first performed. For normally distributed and homoscedastic continuous data, independent samples t-tests were employed, and results were presented as mean ± standard deviation. For non-normally distributed or heteroscedastic continuous data, the Wilcoxon rank-sum test was used, and results were presented as median (M25, M75). Categorical data were expressed as frequencies and percentages, and inter-group differences were assessed using the chi-square test or Fisher's exact test. To prevent multicollinearity among variables, feature selection was conducted using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model. Based on the selected variables, a binary logistic regression model was fitted. The predictive performance of the obtained model was evaluated using the area under the receiver operating characteristic curve (AUC), calibration curve, and decision curve analysis (DCA), and a nomogram was constructed. The statistical significance level was set at p < 0.05. All statistical analyses were performed using R 4.3.3 software.
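Decision curve analysis summarizes clinical utility as a net benefit at each threshold probability. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability; the function name and the toy labels and probabilities are hypothetical and do not come from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy data, for illustration only (1 = infiltrative, 0 = non-infiltrative).
labels = [1, 1, 1, 0, 0, 0, 1, 0]
risks = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

model_nb = net_benefit(labels, risks, threshold=0.5)
treat_all_nb = net_benefit(labels, [1.0] * len(labels), threshold=0.5)
print(model_nb, treat_all_nb)  # "treat none" always has a net benefit of 0
```

A decision curve is obtained by evaluating this quantity over a grid of thresholds and comparing the model against the "treat all" and "treat none" strategies.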

Baseline characteristics
A total of 803 lung cancer patients were included in this study, with 376 cases (47.0%) exhibiting infiltrative lesions. The study protocol is detailed in Figure 1. Among them, 560 patients from the Affiliated Changshu Hospital of Soochow University (Hospital 1) were included in the training set, and 243 patients from Nantong Eighth People's Hospital (Hospital 2) formed the testing set. In the training set, 64.3% (360/560) were male and 35.7% (200/560) were female, with a median age of 55 years. In the testing set, females were more common in the infiltrative group, and incidence peaked in the 40-60 years age range, consistent with previous reports (16). There were no statistically significant differences between the two groups in terms of age, CYFRA21-1, NSE, and lobulation sign (Leafing) (p > 0.05). Details are shown in Table 1.

Model construction and predictive performance comparison

LASSO regression feature screening and LR model construction
Considering the potential issue of multicollinearity among variables, we employed the LASSO regression model, which introduces an L1 regularization coefficient (lambda). Through 10-fold cross-validation, we obtained the optimal lambda under the minimum criterion and selected 8 of the 19 candidate variables as independent risk factors. These variables were VEGF, TAABs, malignancy probability, average CT value, nodule diameter, solid proportion, gender, and pleural retraction, as shown in Figure 2.
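For reference, the L1-penalized objective behind this screening step can be written out. The formulation below assumes a binomial (logistic) LASSO, which is the usual choice for a binary outcome but is not stated explicitly in the text, and it assumes that lambda is chosen to minimize the mean cross-validated deviance over K = 10 folds. Here p = 19 is the number of candidate variables and Dev_k denotes the deviance on the k-th held-out fold.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
\left\{ -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\hat{\lambda} = \arg\min_{\lambda}\; \frac{1}{K}\sum_{k=1}^{K} \mathrm{Dev}_k(\lambda),
\quad K = 10 .
```

Because the penalty is on the absolute values of the coefficients, sufficiently large values of lambda shrink some coefficients exactly to zero, which is what reduces the 19 candidates to the 8 retained variables.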
The selected features were fitted to construct a serum-modified CT index model, and a nomogram was generated to score the features (see Figure 3). The total score obtained by summing the scores of each feature allows estimation of the probability of developing infiltrative lesions in lung cancer. The study showed that when the total nomogram score for lung cancer infiltrative lesions exceeds 180, the risk of lesions is over 90%.
To further analyze the stability and clinical utility of the serum-modified CT index model, we compared the serum-modified CT index model with conventional imaging omics models and serum-imaging omics models in both the training and testing sets using ROC

Machine learning model construction and performance comparison
Using the H2O AutoML platform, automatic training and adjustment of models were conducted within a 5 min time limit, resulting in the construction of 75 models. However, due to limited interpretability and the presence of stacked ensemble models, these models were simplified, and the main algorithms involved were extracted, including Generalized Linear Model (GLM), Random Forest (RF), Gradient Boosting Machine (GBM), Deep Neural Network (DL), and Naive Bayes (NB). Among these models, the GBM model outperformed the others, achieving the highest values for AUC, accuracy, and F1-Measure on both validation and testing sets, and hence was considered the optimal model. As shown in Table 2, on the validation and testing sets, the AUC values obtained by the GBM algorithm were higher than those obtained by the GLM, RF, DL, and NB algorithms, with values of (0.931, 0.99) compared to (0.917, 0.942), (0.918, 0.986), (0.901, 0.948), and (0.908, 0.944), respectively. Furthermore, compared to the GLM, RF, DL, and NB algorithms, the GBM algorithm also achieved the highest accuracy, with values of (0.857, 0.955) compared to (0.854, 0.864), (0.838, 0.947), (0.819, 0.877), and (0.844, 0.889), respectively. The RF model exhibited the highest sensitivity in both the validation and testing sets, with values of 0.914 and 0.991, respectively. Both RF and GLM models demonstrated good performance in terms of AUC, sensitivity, specificity, and accuracy.
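The comparison above can be checked mechanically. The sketch below tabulates the AUC and accuracy pairs quoted in this paragraph and ranks the algorithms on each metric. The container and function names are hypothetical, and the assignment of the accuracy pairs to models assumes the order GBM, GLM, RF, DL, NB, which is a reading of the accuracy list in the text rather than something confirmed against Table 2.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    auc_validation: float
    auc_test: float
    acc_validation: float
    acc_test: float


# Values quoted in the text; the accuracy-to-model pairing is an assumption.
REPORTED = {
    "GBM": Scores(0.931, 0.990, 0.857, 0.955),
    "GLM": Scores(0.917, 0.942, 0.854, 0.864),
    "RF": Scores(0.918, 0.986, 0.838, 0.947),
    "DL": Scores(0.901, 0.948, 0.819, 0.877),
    "NB": Scores(0.908, 0.944, 0.844, 0.889),
}


def rank_models(scores, metric):
    """Return model names sorted from best to worst on one metric."""
    return sorted(scores, key=lambda name: getattr(scores[name], metric), reverse=True)


def best_on_every_metric(scores):
    """Return the model that tops every metric, or None if no single model does."""
    leaders = {rank_models(scores, metric)[0] for metric in Scores._fields}
    return leaders.pop() if len(leaders) == 1 else None


for metric in Scores._fields:
    print(metric, rank_models(REPORTED, metric))
print("best overall:", best_on_every_metric(REPORTED))
```

Ranking on each metric separately makes it easy to see where the margin is narrow, for example between GBM and RF on the testing-set AUC.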

Overall feature interpretability analysis
Figure 7 shows that nodule diameter is the most important feature, followed by average CT value, solid proportion, NSE, VEGF, CYFRA21-1, SCC, malignancy probability, CEA, and ProGRP. Additionally, nodule diameter, average CT value, malignancy probability, solid proportion, and VEGF were identified as important feature variables shared by both the GBM and logistic regression models. Figure 8, the SHAP summary plot, displays the impact of all features on the predictions of the GBM model in the testing set. The x-axis represents the SHAP values, indicating the contribution of each feature to the overall prediction. A SHAP value greater than 0 indicates a positive contribution, that is, the feature increases the predicted likelihood of infiltration. For example, on the SHAP plot row corresponding to nodule diameter, red points are mainly located to the right of the zero axis, while blue points are mostly on the left, suggesting that as the nodule diameter increases, the likelihood of infiltrative lesions in lung nodules also increases.
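To make the additive reading of SHAP values concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating all feature subsets. Everything here is illustrative: the toy model and its coefficients are invented, features outside a subset are replaced by a single hypothetical baseline patient (a simplification of how SHAP averages over background data), and this is not the tree-based estimator that a GBM explanation would normally use.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline` for a dict-based model."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in the subset take the instance value; the rest take the baseline value.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


def toy_model(x):
    # Hypothetical score with one interaction term; the coefficients are made up.
    return (0.1 * x["diameter_mm"] + 0.001 * x["mean_ct_hu"]
            + 1.0 * x["malignancy_prob"]
            + 0.05 * x["diameter_mm"] * x["malignancy_prob"])


instance = {"diameter_mm": 22.0, "mean_ct_hu": -525.0, "malignancy_prob": 0.86}
baseline = {"diameter_mm": 10.0, "mean_ct_hu": -600.0, "malignancy_prob": 0.50}

phi = shapley_values(toy_model, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the score for the instance and the score for the baseline, which is the additivity property that lets a summary plot place every feature on a common axis around zero.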

Individual feature interpretability analysis
As shown in Figure 9, partial dependence plots illustrate the impact of individual features on the final discrimination of the GBM model and their distribution in the dataset. Nodule diameter, malignancy probability, and VEGF are positively correlated with the likelihood of infiltrative lesions. Nodule diameter is mainly distributed below 15 mm, but for lung cancer patients falling between 15 and 18 mm, there is a higher likelihood of infiltrative lesions, necessitating regular follow-up.
As the average CT value gradually increases, it tends to indicate non-invasive lung cancer, particularly in patients with values above −200 HU, essentially ruling out the possibility of infiltrative lung cancer. The SHAP explanation illustrates the feature contributions for specific instances. As depicted in Figure 10, for instance 72, with a nodule diameter of 22 mm, average CT value of −525 HU, and malignancy probability of 86%, these factors significantly contribute to the model's final determination of infiltrative lung cancer. Conversely, in instance 98, although the nodule diameter is below 15 mm, infiltrative lung cancer is predicted based on factors such as average CT value, NSE value, and malignancy probability.

Discussion
Lung cancer ranks among the most prevalent and fatal malignancies globally, with adenocarcinoma being the most common histological subtype. Accurate differentiation between non-invasive and invasive lung cancer significantly impacts patient prognosis and survival. Therefore, constructing early lung cancer
Feature interpretability analysis results show that the most crucial feature of the GBM model is nodule diameter, consistent with the results of the logistic regression model in this study and the risk factors for lung nodule benignity/malignancy reported in related studies (22,23). Other researchers have pointed out that as nodule diameter increases, the likelihood of malignancy also increases. For instance, nodules below 5 mm have a malignancy rate of only 1%, while those between 5 and 10 mm have a malignancy rate of 25% (24). In this study, we found that nodules larger than 15 mm have a higher malignancy probability, particularly between 15 and 18 mm, where infiltration is more likely to occur. Therefore, clinicians should pay close attention to patients with nodules larger than 15 mm and shorten their follow-up intervals. This finding is consistent with other research (25,26).
With the development and application of artificial intelligence technology, AI-based medical imaging has been widely used in clinical diagnosis and treatment, particularly in lung cancer early screening, significantly improving lung nodule detection rates and reducing the rate of missed small lesions. This study demonstrates that AI-enhanced CT indices significantly contribute to the discrimination of infiltrative lung cancer, enhancing lesion identification accuracy. However, there are limitations. According to previous studies, although CT AI has higher positive predictive values and sensitivity, its specificity is not ideal, ranging from 70 to 80% (27)(28)(29)(30). Therefore, relying solely on radiological imaging to differentiate between benign and malignant lung nodules is too one-sided. This study established a predictive model combining AI with other laboratory indicators to improve the specificity and accuracy of lung nodule detection.
In recent years, laboratory indicators for lung cancer have mainly focused on primary lung cancer biomarkers and seven lung cancer autoantibodies. In contrast to artificial intelligence CT, these indicators have high specificity but low sensitivity when used alone. Therefore, they are typically used in combination for early lung cancer screening. Vascular endothelial growth factor (VEGF) level is an independent risk factor for lung cancer infiltration, as evidenced by its importance in both the LR and GBM models. Studies have shown that VEGF can increase vascular permeability (31-33), thereby promoting tumor metastasis, and its overexpression indicates poor prognosis in lung cancer. Therefore, patients with abnormal VEGF levels should be closely monitored, and further diagnostic and clinical intervention measures should be implemented. Detection of serum lung cancer autoantibodies has some clinical decision-making value for lung cancer diagnosis (34)(35)(36). In this study, however, the difference between the non-infiltrating and infiltrating groups was statistically significant in the training set but not in the test set. This indicates that the 7-item serum lung cancer autoantibody test is not suitable for use alone in discriminating non-infiltrating from infiltrating early-stage lung cancer and needs to be combined with other indicators for prediction.
In addition, we used five different ML algorithms to construct a high-precision prediction model. The GBM model showed optimal prediction efficacy on both the test and validation sets and achieved higher AUC and accuracy than the LDCT+7-TAABs model constructed by Zhong et al. (37), which suggests that the CT metrics modified by AI are more accurate and can provide more comprehensive and high-quality information for clinically assisted diagnosis and treatment. By accurately predicting the invasiveness of early lung nodules, this approach can help patients receive earlier treatment, thereby improving survival rates and prognosis. Blind validation using a validation set and an external dataset, with larger sample sizes and higher external validity, mitigated potential biases arising from the unique circumstances of a single research center. However, our study also has some limitations. First, it only studied benign and infiltrative lung cancer categories, so the number of cases needs to be expanded to classify lung cancer further. Second, this study is retrospective, which introduces selection bias, highlighting the need for prospective studies for external validation.

Conclusion
A machine learning model for predicting infiltration in early-stage lung cancer was constructed by combining improved CT indices with serological markers and compared with alternative models, using SHAP to elucidate the clinical significance of each risk factor in predicting infiltrative lesions in early-stage lung cancer patients. The CT indices improved by artificial intelligence are closely associated with lung cancer infiltrative features, holding significant application value in future clinical research. This combination can assist clinicians in implementing early clinical interventions, providing more comprehensive information for self-screening and disease management of early-stage lung cancer patients, thereby preventing and reducing the risk of infiltration.

FIGURE 1
Roadmap for the research program.

FIGURE 7
Plot of the importance ranking of the GBM model variables in the test set.

TABLE 1
Baseline characteristics of patients in training and test groups.
curve analysis, clinical calibration curve, and clinical decision curve analysis (DCA). The conventional imaging omics model consisted of nodule diameter, solid proportion, gender, and pleural retraction. The serum-imaging omics model included VEGF, TAABs, nodule diameter, solid proportion, gender, and pleural retraction. The

TABLE 2
Comparison of AutoML model performance in predicting lung cancer infiltration in the test cohort. AUC indicates area under the curve; PPV, positive predictive value; NPV, negative predictive value; LR−, negative likelihood ratio; LR+, positive likelihood ratio; GLM, generalized linear model; RF, random forest; GBM, gradient boosting machine; DL, deep neural network; NB, naive Bayes.