Development and validation of AI models using LR and LightGBM for predicting distant metastasis in breast cancer: a dual-center study

Objective: This study aims to develop an artificial intelligence model that uses clinical blood markers, ultrasound data, and breast biopsy pathological information to predict distant metastasis in breast cancer patients.
Methods: Data from two medical centers were used. Clinical blood markers, ultrasound data, and breast biopsy pathological information were extracted and selected separately. Feature dimensionality reduction was performed using Spearman correlation and LASSO regression. Predictive models were constructed using the LR and LightGBM machine learning algorithms and validated on internal and external validation sets. Feature correlation analysis was conducted for both models.
Results: The LR model achieved AUC values of 0.892, 0.816, and 0.817 for the training, internal validation, and external validation cohorts, respectively. The LightGBM model achieved AUC values of 0.971, 0.861, and 0.890 for the same cohorts, respectively. Clinical decision curve analysis showed a superior net benefit of the LightGBM model over the LR model in predicting distant metastasis in breast cancer. Key features identified included creatine kinase isoenzyme (CK-MB) and alpha-hydroxybutyrate dehydrogenase.
Conclusion: This study developed an artificial intelligence model using clinical blood markers, ultrasound data, and pathological information to identify distant metastasis in breast cancer patients. The LightGBM model demonstrated superior predictive accuracy and clinical applicability, suggesting it is a promising tool for early diagnosis of distant metastasis in breast cancer.


Background
Breast cancer is one of the most common malignancies affecting women worldwide, posing a significant threat to women's health. By 2020, breast cancer had become one of the most frequently diagnosed cancers globally (1). While ranking fourth in terms of mortality, it showed the most significant increase in new death cases (1). In China alone, there are over 410,000 new cases of breast cancer annually, with over 110,000 associated deaths (2). The majority of deaths result from cancer metastasis, with approximately 20-30% of breast cancer patients likely to experience this occurrence (3).
Distant metastasis is a common form of recurrence and a lifelong risk for breast cancer patients (4). The sites of breast cancer metastasis are closely linked to patient survival, with the most common sites being bones, lungs, and liver (3, 5). Distant metastasis significantly diminishes the quality of life for breast cancer patients and can lead to mortality (4, 6).
Before the pathological confirmation of breast cancer metastasis through biopsy, MRI and CT scans are usually conducted to provide relevant indications (7, 8). When results from these imaging studies are inconclusive, diagnostic information may be provided by functional imaging modalities such as positron emission tomography, dynamic contrast-enhanced magnetic resonance imaging, or diffusion-weighted magnetic resonance imaging (9). The decision to conduct a series of imaging examinations for breast cancer patients entirely depends on the clinical suspicion of the physician, and even in cases of suboptimal results, expensive functional imaging studies may be required. For patients in some developing countries, the cost of multiple imaging examinations can be relatively high, resulting in a significant economic burden. Additionally, imaging examinations have certain limitations in certain situations (7).
To address these challenges, some studies have begun to explore the use of artificial intelligence (AI) technology to assist in predicting breast cancer metastasis (10-15). This AI-based approach holds promise for providing faster and more accurate diagnoses, while potentially reducing the need for expensive imaging studies, thereby alleviating the economic burden on patients. Current research primarily focuses on predicting the future risk of breast cancer metastasis (at 1, 3, or 5 years) (10, 12, 13, 15-18), while there is relatively less emphasis on diagnostic prediction of distant metastasis of breast cancer (11, 14, 19-21). In the study by Huang et al., the SEER database was used to predict bone metastasis in invasive ductal carcinoma; however, their study did not mention a validation set (11). Ma et al. developed a fusion model integrating clinical-pathological data with MRI features, which also showed promising performance (14). Similarly, Li et al. (19) utilized MRI features and clinical pathological characteristics to establish a predictive model, but they did not mention the machine learning algorithms used, nor did they validate the model with external data. Additionally, Zhao et al. (20) used the SEER database and four machine learning algorithms, namely Extreme Gradient Boosting (XGBoost), k-Nearest Neighbors (KNN), Decision Tree (DT), and Support Vector Machine (SVM), to predict the risk of distant metastasis in breast cancer, with XGBoost performing the best. Furthermore, Burak Yagin et al. (21) used genomic data from 98 breast cancer cases and several algorithms, including Light Gradient Boosting Machine (LightGBM), Categorical Boosting (CatBoost), XGBoost, Gradient Boosted Trees (GBT), and Adaptive Boosting (AdaBoost), to build a model for predicting distant metastasis in breast cancer, with LightGBM being the best performer. The above studies suggest that using AI to evaluate breast cancer metastasis before conducting relatively expensive whole-body imaging studies may help eliminate unnecessary imaging examinations.
In this study, we established AI models to identify breast cancer metastasis by integrating clinical blood markers, ultrasound data, and breast biopsy pathology. The algorithms used include not only the well-performing XGBoost and LightGBM from previous research but also AdaBoost and Logistic Regression (LR). This method not only improves the affordability and accessibility of diagnosis but also offers new avenues and possibilities for the early diagnosis of breast cancer metastasis. With ongoing technological advancements and deeper research, the application of AI in predicting breast cancer metastasis holds promise as a significant future development direction.

Patient population
This retrospective study included data from two medical centers and was approved by the institutional review boards of both centers. Inclusion criteria were as follows: (1) definitive diagnosis of de novo primary breast cancer with or without distant metastasis; (2) completion of ultrasound examination, clinical blood marker testing, and breast biopsy pathology examination before treatment (radiotherapy or chemotherapy) or surgical resection; (3) no history of hypertension, diabetes, or hyperlipidemia; (4) no history of abnormalities in liver, kidney, or cardiovascular function blood markers; (5) no history of other diseases. Exclusion criteria were as follows: (1) distant metastasis occurred after treatment (surgical resection or chemotherapy); (2) ultrasound examination not performed due to unavoidable reasons (such as breast surface dressing coverage); (3) ultrasound examination did not provide the maximum diameter of the lesion; (4) clinical blood markers did not include tumor markers (AFP, CEA, CA125, CA153, and CA199), liver function tests, kidney function tests, lipid profile, or cardiovascular function markers; (5) the biopsy pathology examination did not provide immunohistochemical results for ER, PR, HER2, or Ki67. The breast cancer cases came from two research centers: the 342 patients from one center were randomly divided into training (274 patients) and test (68 patients) cohorts at an 8:2 ratio, and the 75 patients from the other center served as an external testing set (Test1 cohort). In this study, distant metastasis occurred mainly in the bones, lungs, and liver; the detailed distribution by site is given in Table 1. The workflow of the study's model is illustrated in Figure 1.
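The 8:2 division of the first center's 342 patients can be illustrated with a minimal sketch. The source does not state how the random split was implemented, so the seed, the placeholder integer patient indices, and the simple unstratified shuffle below are all assumptions made for illustration.

```python
import random

def split_cohort(patient_ids, train_fraction=0.8, seed=0):
    """Randomly divide patient IDs into a training and a test cohort."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)           # seed is an assumption, for reproducibility only
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 342 patients from the first center, represented by placeholder indices
train_ids, test_ids = split_cohort(range(342))
print(len(train_ids), len(test_ids))           # 274 68
```

Rounding 342 × 0.8 = 273.6 to the nearest integer gives the 274/68 cohort sizes reported in the text.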

Feature extraction and selection
Features extracted from clinical blood markers included tumor markers (carcinoembryonic antigen, alpha-fetoprotein, CA125, CA153, and CA199) and liver function indicators (total bilirubin, direct bilirubin, indirect bilirubin, total protein, albumin, globulin, albumin-globulin ratio, γ-glutamyl transferase, pre-albumin, aspartate transaminase (AST), alanine transaminase (ALT), AST/ALT ratio, alkaline phosphatase, cholinesterase, and total bile acids), among others.
The extracted features underwent the following procedures. First, each feature was standardized using z-score normalization so that it had a mean of 0 and a standard deviation of 1. Next, the Spearman rank correlation coefficient was used to measure the correlation between each pair of features. When the Spearman correlation coefficient between two features was >0.9, only one of them was retained. This filtering follows a "greedy approach": at each step, the most redundant feature is removed, so as to minimize the correlation among the remaining features and thus enhance the models' generalization ability and performance. Finally, LASSO regression with L1 regularization was employed for feature dimensionality reduction. This method retains the features most strongly associated with the outcome and generates sparse models.

FIGURE 1
The workflow of LR and LightGBM models in this study.
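The standardization and Spearman filtering steps described in this section can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it interprets the greedy step as removing the most redundant feature, and the redundancy measure (mean absolute correlation with the remaining features) and the feature names are assumptions. The source specifies only the greedy strategy and the 0.9 cutoff.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize one feature to mean 0 and standard deviation 1.
    Raises ZeroDivisionError for a constant feature."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = zscore(ranks(x)), zscore(ranks(y))
    return sum(a * b for a, b in zip(rx, ry)) / len(x)

def greedy_filter(features, threshold=0.9):
    """Drop the most redundant feature until no pair exceeds the threshold."""
    kept = dict(features)
    while True:
        names = list(kept)
        rho = {(a, b): abs(spearman(kept[a], kept[b]))
               for a in names for b in names if a != b}
        if not rho or max(rho.values()) <= threshold:
            return names
        # Assumed redundancy measure: mean |rho| with all other kept features
        redundancy = {a: mean(rho[a, b] for b in names if b != a) for a in names}
        del kept[max(names, key=redundancy.get)]

# Hypothetical toy features: marker_a and marker_b are monotonically related
toy = {
    "marker_a": [1, 2, 3, 4, 5, 6],
    "marker_b": [2, 4, 6, 8, 10, 13],
    "marker_c": [3, 1, 6, 2, 5, 4],
}
selected = greedy_filter(toy)
print(selected)  # one of the two correlated markers is dropped
```

Because `marker_a` and `marker_b` have identical ranks, their Spearman coefficient is 1 and only one of them survives, while the weakly correlated `marker_c` is kept.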

Development and validation of models
LR, LightGBM, XGBoost, and AdaBoost machine learning algorithms were employed in this study to construct models with the presence or absence of distant metastasis in breast cancer as the binary outcome variable. Model construction was performed based on 5-fold cross-validation of the training set. After model construction, validation was conducted on the internal and external testing sets. Performance evaluation was conducted using metrics such as the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Subsequently, clinical decision curve analysis (DCA) was performed, reflecting the net benefit at different threshold probabilities in the training and internal and external validation sets to assess the clinical efficiency of the model.
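The evaluation metrics listed above follow from the confusion matrix and from pairwise comparison of predicted probabilities. The sketch below shows one way to compute them in plain Python; the 0.5 classification threshold and the toy labels and probabilities are assumptions for illustration and are not taken from the study.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = p >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV, and NPV.
    Assumes every denominator is non-zero (both classes present and predicted)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def auc(y_true, y_prob):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = distant metastasis, 0 = no distant metastasis
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
print(classification_metrics(y_true, y_prob))
print(auc(y_true, y_prob))  # 0.9375
```

In the toy data, 15 of the 16 positive-negative pairs are ranked correctly, which gives the AUC of 0.9375.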

Statistical analysis
The analysis of clinical baseline features was performed using SPSS software (version 25.0, IBM). For the comparison of normally distributed continuous variables with homogeneity of variance (expressed as x ± s) across multiple groups, ANOVA was used. For the comparison of non-normally distributed or heteroscedastic continuous variables (expressed as median (IQR)) across multiple groups, the Kruskal-Wallis H test was employed, while pairwise comparisons were conducted using the Mann-Whitney U test. For categorical variables (expressed as ratios), chi-square tests or Fisher's exact tests were used. A two-tailed p-value < 0.05 indicated statistical significance. Spearman rank correlation tests, z-score normalization, LR model output (displaying feature coefficients), LightGBM model feature importance output, and LASSO regression analysis were performed using Python software (version 3.7.17; http://www.python.org). ROC curves and clinical decision curves were plotted accordingly. The evaluation of the models involved AUC values, accuracy, sensitivity, specificity, PPV, NPV, and DCA, which were implemented using Python software.

Patient characteristics
This study involved 417 patients with breast cancer, all female, from two research centers. One center contributed 274 patients to the training cohort and 68 patients to the test cohort, while the other center provided 75 patients for the Test1 cohort. Differences were observed among the three cohorts in creatine kinase isoenzyme, α-hydroxybutyrate dehydrogenase, indirect bilirubin, globulin, albumin-globulin ratio, blood bicarbonate concentration, total bile acids, Na, Cl, Mg, and the maximum diameter of breast cancer lesions on ultrasound (Table 1). Pairwise comparisons revealed differences between the Training cohort and the Test1 cohort for most markers, as well as between the Test cohort and the Test1 cohort (Supplementary Table 1), consistent with the data originating from two research centers, with the Training cohort and Test cohort coming from the same center. Patient ultrasound, pathological, and clinical blood marker characteristics are summarized in Table 1.

Feature selection
The feature data were normalized, and for each pair of features with a Spearman correlation coefficient > 0.9, only one feature was retained. Dimensionality reduction was then conducted by eliminating features with zero coefficients through LASSO regression. The optimal λ value (0.0193) was determined based on the minimum mean squared error (Figure 2A), and a LASSO regression model was fitted using the optimal λ value (Figure 2B). After feature dimensionality reduction, 27 features were finally selected (Figure 2C). These features were then used as input for subsequent model building.
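The way LASSO regression drives some coefficients exactly to zero can be illustrated with a small coordinate-descent sketch. It assumes the common objective (1/2n)·||y − Xβ||² + λ·||β||₁ and uses hypothetical toy data and a hypothetical λ of 0.5; the search over candidate λ values by cross-validated mean squared error, which the study used to arrive at λ = 0.0193, is omitted here.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator; returns exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.
    X is a list of rows; features are assumed to be standardized."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # sum of feature j times the residual that ignores feature j
            norm = 0.0  # sum of squares of feature j
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta

# Hypothetical toy data: y depends strongly on feature 0 and weakly on feature 1
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]          # y = 2.0 * x0 + 0.1 * x1
names = ["strong_marker", "weak_marker"]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [name for name, b in zip(names, beta) if b != 0.0]
print(beta, selected)
```

With this penalty the weak feature's coefficient is shrunk to exactly zero and the feature is dropped, while the strong feature is kept with a coefficient shrunk from 2.0 towards 1.5. Features with zero coefficients are the ones eliminated in the dimensionality reduction step.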

Construction and validation of LR and LightGBM models
The selected features were used to construct LR, LightGBM, XGBoost, and AdaBoost models, with performance parameters shown in Table 2. The AUC values for the LR and LightGBM models in external validation were relatively high, with the ROC curve results displayed in Figures 3A and B, respectively. The AUC for the LR model in the training, test, and Test1 cohorts was 0.892 (95% CI 0.853-0.931), 0.816 (95% CI 0.715-0.917), and 0.817 (95% CI 0.722-0.913), respectively. For the LightGBM model, the AUC was 0.971 (95% CI 0.955-0.987), 0.861 (95% CI 0.775-0.948), and 0.890 (95% CI 0.818-0.962) in the training, test, and Test1 cohorts, respectively. Other performance parameters are presented in Table 2. The DCA curves for both models in the training, test, and Test1 cohorts are displayed in Supplementary Figure 1 and Figures 4A, B. The results indicate that the LightGBM model exhibited higher net benefits than the LR model at various threshold probabilities in all cohorts, suggesting superior performance in identifying breast cancer with distant metastasis.
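The net benefit plotted in a decision curve can be computed directly from predicted probabilities. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability; the toy labels and probabilities are assumptions for illustration and do not come from the study data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - fp / n * weight

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the Treat-All strategy (every patient treated as positive)."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# The Treat-None strategy has a net benefit of 0 at every threshold.

# Toy example: 1 = distant metastasis, 0 = no distant metastasis
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

for pt in (0.2, 0.5):
    print(pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
```

At a threshold of 0.2 the toy model's net benefit (0.40625) exceeds that of Treat-All (0.375), and at 0.5 it is 0.25 against 0 for both reference strategies. A decision curve repeats this comparison over a range of thresholds.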

Model feature analysis
To identify key features contributing to the prediction of distant metastasis in LR and LightGBM models, feature analysis was conducted. The results are shown in Figures 5A, B. In the LR model, the top 5 features with relatively significant impact on the outcome were α-HBDH, Ki67, ALP, maximum diameter of lesions on ultrasound, and CEA. In the LightGBM model, the top 5 features with relatively significant contributions were CK-MB, CA153, α-HBDH, apolipoprotein B, and CEA.

Discussion
In this study, we employed LR and LightGBM algorithms to construct predictive models for identifying breast cancer with distant metastasis based on clinical blood markers, ultrasound examination, and breast biopsy pathology features. The LightGBM model demonstrated superior net benefits and predictive performance compared to the LR model, as evidenced by its higher AUC values in both internal and external testing datasets. These findings suggest that our models can effectively identify breast cancer patients with distant metastasis, providing clinicians with a more efficient method for early detection and intervention. This could lead to personalized treatment plans that improve patient outcomes and quality of life.
Previous studies have typically focused on assessing future metastasis risk to predict breast cancer distant metastasis. For instance, Delpech et al. and Xu et al. developed nomograms to predict bone metastasis, with C-indices ranging from 0.69 to 0.73 and 0.705 to 0.714, respectively (10, 12). Zhang et al. used MRI and ultrasound features to develop a prognostic nomogram, achieving C-indices of 0.882 and 0.812 (15). Wang et al. utilized gene expression profiles for a nomogram predicting lung metastasis risk, with C-indices of 0.862 and 0.772 (13). Additionally, Shidi Miao et al. constructed a nomogram model using CT image features of muscles and clinicopathological features, achieving C-indices of 0.983 and 0.948 in the training and test cohorts, respectively, although it was not externally validated (18). Besides these nomogram models, Li et al. used the SEER database (2010-2019) and the XGBoost algorithm to construct a model predicting survival rates in breast cancer patients with brain metastasis (AUC around 0.8), with external validation using their center's data (AUC around 0.7) (16).
However, fewer studies have focused on diagnostic prediction models for patients with existing distant metastasis. Li et al. used radiomic features from magnetic resonance imaging (MRI) alone or combined with clinicopathological features for prediction, achieving AUC values of 0.744 and 0.763, respectively (19); however, the study did not specify the machine learning algorithms used. Huang et al. predicted bone metastasis in invasive ductal carcinoma using the SEER database, achieving an AUC of 0.907 (11). Ma et al. developed a fusion model combining clinicopathological and MRI features, achieving AUC values of 0.870 and 0.822, respectively (14). Besides the aforementioned nomogram models, other algorithms have also performed well in predicting breast cancer distant metastasis. For example, Zhao et al. used four machine learning algorithms to predict the risk of breast cancer distant metastasis, with XGBoost performing best (AUC of 0.907 in the training set and 0.754 in the validation set) (20). Burak Yagin et al. constructed a model predicting breast cancer distant metastasis using the genomic data of 98 breast cancer cases, with the LightGBM model performing best (21). This study is also based on XGBoost and LightGBM and uses clinical blood markers indicative of cardiac, hepatic, and renal function, combined with ultrasound and other clinicopathological features, to construct models validated across different centers. In our external data validation, the LightGBM model performed better, achieving an AUC of 0.890.
CK-MB was identified as one of the most important features in the LightGBM model prediction. As a creatine kinase isoenzyme, CK-MB exists mainly in the myocardium and skeletal muscles (22). Previous studies have found that the ratio of CK-MB to total CK is significantly higher in advanced malignant tumor patients compared to early-stage ones (23), suggesting an association between CK-MB and cancer progression stages. Moreover, serum CK-MB activity is significantly elevated in metastatic tumor patients compared to those with primary tumors (22). Regarding the source of elevated serum CK-MB in malignant tumor patients, studies have detected a higher proportion of CK-MB in tumor tissues of lung cancer patients, implying that the increased plasma CK-MB may originate from tumor tissues rather than myocardium and skeletal muscles (24). In our study, CK-MB played a crucial role as one of the key features in the model prediction, suggesting its importance in predicting breast cancer with distant metastasis. However, further research is needed to explore why CK-MB elevation occurs in breast cancer with distant metastasis and whether elevated CK-MB originates from tumors or other sources. α-HBDH, as an LDH isoenzyme, is significantly elevated in the serum of some malignant tumor patients and is associated with the prognosis of malignant tumors (25-27). The combined application of α-HBDH, CEA, and CA125 in the early diagnosis of breast cancer has been found to be valuable (28). CA153 is a common tumor marker with predictive ability for breast cancer distant metastasis (29). In our study, CA153 was also one of the important features in model construction.
This study has some limitations. Firstly, we only included common types of distant metastases of breast cancer, such as bone, liver, and lung metastases. This means that we did not consider other types of distant metastases, such as brain metastases and posttreatment breast cancer distant metastases. The prognosis of posttreatment metastatic breast cancer may be worse because treatment may lead to the reselection of tumor molecules, making them more invasive (30). Secondly, although our data came from two different medical centers, they were both located in the same region. Therefore, our dataset may lack sufficient representativeness and requires validation across broader geographic areas, even across multiple centers internationally. Finally, due to potential differences among different healthcare institutions or equipment, the performance of our model may vary in different environments. Therefore, our model may require more validation datasets to ensure its applicability and reliability in different clinical settings.
In conclusion, this study successfully developed and validated LR and LightGBM machine learning models based on clinical blood markers, ultrasound data, and biopsy pathology features to predict distant metastasis in breast cancer patients. Particularly, the LightGBM model exhibited higher accuracy and potential clinical application value in predicting and identifying breast cancer with distant metastasis. These tools are expected to elevate the level of clinical decision-making and prognosis assessment, potentially reducing the need for expensive or invasive imaging techniques. This study highlights the prospects of using readily available clinical blood markers and cost-effective ultrasound data for developing artificial intelligence predictive tools.

Conclusion
In conclusion, our study successfully developed and validated LR and LightGBM models using clinical blood markers, ultrasound data, and biopsy pathology features to predict distant metastasis in breast cancer patients. The LightGBM model, in particular, demonstrated higher accuracy and potential clinical utility. These models could enhance clinical decision-making and prognosis assessment, reducing reliance on expensive or invasive imaging techniques. Our findings underscore the potential of integrating readily available clinical data and machine learning for early and accurate prediction of breast cancer metastasis.

FIGURE 3
Receiver operating characteristic (ROC) curves for the Logistic Regression (A) and LightGBM (B) models, evaluated across three datasets: the training cohort, the test cohort, and an additional independent test cohort (Test1). This evaluation allows for a comparison of model performance and generalizability.

FIGURE 4
Clinical decision curve analysis (DCA) for the LR and LightGBM models in the test (A) and Test1 (B) cohorts. Treat-All: treating all cases as if they have metastatic breast cancer, regardless of whether the model predicts metastatic or non-metastatic stages. Treat-None: treating all cases as if they do not have metastatic breast cancer, regardless of whether the model predicts metastatic or non-metastatic stages. Net benefit: a measure of the practical utility of a model at different decision thresholds. A higher net benefit indicates that the model's predictions have greater value for clinical decision-making at that threshold. If the model's net benefit at a given threshold exceeds that of the "Treat-All" and "Treat-None" strategies, using the model's predictions is more beneficial than either extreme strategy at that threshold.

FIGURE 5
Feature analysis identifying key features contributing to the prediction of distant metastasis in the LR (A) and LightGBM (B) models. In panel (A), coefficients with corresponding p-values less than 0.05 are marked with an asterisk (*).

TABLE 1
Clinical blood markers, pathological, and ultrasound characteristics in the training, test, and test1 cohorts.