Development and validation of machine-learning models for the difficulty of retroperitoneal laparoscopic adrenalectomy based on radiomics

Objective: The aim was to construct and validate machine learning (ML) prediction models for the difficulty of retroperitoneal laparoscopic adrenalectomy (RPLA) based on clinical and radiomic characteristics.
Methods: Patients who had undergone RPLA at Shanxi Bethune Hospital between August 2014 and December 2020 were retrospectively enrolled and randomly split into a training set and a validation set at a ratio of 7:3. The models were constructed using the training set and validated using the validation set. In addition, 117 patients treated between January and December 2021 formed a prospective validation set. Radiomic features were extracted from regions of interest drawn with the 3D Slicer image computing platform and Python. Key features were selected through the least absolute shrinkage and selection operator (LASSO), and the radiomics score (Rad-score) was calculated. Various ML models were constructed by combining the Rad-score with clinical characteristics. The optimal models were selected based on precision, recall, the area under the curve, F1 score, calibration curve, receiver operating characteristic curve, and decision curve analysis in the training, validation, and prospective sets. Shapley Additive exPlanations (SHAP) was used to demonstrate the impact of each variable in the respective models.
Results: Comparison of seven ML models in the training, validation, and prospective sets showed that the random forest (RF) model had more stable predictive performance, while extreme gradient boosting (xGBoost) could significantly benefit patients. According to SHAP, variable importance was similar in the two models, and in both the Rad-score had the most significant impact. Clinical characteristics such as hemoglobin, age, body mass index, gender, and diabetes mellitus also influenced the difficulty.
Conclusion: This study constructed ML models for predicting the difficulty of RPLA by combining clinical and radiomic characteristics. The models can help surgeons evaluate surgical difficulty, reduce risks, and improve patient benefits.


Introduction
Adrenal tumors (ATs) are a rare type of tumor that usually occurs in the cortex or medulla of the adrenal gland (1). Depending on their type and size, these tumors can be benign or malignant (2). ATs can cause many symptoms, including high blood pressure, palpitations, headaches, insomnia, anxiety, and obesity (3). In some cases, these symptoms may be mistaken for symptoms of other diseases, so further testing is needed to establish the diagnosis (4-6).
Treatment for AT includes surgery, radiation therapy, and chemotherapy. Surgery is the most common treatment and can remove the tumor completely (7). The gold standard treatment for AT is laparoscopic surgery, which can be divided into two main approaches: transperitoneal laparoscopic adrenalectomy (TPLA) and retroperitoneal laparoscopic adrenalectomy (RPLA) (8). RPLA involves entering the retroperitoneal space laparoscopically, which avoids interference with abdominal organs and reduces surgical trauma and recovery time. Compared with traditional open surgery, this technique has fewer complications and faster recovery (6,9,10).
In the field of medicine, machine learning (ML) has wide-ranging applications (3). For example, ML can be used for medical image recognition to help doctors diagnose diseases. It can also be used to predict the health status of patients, helping doctors develop better treatment plans. In addition, ML can be used in drug development and clinical trials to speed up the development and launch of new drugs (11). In one study, ML was used to differentiate between adrenal pheochromocytoma and adrenocortical adenoma (12).
Radiomics is an emerging field of medicine that combines computer science, mathematics, and medical imaging to better understand and diagnose diseases (13,14). Radiomics analyzes large amounts of medical imaging data to extract useful information, helping doctors make more accurate diagnoses and treatment decisions (15).
This study retrospectively collected data from patients with AT who underwent RPLA at Shanxi Bethune Hospital from August 2014 to December 2020. ML was used to analyze their clinical and radiomic features and to develop predictive models for the difficulty of RPLA. The goal was to improve preoperative preparation, reduce surgical risks, and enhance patient benefits.

General information
We retrospectively collected data from patients with AT treated at Shanxi Bethune Hospital between August 2014 and December 2020. A model was established using these data and prospectively validated with AT patients treated from January 2021 to December 2021. Inclusion criteria: 1) abdominal computed tomography (CT) examination confirming the presence of an AT within 15 days before surgery, 2) preoperative routine laboratory tests to determine the hormonal activity of the AT, and 3) treatment of the AT with laparoscopic surgery. Exclusion criteria: 1) patients who did not undergo surgery, 2) patients who underwent multiple surgeries concurrently, 3) patients treated for AT with other surgical methods, and 4) patients with incomplete preoperative radiological examination. A total of 396 patients were included in the study, and an additional 117 patients were collected for prospective validation (Figure 1A). All surgical procedures were performed by a cohesive team within the same department at a single center, led by an expert surgeon with 35 years of experience.

Research method
Referring to previous studies (9,10,16-20) and combining practical experience, we defined cases as having serious surgical difficulty if any of the following conditions were met: [...] A total of 1967 radiomics features were extracted (Figure 2). Data on patients' clinical conditions and treatment were obtained from the computerized physician order entry and medical record management system (Winning Health Technology Group Co., Ltd., Shanghai, China). Patients were randomly divided into a training set and a validation set at a ratio of 7:3. The training set was used for model construction, and the validation set was used for model validation.
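The 7:3 random split can be sketched as follows. This is a minimal illustration only: the function name, the fixed seed, and the use of integer patient identifiers are assumptions, and the text does not state whether the split was stratified by surgical difficulty.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=42):
    """Randomly split patient IDs into training and validation sets.

    The seed is a hypothetical choice made only so the sketch is reproducible.
    """
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 396 hypothetical patient identifiers, matching the retrospective cohort size.
train_ids, validation_ids = split_train_validation(range(396))
print(len(train_ids), len(validation_ids))
```

With 396 patients, this particular rounding rule yields 277 training and 119 validation cases; the set sizes actually used in the study are those reported in Table 1.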

Statistical methods
Data were further analyzed using R 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria). All continuous variables were non-normally distributed and were presented as median [interquartile range]; categorical variables were presented as frequency and percentage (%). Analysis of variance (ANOVA) was used to compare differences among the training set, validation set, and prospective set. The consistency of the regions of interest (ROIs) drawn by the two urologists was evaluated using the intraclass correlation coefficient (ICC), and features with an ICC below 0.75 were excluded. Radiomics features were subjected to univariable logistic regression analysis using the "glmnet" package. Features with a P-value greater than 0.05 were considered unrelated and subsequently excluded. Key features were selected using the Least Absolute Shrinkage and Selection Operator (LASSO), and the radiomics score (Rad-score) was calculated based on the results. Using the "mlr3" package, seven ML models were developed by combining the Rad-score with clinical characteristics. These models included Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), LASSO, Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Extreme Gradient Boosting (xGBoost). The optimal models were selected based on precision, recall, area under the curve (AUC), F1 score, calibration curve, Receiver Operating Characteristic (ROC) curve, and decision curve analysis (DCA) in the training set, validation set, and prospective set. Shapley Additive exPlanations (SHAP) values were used to demonstrate the impact of each variable in the respective model (22).
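The threshold-based metrics and the AUC named above can be illustrated with a small standard-library sketch. The labels, predicted probabilities, and the 0.5 cut-off below are invented toy values, not study data, and the study itself computed these metrics in R.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 score for binary labels (1 = difficult surgery)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1, auc(y_true, scores))
```

The pairwise AUC is quadratic in sample size, which is acceptable for cohorts of a few hundred patients; rank-based implementations would scale better.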

General information
A total of 396 patients were included in the study. Patients were randomly divided into a training set and a validation set at a ratio of 7:3. Baseline patient characteristics are shown in Table 1. A total of 130 patients were considered to have high surgical difficulty due to meeting one or more criteria, with specific reasons shown in Figure 1E. An additional 117 patients were collected and regarded as a prospective set. ANOVA showed no statistically significant differences in baseline characteristics between sets.

Radiomics feature selection
Consistency was assessed using ICC, and 619 radiomics features with an ICC lower than 0.75 were excluded to eliminate the interference of human factors on the model. Univariable logistic analysis was then performed, and 1069 variables with P > 0.05 were excluded. The remaining 279 features underwent dimensionality reduction using LASSO with ten-fold cross-validation. As the logarithm of the tuning parameter (λ) changed on the horizontal axis, the AUC on the vertical axis also changed. The corresponding number of selected variables is shown in Figure 2A. A risk factor classifier was constructed using LASSO (Figure 2B), with 18 features selected (Table 3). The optimal λ value was 0.0173, with a logarithm of -4.055.
Based on the LASSO results (Table 3), the Rad-score was calculated. The specific calculation formula is provided in the Supplementary Data.
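A Rad-score of this kind is conventionally a linear combination of the LASSO-retained features weighted by their coefficients, plus an intercept. The sketch below assumes that form; the coefficient values, intercept, and feature values are hypothetical placeholders (only the feature names appear in this article), and the actual formula is the one given in the Supplementary Data.

```python
def rad_score(features, coefficients, intercept=0.0):
    """Linear Rad-score: intercept + sum of coefficient * feature value."""
    missing = set(coefficients) - set(features)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


# Hypothetical coefficients for three of the retained features.
example_coefficients = {
    "Maximum3DDiameter": 0.5,
    "90Percentile": -0.2,
    "DifferenceEntropy": 0.3,
}
# Hypothetical standardized feature values for one patient.
example_features = {
    "Maximum3DDiameter": 1.2,
    "90Percentile": 0.5,
    "DifferenceEntropy": -0.4,
}

score = rad_score(example_features, example_coefficients, intercept=-1.0)
print(round(score, 2))  # -0.62
```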

Construction of clinical-radiomics machine learning models
The Rad-score calculated above was combined with clinical characteristics to construct CART, KNN, LASSO, NB, RF, SVM, and xGBoost models (Table 2). Additionally, we constructed a clinical model for comparison using stepwise logistic regression based on the clinical characteristics of the patients. All models demonstrated high predictive ability in the training set, with acceptable consistency in the validation and prospective sets (Figures 3A, B).
A comprehensive evaluation of precision, F1 score, ROC curve, and AUC values in the training, validation, and prospective sets revealed that the RF model had more stable predictive performance, followed by xGBoost and LASSO (Figures 3C-E). According to DCA, xGBoost can significantly benefit patients (Figure 3F).
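Decision curve analysis compares models by their net benefit at a given threshold probability, using the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold. The sketch below computes one point of such a curve on invented toy data; the function names and values are illustrative assumptions, not the study's implementation.

```python
def net_benefit(y_true, probabilities, threshold):
    """Net benefit of acting on predictions at or above the threshold probability."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probabilities) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probabilities) if p >= threshold and t == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: regard every patient as a difficult case."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy labels and predicted probabilities (hypothetical).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

model_nb = net_benefit(labels, predicted, threshold=0.5)
all_nb = net_benefit_treat_all(labels, threshold=0.5)
print(model_nb, all_nb)  # 0.25 0.0
```

A full decision curve repeats this calculation over a range of thresholds and plots the model against the treat-all and treat-none (net benefit 0) strategies.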
SHAP values and SHAP plots were used to display the importance of each variable in the RF and xGBoost models. According to SHAP, variable importance was similar in the two models, and in both the Rad-score had the most significant impact. Other clinical characteristics such as hemoglobin (Hb), age, body mass index (BMI), gender, and diabetes mellitus also influenced the difficulty (Figure 4).
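SHAP attributions are based on Shapley values, which average each variable's marginal contribution over all orders in which variables can be added to a baseline. For a handful of variables this can be computed exactly, as in the sketch below. The scoring function, its weights, and the patient and baseline values are hypothetical and serve only to show the mechanics; in practice SHAP relies on efficient approximations or tree-specific algorithms rather than full enumeration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            current[name] = instance[name]  # switch this feature to the patient's value
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical difficulty score with a Rad-score/diabetes interaction."""
    return (2.0 * x["rad_score"] + 0.5 * x["diabetes"]
            + 1.0 * x["rad_score"] * x["diabetes"] + 0.01 * x["age"])


patient = {"rad_score": 1.0, "diabetes": 1.0, "age": 60.0}
reference = {"rad_score": 0.0, "diabetes": 0.0, "age": 50.0}

phi = shapley_values(toy_model, patient, reference)
print(phi)
```

The attributions sum to the difference between the patient's score and the baseline score, which is the property that lets SHAP plots decompose an individual prediction.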

Discussion
AT has become a hot topic in the medical field, and surgery is the primary treatment method. TPLA and RPLA were proposed in 1992 (8,23), respectively, and have continuously improved. The advantage of RPLA is that it results in less surgical trauma and bleeding, faster recovery, and fewer complications. Moreover, it is also suitable for some cases that are challenging for traditional surgery, such as obesity and complex AT. Relatively objective systems exist for analyzing the surgical difficulty of TPLA, whereas RPLA has been analyzed less (10,17).
This study retrospectively analyzed 396 patients who underwent RPLA for AT. LASSO analysis of radiomics features was used to calculate the Rad-score. By combining the Rad-score with preoperative clinical characteristics, ML models such as CART, KNN, LASSO, NB, RF, SVM, and xGBoost were constructed and compared. RF had more stable prediction accuracy, while xGBoost could bring more significant benefits to patients. The ML models suggested that, in addition to the most influential Rad-score, clinical characteristics such as Hb, age, BMI, gender, and diabetes mellitus also greatly influenced surgical difficulty. Validation in the validation set and prospective set showed that the ML models had high predictive ability. In the comprehensive comparison of different models, the RF model exhibited the best prediction performance, making it our recommended model. Furthermore, compared with the clinical models in a previous study (16), our RF model was superior, as shown by a test with 2000 bootstrap replicates (D = 7.155, P < 0.001). The discrimination power of models can be effectively compared using two measures: the Net Reclassification Index (NRI) and the Integrated Discrimination Improvement (IDI). In comparison to previous studies, the RF model in this study demonstrated an NRI of 0.308 (95% CI: 0.194-0.422, P < 0.001) and an IDI of 0.165 (95% CI: 0.119-0.210, P < 0.001).
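The two comparison measures can be illustrated with a short sketch. It uses the standard IDI definition and, as an assumption, the category-free (continuous) form of the NRI, since the text does not state which NRI variant was computed. The probabilities below are invented toy values, not the study's predictions.

```python
def mean(values):
    return sum(values) / len(values)


def idi(y_true, p_old, p_new):
    """Integrated Discrimination Improvement of the new model over the old one."""
    ev = [i for i, t in enumerate(y_true) if t == 1]
    ne = [i for i, t in enumerate(y_true) if t == 0]
    gain_events = mean([p_new[i] for i in ev]) - mean([p_old[i] for i in ev])
    gain_nonevents = mean([p_new[i] for i in ne]) - mean([p_old[i] for i in ne])
    return gain_events - gain_nonevents


def continuous_nri(y_true, p_old, p_new):
    """Category-free NRI: net correct movement of predicted risk in each group."""
    def net_up(indices):
        up = sum(1 for i in indices if p_new[i] > p_old[i])
        down = sum(1 for i in indices if p_new[i] < p_old[i])
        return (up - down) / len(indices)
    ev = [i for i, t in enumerate(y_true) if t == 1]
    ne = [i for i, t in enumerate(y_true) if t == 0]
    return net_up(ev) - net_up(ne)


# Toy outcome labels and predicted risks from a reference and a new model.
outcome = [1, 1, 1, 0, 0, 0]
risk_reference = [0.6, 0.5, 0.4, 0.5, 0.4, 0.3]
risk_new = [0.8, 0.7, 0.3, 0.4, 0.2, 0.4]

print(continuous_nri(outcome, risk_reference, risk_new))
print(idi(outcome, risk_reference, risk_new))
```

Positive values of either measure indicate that the new model moves predicted risk in the right direction more often, or by a larger margin, than the reference model; confidence intervals such as those reported above are typically obtained by bootstrapping.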
The Rad-score calculated based on LASSO significantly impacts the surgical difficulty of RPLA. In univariable logistic regression, 279 features were statistically significant. After LASSO, 18 variables were retained and used to construct the Rad-score. The final retained variables included "Shape Features" such as "Maximum3DDiameter". Many studies have confirmed that the maximum diameter of the tumor is an essential factor affecting the difficulty of removing AT (9,10,16-20). In addition, "First Order Features", which are linearly correlated with the CT value of the tumor, such as "90Percentile", were also included. Malignant and benign AT have different degrees of enhancement during arterial enhancement, which increases the risk of bleeding during surgery (18,24,25). It may also be because lipid-rich AT has lower CT values and requires more attention during surgery to prevent breaking the capsule, which prolongs the operation time (9,16,20). "DifferenceEntropy" in "GLCM Features" measures the randomness or complexity of differences between pixel intensity values. It was included because malignant tumors, such as metastases, exhibit more randomness or complexity between pixel intensity values, and their removal is more challenging than that of benign tumors (18).
Some clinical characteristics of patients also affect the difficulty of RPLA. Patients with diabetes mellitus are more likely to have perirenal fat adhesions, which affect surgical difficulty (26). Studies by Chen (17) and Takeda (27) have also shown that diabetes mellitus significantly affects surgical difficulty. Some studies also suggest that a history of hypertension and coronary heart disease affects surgical difficulty (28,29). BMI is used to assess the degree of obesity and also affects surgical difficulty. However, it mainly reflects overall body fat composition, while the distribution of visceral fat, especially perirenal fat, may differ (9,16). Therefore, there is still controversy over the use of BMI to predict surgical difficulty. Some studies suggest that measuring visceral fat would be more accurate (10,25,29).
Hb reflects the patient's blood reserve and blood oxygen reserve (30). If it is too low, it will affect the surgery. Age affects almost all tumor surgeries and prognoses because older patients often have poorer nutrition and tolerance. Moreover, diseases tend to be more malignant in older patients (31,32). In addition, some researchers believe that males may have more dangerous lifestyles (such as smoking), and that differences in hormone levels between men and women may lead to poorer physical condition and greater surgical difficulty in male patients (33,34). This study established ML models for predicting the difficulty of RPLA based on preoperative radiomics and clinical characteristics. The models were validated internally and prospectively, showing that they can significantly improve patients' net benefit rate.
More accurate prediction models for the difficulty of RPLA are needed. The innovation of this study lies in combining ML with radiomics to analyze the risk factors for the difficulty of RPLA and to establish prediction models for it, followed by internal and prospective validation to make the models more meaningful. Moreover, this study is currently one of the largest cohorts using radiomics to predict the difficulty of RPLA.
The prospects of this study include external validation to further confirm the models' stability and accuracy, using radiomics to analyze the tumor's surrounding environment in addition to the AT itself, and optimizing the model with more ML algorithms. Some studies have proposed that magnetic resonance imaging, which has multiple weighted sequences, may perform better than CT when applied to radiomics. Although the accuracy of this study's models is high, the time cost of drawing ROIs is also high. For further promotion or clinical translation, it will be necessary to use deep learning to train artificial intelligence to draw ROIs. Some studies have successfully trained artificial intelligence to draw ROIs for pancreatic duct tumors and predicted lymph node metastasis and prognosis based on them, with sensitivity and specificity superior to those of clinical and radiomics models (35).
In conclusion, Rad-score, Hb, age, BMI, gender, and diabetes mellitus affect RPLA surgical difficulty. The ML prediction models established with RF and xGBoost based on patient clinical characteristics and the Rad-score have good predictive performance. With these models, surgeons can effectively evaluate the difficulty of RPLA, thereby reducing surgical risks and improving patient benefits.

FIGURE 1 The process of this study. (A) Flowchart of this study; (B) Original CT images; (C) Drawing of regions of interest [ROIs]; (D) 3D reconstruction of the ROIs; (E) Venn plot of the reasons for the difficulty of surgery.

FIGURE 3 The performance of machine learning models. (A) Calibration curve of validation set; (B) Calibration curve of prospective set; (C) Receiver operating characteristic [ROC] curves of machine learning models in training set; (D) ROC curves of machine learning models in validation set; (E) ROC curves of machine learning models in prospective set; (F) Decision curve analysis curves of machine learning models.

TABLE 1
Baseline clinical and radiomics characteristics of patients.

TABLE 2
Comparison of machine learning model performance.

TABLE 3
Radiomic features selected by LASSO.