Automated Machine Learning Model Development for Intracranial Aneurysm Treatment Outcome Prediction: A Feasibility Study

Background: The prediction of aneurysm treatment outcomes can help to optimize the treatment strategies. Machine learning (ML) has shown positive results in many clinical areas. However, the development of such models requires expertise in ML, which is not an easy task for surgeons. Objectives: The recently emerged automated machine learning (AutoML) has shown promise in making ML more accessible to non-computer experts. We aimed to evaluate the feasibility of applying AutoML to develop the ML models for treatment outcome prediction. Methods: The patients with aneurysms treated by endovascular treatment were prospectively recruited from 2016 to 2020. Treatment was considered successful if angiographic complete occlusion was achieved at follow-up. A statistical prediction model was developed using multivariate logistic regression. In addition, two ML models were developed. One was developed manually and the other was developed by AutoML. Three models were compared based on their area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUROC). Results: The aneurysm size, stent-assisted coiling (SAC), and posterior circulation were the three significant and independent variables associated with treatment outcome. The statistical model showed an AUPRC of 0.432 and AUROC of 0.745. The conventional manually trained ML model showed an improved AUPRC of 0.545 and AUROC of 0.781. The AutoML derived ML model showed the best performance with AUPRC of 0.632 and AUROC of 0.832, significantly better than the other two models. Conclusions: This study demonstrated the feasibility of using AutoML to develop a high-quality ML model, which may outperform the statistical model and manually derived ML models. AutoML could be a useful tool that makes ML more accessible to the clinical researchers.


INTRODUCTION
Endovascular therapy is widely used in the treatment of intracranial aneurysms (1). Despite a remarkable advancement of the endovascular coiling for intracranial aneurysms, there still exists a high rate of recurrence and recanalization. It has been reported that the recanalization rate for coiling and flow diversion are 20.8 and 10.2%, respectively (2,3). Approximately up to 50% of patients who succumbed to recurrence or recanalization necessitated further treatment, which may place an additional financial burden on the patients. Moreover, recanalization puts patients at increased risk of a thromboembolic event or aneurysm rupture. Many studies have tried to study the risk factors for recanalization. The aneurysm size, morphologies, treatment strategies, and hemodynamics have been found to be associated recanalization (4)(5)(6)(7)(8)(9). Some studies have tried to develop the models or grading scales to predict treatment outcome (4,(10)(11)(12). However, evaluation of some of the grading scales showed relatively poor performance (13).
In recent years, the machine learning (ML) models, as an alternative to the conventional statistical model, have shown promise in many clinical areas (10,14,15). ML models can learn complex relationships from a large amount of data. Compared with a regression model that focus on statistically significant variables, the ML models can discover non-intuitive patterns from variables which may be overlooked by statistical test (16).
Although the ML models have shown outstanding performance, the development of such models requires expertise in ML. Despite the existence of open-source code libraries, such as Scikit-Learn, PyTorch, and Tensorflow, their use still requires significant experience in programming and knowledge of ML. In addition, a high-quality model usually requires expertise to tune and train. All these problems pose a great challenge for the clinical researchers hoping to adopt ML in their research.
The recently emerged automated machine learning (AutoML) has found a way to close the gap between ML and non-artificial intelligence (non-AI) experts. The emergence of AutoML automates the process of building an ML model which in the past relied on data scientists. This lowers the learning threshold for using ML and allows people without expertise in ML to apply ML to their own area. It has recently been reported that AutoML has helped the physicians to develop the ML models that achieved good performance in the field of medical image analysis and disease risk prediction (17,18). However, such success has not been reported in the field of stroke treatment.
Therefore, in this study, we aimed to evaluate the feasibility of using AutoML to develop the ML models for aneurysm treatment outcome prediction. Treatment was considered successful if angiographic complete occlusion was achieved at follow-up. We developed the prediction models for treatment outcome using three different methods: a statistical multivariate regression Abbreviations: AutoML, automatic machine learning; ML, machine learning; AUPRC, area under precision-recall curve; AUROC, area under receiver-operating characteristic curve; SAC, stent-assisted coiling; FD, flow diversion. model, a manually derived ML model, and an AutoML derived ML model, and compared their performance.

Patient Cohorts
The patients were recruited according to the protocol of a prospective cohort (19). The primary endpoints of the cohort study are an evaluation of the safety and efficacy of interventional treatment for 6 months after surgery, with each participant completing at least 1 year of follow-up. Approval for this study was obtained from the local Institutional Review Board. The data used in the current study were anonymous and the requirement for informed consent was therefore waived. From the prospective cohort, we included the aneurysm cases treated by endovascular treatment. Dissecting aneurysms and fusiform aneurysms, aneurysms with prior treatment, or the cases with missing clinical information were excluded. A total of 395 patients were identified from our center. However, due to loss to follow-up or incomplete record, only 182 patients and 218 aneurysms with complete record of angiographic follow-up were used in the current study.

Data Acquisition
The morphological parameters were measured and calculated from three-dimensional digital subtraction angiography (DSA) images prior to treatment. The measurements were done by two independent neurosurgeons and the average of their readings were used. The clinical symptoms, such as feeling of headache, nausea, vomit, and dizziness were recorded. The blood tests were also performed for the patients prior to treatment to measure lipid level and blood clotting function. Additionally, the patient demographics, medical history, and lifestyle behaviors were recorded. Treatment related parameters, such as treatment method, number of coils stent metal coverage rage (MCR) were included. Immediate angiographic outcome after treatment and follow-up angiographic outcome were also recorded according to the Raymond-Roy Occlusion Classification scale (20). Treatment was considered successful if complete occlusion was achieved at follow-up. The average follow-up time for the coiling and stent-assisted coiling (SAC) cases is 9.4 and 14.2 months for flow diversion cases. The complete list of collected variables is shown in Table 1.

General Procedures of ML
The general procedures of ML include the following steps: feature selection, feature engineering, ML model selection, and hyperparameter tuning, as shown in Figure 1. In feature selection, the features that are relevant to the prediction target are selected based on various criteria, such as ANOVA Fvalue, chi-squared statistics, univariate statistical significance P value, and information gain. Feature selection help to identify and focus on the useful features. In feature engineering, raw features can be normalized, binarized, decomposed, or combined to create new features, which might help to better model the data. In model selection, various ML algorithms are evaluated on the dataset and the best is selected. Common ML algorithms, to list a few, include Support Vector Machine, K-Nearest Neighbors, Decision Tree, Artificial Neural Network, Random Forest, and Naïve Bayes. All these algorithms have a wide range of hyperparameters that require careful adjustment to suit different tasks and datasets. For example, Random Forest have more than a dozen of hyperparameters, such as maximum number of tresses, maximum tree depth, maximum number of features, and minimum samples in leaf. In hyperparameters tuning, the optimal hyperparameters are usually found using grid-search or randomized grid-search over millions of possible combinations of hyperparameters.
A pipeline consists of a combination of specific methods for feature selection, feature engineering, ML algorithm, and a specific set of hyperparameters. To build a good ML model, one needs to identify an optimal pipeline that achieves best performance on the dataset.

Automated Machine Learning
Automated machine learning automates the above pipelines and explores different choices of algorithms, feature selection and feature engineering technique, and hyperparameters. Since each major step in the pipeline involves dozens to millions of choices, complete exploration of all possible pipelines is inefficient and impractical. To speed-up the search process, we employed an algorithm called Tree-based Pipeline Optimization (TPOT) to automate the pipeline search. TPOT is based on the evolutionary algorithm which uses genetic programming to search for optimal pipeline (21). Genetic programming mimics the way of natural selection. Briefly, in each optimization run (generation), TPOT randomly generates multiple pipelines (population). These pipelines were evaluated based on their accuracy (fitness to survive). The best few pipelines (scored by accuracy) were selected into the next optimization run (selection). The selected pipelines were then randomly modified (mutation and crossover) in which a few of the pipeline elements (e.g., ML models, feature selection methods, and feature processing method) are changed. Several generations are run and the pipeline that performed best on the training set is selected as the optimal pipeline.
In the current study, AutoML was used on the training set to obtain an optimal pipeline. To avoid overfitting, 10-fold crossvalidation was used. For the setting of AutoML, the number of generations to run was set to 10 and the population size at each generation was 100. Increasing the number of generations or the population size can result in higher chance of discovering better pipelines but at the cost of computational time. In the current study, the program was run on a desktop computer (CPU: Intel i7 8700) for ∼1 h.
After obtaining the optimal pipeline, the derived model was evaluated on the test set. To further avoid overoptimistic results due to random split of the training and test set, the above procedures were repeated 20 times and each time with a different split of training and test set. The average performance from the 20 repeats was reported. The training and evaluation procedures are shown in Figure 2.

Conventional ML
For comparison purposes, an ML model was trained manually using a typical method found in the literature. Random forest is one of the most popular algorithms used in the literature and is often found to have better performance than logistic regression (22)(23)(24). To represent a typical scenario in which a non-ML-expert develops an ML model for clinical research, we applied the same training procedures as described in the work of Rubber et al. The manual pipeline started with feature processing using normalization, and model building using the random forest algorithm. The hyperparameter of the algorithm (number of trees) was tuned between 5 and up to 5,000 (24). The hyperparameters were tuned using 10-fold crossvalidation on the training set and the model was tested on the test set. The above procedures were repeated 20 times and each time with a different split of training and test set. The average performance of the 20 repeats was reported. The training and evaluation procedures that were used are shown in Figure 2.

Statistical Model Building
All variables of the successful and unsuccessful cases were compared using the univariate analyses. For binary or categorical variables, the Fisher's exact test or chi-square test was performed. For continuous variables, they were first examined with the Shapiro-Wilk test to determine normality, followed by Student's t-test (for normally distributed variables) or Mann-Whitney U-test (for non-normally distributed variables). The variables with P < 0.05 in the univariate analysis were further selected into multivariate analysis using a backward conditional stepwise method. The statistical analyses were performed using SPSS (IBM Corporation, NY, USA). The variables that remained statistically significant (P < 0.05) in multivariate analysis were used for the statistical model building. For a fair comparison with other methods, a logistic regression model was fitted on the training set and evaluated on the test set. The training and evaluation procedures were also repeated 20 times and each time with a different split of training and test set. The average performance from the 20 repeats was reported.

Model Comparison
Though unsuccessful cases only consist of a small portion (11%) of the dataset, it is more important to identify the unsuccessful cases than the successful cases. To avoid bias introduced by imbalanced data, besides the commonly used receiver operating characteristic (ROC) curve, we also used the precision-recall curve (area under the precision-recall curve [AUPRC]) as the evaluation metric, which is more informative than ROC when evaluating classifier on imbalanced data (26). The precisionrecall curve plots precision, also termed as positive predictive value (PPV), against recall (sensitivity). The AUPRC is a balanced measure of the capability of a model to predict unsuccessful cases. The comparison of the performances of three models in the 20 repeats was examined by Wilcoxon signed ranks test as suggested by a previous study (27).

RESULTS
A total of 182 patients with 218 aneurysms were included. The average aneurysm size was 5.3 mm. The majority of them were located on the internal carotid artery (ICA), followed by the middle cerebral artery (MCA) and anterior communicating artery (AComA). At follow-up, only 24 cases remained unoccluded. The baselines for the successfully treated and unsuccessfully treated group are summarized in Table 1.
In the univariate analysis, aneurysm size, aneurysm width, aneurysm height, presence of nausea, presence of vomit, use of SAC, aneurysm location in the posterior circulation, and the immediate post-procedure angiographic outcome showed statistical significance. In the multivariate analysis, only aneurysm size, use of SAC, and posterior circulation remained as significant variables, as shown in Table 2.
The sensitivity, positive predictive value, area under the receiver operating characteristic curve (AUROC), AUPRC, and F1-score of the three models are summarized in Table 3 The AUPRC of AutoML derived model was significantly higher than that from the statistical model (P < 0.001) and that from manual derived ML model (P = 0.021) and that from the ARSS model (P = 0.011).
The procedures of applying AutoML in clinical settings are shown in Figure 4. The surgeons first prepare data and then run the few lines of code of AutoML and get an automatically generated Python file that contains the optimal pipeline to build a high-quality ML model. The surgeons can then use the generated python code to train an ML model and predict the risk of recanalization. In the current study, the optimal pipeline obtained started with feature selection using recursive feature elimination with Extra-Trees classifier, followed by feature preprocessing using Normalization. The algorithm used to build the model was the Gradient-Boosting classifier.

DISCUSSION
Recanalization and recurrence are the Achilles' Heel of endovascular treatment. This can only be confirmed by a longterm follow-up study. Thus, the question is raised: are there any methods to predict the long-term outcome of embolization? Recently, the ML models have emerged as alternatives to the traditional statistical models used to predict disease risk and therapeutic effect. However, ML is often recognized as complicated technology accessible only to a small fraction of medical researchers and data scientists. The advantage of AutoML is that it allows non-ML experts to utilize the ML models without prior expertise. In this study, we found that AutoML, with the only minimum amount of code, could develop an ML model that performed significantly better than the commonly used statistical model in predicting treatment success.

Comparison of the AutoML Model and Statistical Model
While the statistical models are easy to derive and understand, they have several limitations. They assume linear independence between the variables which may fail to account for interactions  between the variables. The prescreening of variables using P values may also miss important variables which may not appear statistically significant in a univariate test (28). In contrast, the ML models can learn nonlinear and interactive patterns between variables and thus producing a more accurate prediction model. Many studies have reported that an ML model outperformed the statistical models (22)(23)(24). However, there are several drawbacks that limit the use of the ML model in clinical research. One is the black-box problem of an ML algorithm yet this can be improved by applying model interpretation techniques, such as SHAP (29) to explain the prediction made by the ML models. The other problem is that the development of the ML model requires expertise in ML and usually requires the time-consuming tuning of dozens of parameters. We have shown in the current study that this can be improved by using the recently emerging AutoML technique. AutoML can make ML model training more accessible to non-ML experts without compromise in model performance.

Comparison of an AutoML Model and Manually Derived ML Model
We have demonstrated that an AutoML derived model can achieve better performance than a manually derived ML model. The ML models need a careful selection of algorithms and tuning of hyperparameters to achieve their best performance. However, in many clinical studies that apply ML, such tuning is usually not carried out. Therefore, the developed model may not fully exploit the power of ML. In this study, we followed the same procedures mentioned in the literature to manually develop an ML model. This represented a typical scenario in which a non-ML-expert used an open-source library to train an ML model. As a result, the manually developed model is not optimal. In contrast, AutoML can perform extensive searching of different pipelines and tuning of hyperparameters, which resulted in a better model. It has been reported that AutoML outperformed a conventional ML model manually developed by a researcher with a master's degree in computer science. Moreover, AutoML only took less than an hour to train but achieved similar or even better performance than a manually derived ML model which took days to tune (17). Another distinctive advantage of AutoML compared with the conventional ML procedure is that it is much easier to use for surgeons with limited background in ML. As we have shown, the use of the AutoML model requires only a few lines of code, which makes it more accessible to clinical doctors.

Limitations
In the current study, all the cases were from a single center and the number of cases was relatively small. Nevertheless, we have demonstrated that the use of AutoML can help clinical researchers develop high quality ML models that outperformed the statistical models and manually trained ML models. Though the current study is a single-center study with limited cases and follow-up time, the AutoML method presented in the current study can be easily generalized to a study with a larger sample size and longer follow-up time. In the current study, the treatment strategies, such as clipping, liquid embolization, or flow disruption were not assessed. To further test the applicability of our model, more cases from multiple centers with longer follow-up should be analyzed.

CONCLUSIONS
We have demonstrated the feasibility of using AutoML to develop high quality ML model for aneurysm treatment outcome prediction. The AutoML derived model accurately predicted the outcome of treatment, which may facilitate treatment planning. AutoML may outperform the conventional statistical model and manually derived machine learning model. The emergence of AutoML simplifies and automates the process of building an ML model, which lowers the learning threshold of ML and allows non-AI experts to apply ML to their research.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Zhujiang Hospital of Southern Medical University. The Ethics Committee waived the requirement of written informed consent for participation.