Predicting All-Cause Mortality Risk in Atrial Fibrillation Patients: A Novel LASSO-Cox Model Generated From a Prospective Dataset

Background: Although mortality remains high in patients with atrial fibrillation (AF), there have been limited studies exploring machine learning (ML) models on mortality risk prediction in patients with AF. Objectives: This study sought to develop an ML model that captures important variables in order to predict all-cause mortality in AF patients. Methods: In this single center prospective study, an ML-based mortality prediction model was developed and validated using a dataset of 2,012 patients who experienced AF from November 2018 to February 2020 at the First Affiliated Hospital of Shantou University Medical College. The dataset was randomly divided into a training set (70%, n = 1,223) and a validation set (30%, n = 552). A total of 122 features were collected for variable selection. Least absolute shrinkage and selection operator (LASSO) and random forest (RF) algorithms were used for variable selection. Ten ML models were developed using variables selected by LASSO or RF. The best model was selected and compared with conventional risk scores. A nomogram and user-friendly online tool were developed to facilitate the mortality predictions and management recommendations. Results: Thirteen features were selected by the LASSO regression algorithm. The LASSO-Cox model achieved an area under the curve (AUC) of 0.842 in the training dataset, and 0.854 in the validation dataset. A nomogram based on eight independent features was developed for the prediction of survival at 30, 180, and 365 days following discharge. Both the time dependent receiver operating characteristic (ROC) and decision curve analysis (DCA) showed better performances of the nomogram compared to the CHA2DS2-VASc and HAS-BLED models. Conclusions: The LASSO-Cox mortality predictive model shows potential benefits in death risk evaluation for AF patients over the 365-day period following discharge. This novel ML approach may also provide physicians with personalized management recommendations.


INTRODUCTION
AF is one of the most common chronic cardiovascular health problems globally (1)(2)(3). In Europe and the USA, 2-3% of the population suffers from AF (4), and it is estimated that AF will affect 6-12 million people in the USA by 2050 and 17.9 million people in Europe by 2060 (5,6). The incidence of AF is not high among young people but increases with age, reaching more than 10% in those >80 years of age (7). The inevitable global aging of the population, combined with a cumulative increase in chronic cardiovascular diseases, will lead to considerable growth in the number of AF patients in the next few decades. AF is associated with a nearly five-fold increased risk of ischemic stroke (8,9), and provokes significant increases in all-cause mortality along with an important financial burden (10,11). Consequently, the higher risk of all-cause mortality associated with AF has become a significant public health issue (1,11)(12)(13).
Several classic risk scores, including the CHA2DS2-VASc and HAS-BLED scores, predict clinical outcomes such as stroke, bleeding, and mortality (14)(15)(16)(17). Machine learning can learn to identify the underlying patterns and classes in multidimensional data by utilizing computational algorithms (18). Based on novel ML algorithms, more accurate and intelligent models, such as the Global Anticoagulant Registry in the Field (GARFIELD)-AF risk model and the Multilayer Neural Network artificial intelligence model, have been developed (19)(20)(21). In contrast to the high awareness regarding clinical outcomes of AF in Europe and the USA, there is limited knowledge for East Asia. In addition, few ML models have used multi-dimensional features to predict future mortality of AF patients.
Advances in supervised ML allow the recognition and translation of multi-dimensional data into valuable models (21,22). The use of machine learning for predicting clinical outcomes may enable physicians to improve efficiency, reliability, and accuracy of management decisions. In the present study, we used multiple ML approaches that included LASSO feature selection and the Cox proportional hazards regression model to predict all-cause mortality outcome over the 30-365-day period after discharge in patients with AF.

Study Cohort
For machine learning model construction, a prospective observational study was undertaken using data from patients who were hospitalized for evaluation and treatment of AF between November 2018 and February 2020 at the First Affiliated Hospital of Shantou University Medical College. Inclusion criteria were a diagnosis of AF and availability of complete data concerning clinical indicators for evaluating AF and follow-up. The diagnosis of AF required recording the heart rhythm by electrocardiogram (ECG). The three diagnostic criteria on ECG are: (1) absolutely irregular RR intervals, (2) no discernible, distinct P waves, and (3) an episode lasting at least 30 s. Many individuals with AF have both symptomatic and asymptomatic episodes. The exclusion criteria were pregnancy, age ≤ 18 years, or refusal of follow-up.

Data Collection
A systematic clinical evaluation for AF was conducted during the hospitalization when patients were enrolled. Overall, 122 variables were initially used for the selection of key features (Supplementary Table 1), which included medical histories, physical examinations, laboratory examination results, medications, comorbidities, ultrasonic cardiogram, CHA2DS2-VASc score, and HAS-BLED score. Follow-up by outpatient visit and/or telephone interview was carried out at 30, 180, and 365 days after discharge. The main outcome of the AF cohort was all-cause death.
This study complied with the principles of the Declaration of Helsinki and was approved by the Ethics Committee of the First Affiliated Hospital of Shantou University Medical College. All participants provided written informed consent to participate in this study. All procedures were performed in conformity with the European Society of Cardiology guidelines (23).

Variable Selection and Model Development
Due to the 122 variables present in the dataset, conducting variable selection was necessary and could lead to improved prediction performance. Both the LASSO algorithm (24) and RF (25) were used to select the features for model training. The top 20 predictor variables were chosen using RF based on relative variable importance (26).
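To illustrate how an L1 penalty performs variable selection, the following is a minimal sketch of LASSO fitted by cyclic coordinate descent. It is an assumption-laden teaching example, not the study's pipeline: it uses a least-squares loss rather than the Cox partial likelihood, the data are synthetic, and the function names (`soft_threshold`, `lasso_coordinate_descent`) and the penalty value `lam=0.3` are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Synthetic example (assumption): only features 0 and 3 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.5) for row in X]
beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

Coefficients of uninformative features are driven to exactly zero, which is the property that lets LASSO reduce a large candidate set to a short list of predictors. In practice the penalty would be tuned by cross-validation rather than fixed by hand.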
We used five algorithms, including Cox regression, RF, support vector machines (SVM) (27), backpropagation neural networks (BP-NN) (28), and gradient boosting (GB) (29), to train models using the variables that were selected by LASSO and RF, giving ten models in total. Ultimately, the best model was selected and compared with conventional risk scores.

Statistical Analyses and Model Performance Measures
Statistical analyses were performed using SPSS 23.0 (SPSS Inc., Chicago, Illinois, USA), X-tile 3.6.1 (30), and R (version 4.0.2; R Foundation for Statistical Computing, Vienna, Austria) software. Continuous variables are presented as the mean ± standard deviation. We used multiple imputation to account for missing data on continuous variables if missing data was <30% (31). Missing values were imputed using the "mice" package. Categorical variables are presented as numbers and percentages. Statistical differences of continuous variables were examined by two-tailed t-tests or Mann-Whitney U tests. Categorical variables were analyzed by the chi-square test or Fisher exact test. Various R packages were used to conduct this study. The glmnet package was used for logistic regression with LASSO regularization (32). The randomForest, e1071, neuralnet, and gbm packages were used for the RF, SVM, BP-NN, and GB models, respectively (29,33).
The predictive accuracy of the LASSO-Cox model was compared with the performances of the CHA2DS2-VASc and HAS-BLED scores. The performances of the models were assessed by the AUC derived from receiver operating characteristic curves. A nomogram for predicting the 30-, 180-, or 365-day survival was established using the LASSO-Cox regression model, and the cut-off value for mortality risk stratification was calculated. The nomogram and calibration plots were generated with the rms package. The pROC package was used to plot ROC curves. Kaplan-Meier curves were produced using the survival package. P < 0.05 was considered to indicate statistical significance.
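As a rough illustration of the AUC metric used here, the sketch below computes it as the probability that a randomly chosen patient with the event receives a higher risk score than a randomly chosen patient without it, counting ties as one half. This is the static, binary-outcome AUC only; it does not reproduce the time-dependent ROC analysis, and the function name and toy data are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of event and non-event scores."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # concordant pair
            elif sp == sn:
                wins += 0.5   # tied pair counts half
    return wins / (len(pos) * len(neg))


# Toy data (assumption): predicted risks and observed outcomes (1 = died).
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the values reported for the models.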

Patient Baseline Demographics
This study was conducted according to the flow chart shown in Figure 1. Eligible study participants consisted of 1,775 AF patients. A total of 1,223 AF patients were randomly assigned to the training dataset and 552 patients to the validation dataset. Baseline characteristics of the study cohort are shown in Table 1. The mean age was 69.22 years (SD = 12.05 years) for the training dataset and 69.02 years (SD = 11.65 years) for the validation dataset. The mean CHA2DS2-VASc score was 3.37 (SD = 1.18) in the training set and 3.19 (SD = 1.80) in the validation set. There were no significant differences in diabetes, atherosclerosis, prior stroke, heart failure, cerebral hemorrhage, cancer, renal insufficiency, bleeding, current smoker status, statin medication, and urine ketone bodies in the training set compared with the validation set. An all-cause mortality end point event occurred for 194 of the 1,775 patients (10.9%, 111 males and 83 females), 143 in the training set (11.7%) and 51 in the validation set (9.2%). There was no significant difference in all-cause death rate between the training and validation sets.
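A random training/validation partition of this kind can be sketched as follows. The function name, the seed, and the exact 0.7 fraction are assumptions for illustration; the reported split of 1,223 and 552 patients corresponds to roughly 69% and 31% of the 1,775 eligible patients, so the authors' procedure evidently differed in detail from this sketch.

```python
import random


def split_indices(n, train_fraction, seed):
    """Shuffle patient indices reproducibly and cut them into two disjoint sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_fraction)  # truncate so the two parts always sum to n
    return idx[:cut], idx[cut:]


# Hypothetical seed; n matches the number of eligible patients in the text.
train_idx, valid_idx = split_indices(1775, 0.7, seed=42)
```

Fixing the seed makes the partition reproducible, which matters when model comparisons are rerun on the same split.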

Feature Selection and Model Performance Comparison
LASSO coefficient profiles of the 122 variables and ten-fold cross-validation for tuning parameter selection in the LASSO model are shown in Figure 2. Thirteen variables were selected by the LASSO regression algorithm, including CHA2DS2-VASc, stroke, cancer, red cell volume distribution width-coefficient of variation (RDW-CV), statin medication use, lymphocyte ratio, neutrophil-to-lymphocyte ratio, basophilic granulocyte number, urine ketone body (KET), blood glucose (GLU), blood urea nitrogen (BUN), cholinesterase (CHE), and monoamine oxidase (MAO). In addition, the top-20 variables were selected by the RF algorithm (Supplementary Table 2). Next, we built 10 models using these two sets of selected features, and their prediction performances were described using AUC, sensitivity, and specificity (Figure 3). The key performance of machine learning was evaluated by AUC.

Nomogram Construction
Based on the Cox proportional hazards regression analysis, we identified eight independent risk factors in the training cohort (Table 3).
A nomogram based on the eight independent features from the training cohort was developed for the prediction of the 30-, 180-, and 365-day survival (Figure 4). The nomogram demonstrated that MAO contributes the most to survival, followed by CHE, KET, BUN, CHA2DS2-VASc, stroke, statin use, and cancer. The total score, obtained by adding the scores for each of the eight features, helped in estimating the 30-, 180-, and 365-day survival rate for each individual patient.  (Supplementary Figure 1). The calibration plots of our nomogram also showed good agreement between the actual observations and the predicted outcomes in both the training set and validation set (Supplementary Figure 2) for all time points. Thus, the above nomogram-based results displayed good accuracy for predicting the 30-, 180-, and 365-day survival of AF patients.
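As a hedged illustration of how a nomogram's total score relates to predicted survival, the standard Cox proportional hazards relationship can be written as below. The symbols are generic and are not the fitted values from this study: $x_k$ denotes the value of the $k$-th of the eight predictors, $\beta_k$ its regression coefficient, and $S_0(t)$ the baseline survival function.

```latex
\[
  \mathrm{LP}(x) = \sum_{k=1}^{8} \beta_k x_k,
  \qquad
  S(t \mid x) = S_0(t)^{\exp\left(\mathrm{LP}(x)\right)},
  \qquad
  t \in \{30,\ 180,\ 365\}\ \text{days}
\]
```

A nomogram rescales each term $\beta_k x_k$ to points, so the total score is a linear transformation of $\mathrm{LP}(x)$, and a larger linear predictor gives a lower predicted survival probability at every time point.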

Comparison of the Nomogram With CHA2DS2-VASc and HAS-BLED Models for Predictive Performance
The time-dependent ROC curves of the training and validation sets (Figure 5) showed better discrimination for the nomogram than for the traditional CHA2DS2-VASc and HAS-BLED models. These results indicate that our nomogram has greater potential for accurately predicting prognosis compared to the traditional models. DCA was performed to compare the net benefit of the nomogram with that of the traditional CHA2DS2-VASc and HAS-BLED scores. Compared to the CHA2DS2-VASc and HAS-BLED scores, the curve of our nomogram showed larger net benefit (Figure 6). We further converted the nomogram to a web calculator for the clinician's convenience (https://afnom.shinyapps.io/DynNomapp/).
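The net benefit plotted in a decision curve can be sketched with the usual definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The code below is a minimal binary-outcome illustration with assumed function names and toy data; it ignores censoring and is not the survival-time DCA used to produce Figure 6.

```python
def net_benefit(probs, events, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(probs)
    tp = sum(1 for p, e in zip(probs, events) if p >= threshold and e == 1)
    fp = sum(1 for p, e in zip(probs, events) if p >= threshold and e == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(events, threshold):
    """Reference strategy: classify every patient as high risk."""
    prevalence = sum(events) / len(events)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (assumption): predicted risks and observed outcomes (1 = died).
toy_probs = [0.9, 0.6, 0.4, 0.1]
toy_events = [1, 0, 1, 0]
nb_model = net_benefit(toy_probs, toy_events, threshold=0.2)
nb_all = net_benefit_treat_all(toy_events, threshold=0.2)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the net benefit of treating no one. Repeating the calculation over a grid of thresholds traces out the decision curve.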
In addition, the optimal cut-off point was determined using the X-tile program to accomplish risk stratification. As shown in Supplementary Figure 3, the optimal cut-off point was 0.8. Thus, we stratified the AF patients into a low-risk group (≤0.8) and high-risk group (>0.8). Kaplan-Meier curves showed that the high-risk group exhibited poorer survival than the low-risk group in both the training and validation sets (Supplementary Figure 4).
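The stratification and survival comparison described above can be sketched as follows: patients are labelled by the 0.8 cut-off, and a Kaplan-Meier estimate is computed per group. The function names and the toy follow-up data are assumptions, and the scale on which the 0.8 cut-off is measured is taken as given from the text.

```python
def stratify(risk_scores, cutoff=0.8):
    """Label each score low-risk (<= cutoff) or high-risk (> cutoff)."""
    return ["high" if r > cutoff else "low" for r in risk_scores]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Toy follow-up data in days (assumption), not taken from the study.
toy_times = [30, 180, 180, 365, 365]
toy_deaths = [1, 1, 0, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_deaths)
toy_groups = stratify([0.5, 0.8, 0.81])
```

Running `kaplan_meier` separately on the low-risk and high-risk groups yields the two curves whose separation is compared in a plot of this kind.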

DISCUSSION
This study investigated a novel LASSO-Cox model for the prediction of all-cause mortality in patients with AF to identify AF patients at high risk and to provide personalized treatment using a data-driven approach. Several important findings were identified. First, eight independent risk factors predicted all-cause mortality, including CHA2DS2-VASc score, CHE, KET, BUN, MAO, stroke, statin medication use, and cancer. Second, a LASSO-Cox model for 30-, 180-, and 365-day risk prediction was established and validated. Third, the use of the nomogram and risk stratification enables the prediction of mortality for AF patients.
Machine learning can identify non-linear associations and interactions among complex and multidimensional variables. The use of the LASSO ML algorithm for variable selection is a well-established method that has been previously utilized for cancer, heart failure, and AF populations (34)(35)(36). The advantages of the LASSO algorithm are high accuracy and stability. Cox proportional hazards regression is a traditional model that is mainly used to analyze the prognosis of cancer and other chronic diseases. Indeed, our LASSO-Cox model was robust and displayed good discriminatory power in predicting all-cause mortality in both the training and the validation datasets.
There is growing evidence that AF significantly worsens the mortality rate (37)(38)(39). Furthermore, AF is an independent risk factor for higher risk of mortality (11). While worse outcomes among AF patients have been confirmed in various studies from Europe and North America, data from East Asia is limited.
Traditional guidelines in AF have focused on identifying patients with different risks of stroke and major bleeding. Several studies have developed and examined prediction models or risk scores in AF patients for stroke, major bleeding, or composite outcomes, although not exclusively for death outcomes (19,23,40). Recently, a death risk score based on age, biomarkers, and clinical history (ABC) was developed and performed well in two large independent clinical trial cohorts (41). However, the detection of novel biomarkers such as GDF-15 is not easily performed in developing countries and regions.
In this LASSO-Cox model, not taking statins is an independent risk factor for AF-associated death. As recently reported, the levels of total cholesterol (TC) are non-linearly associated with all-cause mortality, as well as cancer and cardiovascular disease mortality, in the American population (42). Thus, it is necessary to maintain TC in a moderate range by statin medication. The GARFIELD-AF and ROCKET AF studies have shown that heart failure and sudden cardiac death are the major reasons for death of AF patients taking oral anticoagulant medication (38,43). Death risk prediction in these patients may give rise to more intense management of risk factors, such as valvular heart disease, myocardial dysfunction, and coronary heart disease.
Among the independent risk factors of death, the four common laboratory examination indicators, including MAO, BUN, CHE, and KET, are strongly associated with mortality. Contemporary AF trials show that cardiac-related deaths account for the vast majority of all deaths, whereas stroke and bleeding represent a small fraction (44). In our study, MAO is recognized as the most important mortality risk factor in AF patients. Elevated MAO is known to be associated with liver cirrhosis and chronic congestive heart failure. Recent studies show that MAO is a major source of deleterious reactive oxygen species (ROS), regulating cardiomyocyte aging or death (45,46). Myocardial ROS are involved in the pathophysiology of cardiovascular diseases such as hypertension and heart failure (47,48), and are important markers of atrial fibrillation in patients after cardiac surgery (49). Thus, MAO inhibition therapy is protective in several settings of cardiac stresses such as pressure overload heart failure, diabetic cardiomyopathy and chronic ischemic heart disease (47). Further studies exploring the potential relationship between AF and ROS are needed.
FIGURE 6 | Decision curve analysis of the nomogram, CHA2DS2-VASc score and HAS-BLED score. The y-axis represents the net benefit and the x-axis represents the threshold probability. The null plot represents the assumption that no patients survive, while the all plot represents the assumption that all patients survive at a specific threshold probability.
Increased BUN levels are mainly triggered by impaired renal function, which might be highly related to the occurrence of ischemic stroke in AF patients despite adequate therapeutic warfarin anticoagulation (50). A Swedish study showed that neoplastic disease and renal failure contribute to the increased risk of all-cause mortality in AF patients, which is consistent with our result (11). A decline in cholinesterase is associated with advanced liver cirrhosis, hepatic failure, and myocardial infarction. Inhibition of CHE has been reported to directly affect the intrinsic cardiac nervous system (51). In addition, increased levels of KET reflect the severity of diabetes, and AF patients with diabetes mellitus have a higher mortality rate (52)(53)(54). Collectively, the above risk factors suggest that a renewed emphasis on the management of comorbidities such as liver cirrhosis, renal dysfunction, heart failure, and diabetes mellitus is essential to improve the overall survival and quality of life in AF patients.
The nomogram could provide clinicians with the opportunity to assess risk of all-cause mortality by using a data-driven approach. An additional strength of the LASSO-Cox model is that the eight predictive factors in this nomogram are widely and easily available internationally. In order to facilitate medical use, the clinical implementation of the LASSO-Cox model can either be based on the nomogram, or preferably an online tool.

Limitations
Several limitations of this LASSO-Cox model should be considered. First, validation of this model was performed using a dataset generated from a single center. The performance of our LASSO-Cox model needs to be tested in external datasets from other institutions. Second, the LASSO-Cox model did not include information about biomarkers, such as NT-proBNP and hs-cTnT. However, because these biomarkers often require additional examination, which increases the difficulty of acquisition, omitting them keeps the model easy to apply while retaining good accuracy. Third, multiple imputation for the missing values is a potential source of bias. Nevertheless, multiple imputation is a commonly used rigorous technique for imputation (55).

CONCLUSION
A new LASSO-Cox model for predicting risk of all-cause mortality in patients with AF was successfully developed and internally validated. The LASSO-Cox model, using CHA2DS2-VASc score, statin medication, medical history (stroke, cancer), and four clinical examination parameters (KET, BUN, MAO, and CHE), performed well and may assist physicians in decision-making when treating AF patients.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
This study complied with the principles of the Declaration of Helsinki and was approved by the Ethics Committee of the First Affiliated Hospital of Shantou University Medical College. All participants provided written informed consent to participate in this study. All procedures were performed in conformity with European Society of Cardiology guidelines. All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Declaration of Helsinki of 1975, as revised in 2000.

AUTHOR CONTRIBUTIONS
SW, YQC, and XT: concept and design, data analysis and interpretation, critical revision of article, and approval. YC, MW, ZX, XN, and BW: statistics, data analysis, and drafting of article. YC, SW, JY, CC, YQC, and RL: data collection, data analysis, critical revision of article, and approval. All authors read and approved the final manuscript.