A Risk-Factor Model for Antineoplastic Drug-Induced Serious Adverse Events in Cancer Inpatients: A Retrospective Study Based on the Global Trigger Tool and Machine Learning

The objective of this study was to apply machine learning methods to evaluate the risk factors associated with serious adverse events (SAEs) and to predict the occurrence of SAEs in cancer inpatients receiving antineoplastic drugs. A retrospective review of the medical records of 499 patients diagnosed with cancer and admitted between January 1 and December 31, 2017, was performed. First, the Global Trigger Tool (GTT) was used to actively monitor adverse drug events (ADEs) and SAEs caused by antineoplastic drugs, and the number of positive triggers was taken as an intermediate variable. Subsequently, statistically significant risk factors were selected by univariate analysis and least absolute shrinkage and selection operator (LASSO) analysis. Finally, using the risk factors retained after the LASSO analysis as covariates, prediction models were established with a nomogram based on a logistic model, extreme gradient boosting (XGBoost), categorical boosting (CatBoost), adaptive boosting (AdaBoost), light gradient boosting machine (LightGBM), random forest (RF), gradient boosting decision tree (GBDT), decision tree (DT), and an ensemble model based on seven algorithms. Indicators including the area under the ROC curve (AUROC) and the area under the PR curve (AUPR) were used to evaluate model performance. A total of 94 SAE patients were identified in our sample. The risk factors for SAEs were the number of triggers, length of stay, age, number of combined drugs, ADEs in previous chemotherapy, and sex. In the test cohort, the nomogram based on the logistic model had an AUROC of 0.799 and an AUPR of 0.527. The GBDT had the best predictive ability (AUROC = 0.832 and AUPR = 0.557) among the eight machine learning models, outperformed the nomogram, and was chosen to build the prediction webpage. This study provides a novel method to accurately predict SAE occurrence in cancer inpatients.


INTRODUCTION
Cancer is a persistent challenge for public health worldwide. It has become the second leading cause of death after cardiovascular disease and seriously threatens human health. The statistical report released by the International Agency for Research on Cancer (IARC) in 2020 predicts that the global cancer burden will reach 29 million new cases per year by 2040, an increase of 62% over the estimated 18.1 million cases in 2018 (Wild Cp, 2020). As the most populous country in the world, China accounts for about 23% of new cancer cases and 30% of cancer deaths (Hyuna et al., 2021). A survey shows that the direct economic burden caused by cancer in China was $221.4 billion in 2015, which accounted for 5.4% of total health expenditure and 17.7% of the government's public health expenditure (Cai et al., 2017).
With the increasing incidence of cancer, research on cancer treatment has also deepened. A growing number of antineoplastic therapies, such as molecular targeted therapy and immunotherapy, have effectively controlled many cancers. However, drug-induced safety problems cannot be ignored: they not only affect patients' treatment but also cause some patients to interrupt treatment or even die because of serious adverse events (SAEs) caused by antineoplastic drugs (Zhiwei et al., 2019). Patients who receive chemotherapy in clinical practice have a higher frequency of SAEs than those in clinical trials, as reported in a systematic evaluation of lung cancer treatment (Prince et al., 2015). A retrospective study from Japan that investigated the types and frequencies of SAEs after oral antineoplastic drugs in outpatients found that SAEs usually occurred early after the start of treatment (Kenji et al., 2021). SAEs lead to deterioration in quality of life, increased healthcare costs, and earlier morbidity and mortality (Bates et al., 1995). Hence, SAEs in cancer patients are considered important events with high clinical relevance, and early identification and warning of individuals at risk of SAEs are particularly important.
The Global Trigger Tool (GTT), first proposed by the Institute for Healthcare Improvement (IHI) in 2003, is a commonly used method for identifying potential adverse drug events (ADEs) among cancer inpatients (Lipitz-Snyderman et al., 2017). Anne et al. (2020) described the ADEs of cancer patients with the GTT, with a positive predictive value (PPV) of 42%. Christin et al. (2017) used the GTT to investigate whether hospitalized cancer patients are at higher risk of ADEs than a general hospital population and found that higher age, longer length of stay, and surgical treatment were risk factors for ADEs in cancer inpatients compared with other patients. Although certain studies have reported a variety of predictive factors for ADEs, such as illness severity, increased age (>65 years), receiving more than five drugs, and length of hospital stay, the findings are partly contradictory (Simon et al., 2011; Paul et al., 2011; Chuenjid et al., 2013; Qiaozhi et al., 2020). The GTT has certain capabilities in detecting ADEs, but some studies have shown that it is not specific enough for studying harm to cancer patients and that its PPV is generally low and varies greatly between populations and medical centers (Otto et al., 2013).
Machine learning is a new artificial intelligence discipline that has been widely used to assist doctors in making objective judgments (de Mattos et al., 2022; Höppner, 2020). In this study, the GTT was first used to actively monitor the occurrence of ADEs and SAEs caused by antineoplastic drugs. Machine learning methods were then used to explore the risk factors of SAEs caused by antineoplastic drugs and to construct predictive models, to compensate for the limited performance of the GTT. Our study aims to establish a machine learning model that quantitatively predicts the probability and degree of SAEs of antineoplastic drugs, providing a risk prediction tool for clinical work so that effective measures can be taken.

MATERIALS AND METHODS

Study Participants
A retrospective medical record review was performed for a random sample of 600 inpatients (50 per month) discharged from Chongqing Cancer Hospital between January 1, 2017, and December 31, 2017. The inclusion criteria were a diagnosis of cancer, a length of stay of >2 days and ≤30 days, and use of antineoplastic drugs during hospitalization. The exclusion criteria were no antineoplastic drug exposure and use of traditional Chinese medicine to treat cancer.
This study is a retrospective study and the patients' informed consent is not required. The protocol of this study has been approved by the Ethics Committee of Chongqing University Cancer Hospital (CZLS2022008-A) and the Ethics Committee of Chongqing Medical University.

Positive Cases
First, the GTT method was used to detect the occurrence of ADEs. Subsequently, two pharmacists were assigned to examine the data and determine the occurrence of ADEs. Disagreements were resolved by a senior pharmacist, who made the final decision. Finally, SAE patients were selected from all ADE patients according to the Common Terminology Criteria for Adverse Events (CTCAE) version 5.0, and events of grades 3-5 were defined as SAEs (National Institutes Of Health and National Cancer Institute, 2017).
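The labeling rule described above can be sketched in a few lines of Python. This is an illustration only: the record layout (a patient identifier mapped to the list of CTCAE grades of that patient's confirmed ADEs) and the function name `label_sae` are hypothetical, and only the grade 3-5 threshold is taken from the text.

```python
# Illustrative sketch: the data layout and names are assumptions,
# not the study's actual extraction pipeline.
SAE_MIN_GRADE = 3  # CTCAE grades 3-5 are treated as serious


def label_sae(ade_grades_by_patient):
    """Return {patient_id: 1 if any confirmed ADE has grade >= 3, else 0}."""
    labels = {}
    for patient_id, grades in ade_grades_by_patient.items():
        for grade in grades:
            if not 1 <= grade <= 5:
                raise ValueError(f"invalid CTCAE grade {grade} for {patient_id}")
        labels[patient_id] = int(any(g >= SAE_MIN_GRADE for g in grades))
    return labels


# Hypothetical records: P2 has no confirmed ADE, P3 has one grade 4 event.
example = {"P1": [1, 2], "P2": [], "P3": [2, 4]}
sae_labels = label_sae(example)
```

A patient without any confirmed ADE is labeled 0 here, which matches the text's two-step logic of selecting SAE patients from among ADE patients.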

Candidate Predictors
The SAE risk factors were screened from multiple patient characteristics according to the results of previous research (Simon et al., 2011; Paul et al., 2011; Chuenjid et al., 2013; Qiaozhi et al., 2020). Specifically, we included the patients' demographic information (such as sex, age, and weight), disease situation (such as cancer type and cancer stage), treatment information (such as number of antineoplastic drugs and number of combined drugs), and the number of GTT triggers. The occurrence of SAEs was used as the target variable to analyze which characteristics significantly influenced it.

Statistical Analysis
The whole dataset was randomly divided into training and test cohorts at a ratio of 8:2, and the test cohort was used to verify the performance of the models. All statistical computing was conducted in R for Windows (version 4.0.5, https://www.r-project.org/) and SPSS 25.0 (IBM Corporation, Armonk, NY, USA). p < 0.05 was considered statistically significant. Data were presented as counts with percentages for categorical variables and as medians with interquartile ranges or means with standard deviations for continuous variables. The Mann-Whitney U-test or t-test was performed for continuous variables, and the Chi-square test for categorical variables. Least absolute shrinkage and selection operator (LASSO) analysis was then carried out to further screen the variables retained by the univariate analysis for their effect on the occurrence of SAEs. Subsequently, using the variables retained after the LASSO analysis as covariates, prediction models were established with the nomogram based on the logistic model, extreme gradient boosting (XGBoost), categorical boosting (CatBoost), adaptive boosting (AdaBoost), light gradient boosting machine (LightGBM), random forest (RF), gradient boosting decision tree (GBDT), and decision tree (DT) algorithms, and an ensemble model based on seven machine learning algorithms. Precision, recall, F1, sensitivity (SEN), specificity (SPE), area under the PR curve (AUPR), and area under the ROC curve (AUROC) were used to evaluate predictive ability. The formulas of the evaluation indicators are given in our previous research (Ze et al., 2021). In parallel, we performed a logistic regression analysis on the results of the univariate analysis, established a nomogram, and compared it with the machine learning models. Ultimately, the algorithm with the best performance was selected to establish the model for predicting the occurrence of SAEs.
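For readers without access to the earlier paper, the threshold-based indicators listed above can be computed from the confusion matrix using their standard definitions. The sketch below is written in Python rather than the R/SPSS used in the study, and the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall (= sensitivity), specificity and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # sensitivity (SEN)
    specificity = ratio(tn, tn + fp)     # SPE
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Toy example (not study data): 3 SAE and 5 non-SAE patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Recall and sensitivity are the same quantity, which is why a single value serves for both in this sketch.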

RESULTS

Study Population
The hospital had 43,663 medical records from January to December 2017. According to the inclusion and exclusion criteria, a total of 499 patients (cases) were included in this study. The specific screening process and study protocol are shown in Figure 1.
In the process of SAE identification, we established 33 kinds of triggers, among which 30 triggers were positive (90.91%) in our study. A total of 620 ADEs were identified from the 30 triggers. Among the  Table 1.
In the whole cohort, the mean age of patients was 53.97 ± 11.91 years (range, 13-88 years); females accounted for 61.32% (306 cases) and males for 38.68% (193 cases). The mean length of stay was 9.32 ± 5.07 days (range, 3-30 days). The most common type of cancer was breast cancer (121 cases, 24.25%), followed by lung cancer (102 cases, 20.44%) and lymphoma (56 cases, 11.22%). Most patients had stage Ⅲ~Ⅳ cancer (326 cases, 65.33%) and Karnofsky performance status (KPS) scores above 70 before chemotherapy (449 cases, 89.98%). The relationships of these factors with the occurrence of SAEs are screened further in the following sections. According to Table 2, 27 suspected drugs led to SAEs and were used a total of 683 times; plant-derived agents and other derivatives accounted for the largest proportion of suspected drugs (31.77%), followed by platinum compounds (24.16%), alkylating agents (16.40%), antineoplastic antibiotics (15.08%), antimetabolites (12.15%), and molecular targeted drugs (0.44%).

SAEs and Risk Factors
According to Table 3, there were no significant differences between the training and test cohorts (p > 0.05), except for sex and radiation therapy (p < 0.05). Univariate analysis indicated that eight variables differed significantly between the SAE and non-SAE groups in the training cohort: sex, cancer type, ADEs in previous chemotherapy, age, length of stay, number of previous chemotherapies, number of combined drugs, and number of triggers. The other eight variables were not statistically significant. We used LASSO analysis to further screen the variables retained by the univariate analysis, to avoid collinearity and simplify the model. The result suggested that the log of the optimal value of lambda was 6 (Figure 2). Thus, six variables were selected as machine learning model predictors: sex, ADEs in previous chemotherapy, age, length of stay, number of previous chemotherapies, and number of triggers.
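For reference, LASSO screening for a binary outcome is usually formulated as an L1-penalized logistic regression. The standard objective is written out below; the notation (n patients, p candidate variables, covariate vector x_i, outcome y_i in {0, 1}, penalty weight lambda) is ours, and the text does not state which implementation or tuning rule was used.

```latex
(\hat{\beta}_0, \hat{\beta}) =
\arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n}\sum_{i=1}^{n}
  \left[
    y_i\left(\beta_0 + x_i^{\top}\beta\right)
    - \log\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right)
  \right]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

As lambda increases, more coefficients are shrunk exactly to zero, so the variables with non-zero coefficients at the chosen lambda are the ones retained as predictors.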

Logistic Model and Nomogram Establishment
To build the risk-factor model, the six statistically significant variables were used as input variables, and whether SAEs occurred after the use of antineoplastic drugs was the outcome event (yes = 1, no = 0). Forward stepwise logistic regression retained age, length of stay, and number of triggers in the final model (Table 4). We drew a nomogram based on these three indicators (Figure 3); summing the points for each indicator gives the predicted probability of SAE occurrence. The test cohort was used to verify the performance of the nomogram. In the test cohort, the Brier score of the nomogram was 0.189, the AUPR was 0.527, and the AUROC was 0.779 (Figure 4), indicating that the model performed well.
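A nomogram is a graphical form of the logistic model: the summed points correspond to the linear predictor, which the logistic function converts to a probability. The sketch below illustrates that conversion and the Brier score (the mean squared difference between predicted probabilities and observed outcomes). The intercept and coefficients are made-up placeholders, not the values in Table 4, and the patient records are hypothetical.

```python
import math

# Hypothetical coefficients for illustration only; NOT the values in Table 4.
INTERCEPT = -4.0
COEFS = {"age": 0.02, "length_of_stay": 0.08, "n_triggers": 0.60}


def sae_probability(patient):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    linear = INTERCEPT + sum(COEFS[name] * patient[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-linear))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Two hypothetical patients: a short stay with no positive trigger,
# and a longer stay with four positive triggers.
patients = [
    {"age": 50, "length_of_stay": 5, "n_triggers": 0},
    {"age": 65, "length_of_stay": 14, "n_triggers": 4},
]
probs = [sae_probability(p) for p in patients]
score = brier_score([0, 1], probs)
```

A Brier score of 0 indicates perfect probabilistic predictions, and an uninformative constant prediction of 0.5 gives 0.25, which offers a rough yardstick for reading the reported value.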

Machine Learning Model Establishment and Comparison
Table 5 compares the eight models in the test cohort in terms of SEN, SPE, AUROC, AUPR, and other metrics. Among the eight models, the GBDT had the highest precision (0.621) and the highest F1 (0.667), with a moderate recall (0.720). The ROC curves are compared in Figure 5, where the GBDT model achieved the highest AUROC (0.832), higher than the nomogram's AUROC of 0.799. The SPE of the GBDT model was 0.853, suggesting that it also performs well in identifying SAE-negative patients. Figure 6 shows the PR curves of the eight models; the GBDT model also outperformed the other seven models, with an AUPR of 0.557. The GBDT model thus outperformed the other models in precision, F1, AUPR, and AUROC, demonstrating good predictive ability. Considering overall predictive performance, we chose the GBDT model to predict the occurrence of SAEs. In the GBDT model, the six variables ranked by importance as follows: number of triggers, age, number of combined drugs, length of stay, ADEs in previous chemotherapy, and sex (Figure 7). In addition, our web-based SAE risk prediction calculator using the GBDT model can be accessed at https://cqmugj.shinyapps.io/SAEs_diagnostic__tools/.
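The two ranking-based indicators used in this comparison can be computed directly from predicted scores. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC and average precision as one common estimate of AUPR. How the study computed AUPR is not stated, so average precision is an assumption here, and the labels and scores are toy values.

```python
def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive case.

    Assumes no tied scores; ties would make the ranking order arbitrary.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` cases
    return total / n_pos


# Toy example (not study data).
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
roc_value = auroc(y_true, y_score)
ap_value = average_precision(y_true, y_score)
```

Unlike AUROC, the baseline for AUPR is the prevalence of positives rather than 0.5. With 94 SAE patients among 499 (about 19%), an AUPR near 0.56 is therefore well above chance even though it looks modest beside the AUROC.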

DISCUSSION
Electronic medical records have developed from data storage to data utilization and can potentially guide clinical decision-making and predict important outcomes (Ibrahim et al., 2020). Using electronic medical records and machine learning algorithms to predict the occurrence of SAEs of antineoplastic drugs is a low-cost, feasible, and effective approach. We first made a preliminary analysis of ADEs of antineoplastic drugs with the GTT method using the data of 499 cancer inpatients, and SAEs were identified among the patients with ADEs. We then constructed probability prediction models of SAEs in cancer inpatients using the nomogram and machine learning methods so that clinical staff can intervene in time. We observed that the risk factors for SAEs in cancer inpatients were the number of triggers, length of stay, age, number of previous chemotherapies, ADEs in previous chemotherapy, and sex. Similar to the study of Ze et al. (2021), our study introduced the number of triggers as a variable in the prediction model, and we found it to be the most important risk factor. Increasing the number of triggers could better predict the probability of SAEs of antineoplastic drugs. GTT studies are characterized by great methodological heterogeneity because the GTT is typically adapted to the local context by removing modules (Härkänen, 2014; Doupi et al., 2015; Hibbert et al., 2016; Xiao-Di et al., 2016; Jee-In et al., 2018), adding triggers and specific definitions (Lau and Kirkwood, 2014), or adding new modules before implementation. A German study focusing on ADE identification in surgery and neurosurgery showed that new triggers should be added during ADE identification to adapt to a new environment (Mareen et al., 2019). Therefore, we suggest that the number of triggers be combined with other important risk factors to better predict SAEs.
We also confirmed three risk factors: length of stay, age, and sex. These three risk factors have been reported in previous studies (Nazer et al., 2014; Christin et al., 2017; Weingart et al., 2020). Previous researchers have shown a strong correlation between the length of stay and the incidence of ADEs (Classen et al., 2011; Sezgin et al., 2013). The risk of ADEs increases by 5.1% with each additional day of stay (Christin et al., 2017). However, the length of stay is usually affected by other factors, such as the severity of the disease. Moreover, a longer stay may itself be a result of the occurrence of SAEs. Therefore, the causal relationship between the length of stay and SAEs needs to be evaluated further. In addition, age is also an important risk factor. This may be related to the greater variety of drugs used in younger patients. In our study, the number of previous chemotherapies and the number of combined drugs were higher in younger patients than in older patients (Andrew and Lisa, 2012). It should be noted that, in the field of drug treatment and drug delivery, some investigators have found that sex differences can influence pharmacokinetics, pharmacodynamics, and drug toxicity (Bernd et al., 2002; Janice, 2003). However, although sex was a risk factor for SAEs in cancer inpatients in this study, existing studies are inconclusive on this point. Therefore, further research is required on this factor.
Of note, we also found that the number of previous chemotherapies and ADEs in previous chemotherapy were risk factors for SAEs in cancer inpatients. A potential reason for the positive correlation between SAEs and these two factors is that both lead to a worse physical state, and SAEs are more likely to occur in patients in poor physical condition (Ekkamol et al., 2018).
In terms of overall performance, the logistic-based nomogram did not perform as well as the models based on machine learning algorithms. Logistic regression is widely used in the medical field to explore the risk factors of diseases because of its strong interpretability. The transparency of a nomogram built on a logistic model avoids the black box problem of machine learning models, but the logistic model is prone to underfitting, and its overall performance is limited. In addition, the machine learning models used more predictors than the nomogram in this study, which may be one of the reasons why they performed better. Machine learning is an emerging artificial intelligence discipline that can describe complex non-linear relationships between independent and dependent variables, which gives it strong predictive ability (Fabrizio et al., 2021). In our study, the AUROC values of all algorithms other than the DT exceeded 0.7, indicating good predictive ability. The DT is a traditional machine learning algorithm that builds a classification model based on the information gain of the predictors, so it is optimal in terms of model interpretability (Höppner, 2020). However, the decision tree algorithm is prone to overfitting and to getting stuck in local optima, and many studies have shown that it does not perform as well as other algorithms. Compared with the other machine learning models, the GBDT had the best overall performance, with an AUROC of 0.832 (0.744, 0.920) and an AUPR of 0.557. A possible reason is that only six predictors were finally included in this study, and the GBDT algorithm has clear advantages over the other machine learning algorithms in dealing with low-dimensional and non-linear data (Yuhui et al., 2022).
In addition, LightGBM had the highest SEN (0.840) and AdaBoost had the highest SPE (0.863), suggesting advantages in identifying positive and negative cases, respectively. Furthermore, we built an ensemble learning model combining the results of the seven algorithms, with an AUROC of 0.797 (0.694, 0.899) and an AUPR of 0.557. By combining multiple learners, ensemble learning can achieve better generalization performance than a single learner, and it also achieved good results on our dataset (Makoto et al., 2021; Menglin et al., 2021). In this study, we established a prediction model for SAEs in cancer inpatients using antineoplastic drugs. Researchers can incorporate the risk factors identified in our study into web pages to determine the probability of SAE occurrence in cancer inpatients. However, this study also has some limitations. It was a retrospective study and may lack some valuable features, which limits the selection of variables for modeling. Furthermore, it was a single-center study with a small sample, so the model's predictions have not been externally validated in multi-center, large-sample data. In the future, a large-scale, multi-center, prospective study is needed for verification.
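The text does not specify how the outputs of the seven algorithms were combined in the ensemble model. As one plausible scheme, and purely as an assumption, the sketch below shows unweighted soft voting, in which each patient's predicted probabilities are averaged across models. The model names and probabilities are illustrative only.

```python
def soft_vote(prob_by_model):
    """Average per-patient probabilities across models (unweighted soft voting)."""
    if not prob_by_model:
        raise ValueError("at least one model is required")
    lengths = {len(probs) for probs in prob_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same patients")
    n_patients = lengths.pop()
    n_models = len(prob_by_model)
    return [
        sum(probs[i] for probs in prob_by_model.values()) / n_models
        for i in range(n_patients)
    ]


# Hypothetical predicted SAE probabilities for two patients from three models.
predictions = {
    "GBDT": [0.80, 0.10],
    "RF": [0.60, 0.30],
    "XGBoost": [0.70, 0.20],
}
ensemble_probs = soft_vote(predictions)
# The 0.5 decision threshold is also an assumption, not taken from the study.
ensemble_labels = [int(p >= 0.5) for p in ensemble_probs]
```

Averaging smooths out the errors of individual learners, but it can also dilute the strongest single model. That would be consistent with the ensemble AUROC reported above (0.797) being lower than that of the GBDT alone (0.832).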

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
This study is a retrospective study, and patients' informed consent was not required. This study was approved by the Ethics Committee of Chongqing University Cancer Hospital (CZLS2022008-A) and the Ethics Committee of Chongqing Medical University.