A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population

Objective: We explored the risk factors for intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) and constructed a prediction model based on machine learning algorithms. Methods: A retrospective study including 1,398 KD patients hospitalized in 7 affiliated hospitals of Chongqing Medical University from January 2015 to August 2020 was conducted. All patients were divided into IVIG-responsive and IVIG-resistant groups, which were randomly divided into training and validation sets. The independent risk factors were determined using logistic regression analysis. Logistic regression nomograms, support vector machine (SVM), XGBoost and LightGBM prediction models were constructed and compared with the previous models. Results: In total, 1,240 out of 1,398 patients were IVIG responders, while 158 were resistant to IVIG. According to the results of logistic regression analysis of the training set, four independent risk factors were identified, including total bilirubin (TBIL) (OR = 1.115, 95% CI 1.067–1.165), procalcitonin (PCT) (OR = 1.511, 95% CI 1.270–1.798), alanine aminotransferase (ALT) (OR = 1.013, 95% CI 1.008–1.018) and platelet count (PLT) (OR = 0.998, 95% CI 0.996–1). Logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models were constructed based on the above independent risk factors. The sensitivity was 0.617, 0.681, 0.638, and 0.702, the specificity was 0.712, 0.841, 0.967, and 0.903, and the area under curve (AUC) was 0.731, 0.814, 0.804, and 0.874, respectively. Among the prediction models, the LightGBM model displayed the best ability for comprehensive prediction, with an AUC of 0.874, which surpassed the previous classic models of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575). Conclusion: The machine learning LightGBM prediction model for IVIG-resistant KD patients was superior to previous models. 
Our findings may help identify patients at risk of IVIG resistance early and improve their outcomes.


INTRODUCTION
Kawasaki disease (KD) is an acute vasculitis with bilateral conjunctival inflammation and atypical rash as the main clinical features. It mainly occurs in children under 5 years of age (1). The main complication of KD is coronary artery lesions (CALs), which are the main reason for the increase in the incidence of acquired heart disease in children (2). Prompt treatment with high-dose (2 g/kg) intravenous immunoglobulin (IVIG) can significantly reduce the manifestations of KD and CALs. However, 10–20% of KD patients are resistant to IVIG (3,4). After initial IVIG administration, recrudescent or persistent fever may occur, and further treatment, such as a second administration of IVIG and corticosteroids, is required at 48 h after the initial use of IVIG (5). Therefore, it is of great significance to accurately identify IVIG-resistant KD patients and implement appropriate regimens early. In the past 10 years, many studies have addressed IVIG resistance. Egami (6), Kobayashi (7), and Sano (8) constructed three scoring systems based on the characteristics of the Japanese population. Fu (9) retrospectively analyzed 1,177 KD patients and established a prediction model for Beijing children. In 2015, Lin et al. collected data from 248 KD children and constructed the Taiwanese Formosa scoring system (10). Although the abovementioned scoring systems performed well in their respective study populations, their prediction performance in Chongqing is poor (Table 1) (11,12), owing to differences in genetic susceptibility, which precludes their wide application in the early prediction of IVIG resistance in Chongqing. It remains a challenge to develop a new prediction model with better predictive performance for children in Chongqing, one of the largest cities in western China.
In recent years, with the rapid development of machine learning algorithms and model interpretation methods, machine learning has been applied in many fields and has shown great potential in assisting clinical diagnosis (13–15). This study retrospectively analyzed the clinical data of 1,398 KD patients on the medical big data platform of Chongqing Medical University from January 2015 to August 2020 and applied machine learning algorithms to construct prediction models for IVIG resistance, with the aim of developing a prediction model better suited to IVIG-resistant KD in the Chongqing area.

MATERIALS AND METHODS
Patients
The data came from the medical big data platform of Chongqing Medical University, which contains the electronic medical record data of 7 medical institutions affiliated with Chongqing Medical University. According to the inclusion and exclusion criteria, the inpatient electronic medical records of 1,398 patients diagnosed with KD who received treatment at these institutions from January 2015 to August 2020 were selected.

Definition and Data Collection
IVIG-resistant KD was defined as persistent or recurrent fever ≥38°C at any time from 36 h to 2 weeks after initial IVIG treatment, accompanied by one or more of the main symptoms (17).
The presence of coronary artery lesions was defined as coronary artery diameter ≥2.5 mm in patients aged 0-3 years old, ≥3.0 mm in patients aged 3-9 years old and ≥3.5 mm in patients aged older than 9 years (18).
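The age-dependent definition above can be written as a small lookup. The following is a minimal sketch only: the function names are hypothetical, and because the stated age bands share the boundary at 3 years, the sketch assumes that a child aged exactly 3 years falls into the 3-9 year band (the text assigns only children older than 9 years to the 3.5 mm cutoff).

```python
def cal_threshold_mm(age_years: float) -> float:
    """Return the coronary artery diameter cutoff (mm) for a given age.

    Assumption: age exactly 3 years is treated as part of the 3-9 year band.
    """
    if age_years < 3:
        return 2.5
    if age_years <= 9:
        return 3.0
    return 3.5  # older than 9 years


def has_cal(age_years: float, diameter_mm: float) -> bool:
    """True if the measured diameter meets or exceeds the age-specific cutoff."""
    return diameter_mm >= cal_threshold_mm(age_years)
```

For example, a 2-year-old with a 2.5 mm coronary artery diameter would be flagged, whereas a 5-year-old with a 2.9 mm diameter would not.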
All demographic characteristics, clinical features, imaging data and laboratory data prior to the initial use of IVIG were collected. The demographic characteristics included age (months) and sex; clinical features included days of illness at the initial treatment, maximum body temperature, cervical lymphadenopathy, conjunctival hyperemia, lip changes, rash, perianal changes, and edema of the hands and feet; imaging data prior to the initial use of IVIG included the presence of CALs. The laboratory data included blood cell analysis: neutrophil count, white blood cell count (WBC), lymphocyte count, platelet count (PLT), hemoglobin (HB), percentage of neutrophils, neutrophil-to-lymphocyte ratio (NLR), and platelet-to-lymphocyte ratio (PLR); and biochemical examination: lactic dehydrogenase (LDH), total bilirubin (TBIL), globulin, albumin, alanine aminotransferase (ALT), gamma-glutamyl transpeptidase (GGT), aspartate aminotransferase (AST), procalcitonin (PCT), serum sodium, serum potassium, C-reactive protein (CRP), and erythrocyte sedimentation rate (ESR) (Table 2).

Statistical Analysis
Statistical analysis was performed with SPSS version 25.0. Categorical variables were described as frequency (percentage), and the χ² test was used to analyze differences between the IVIG-responsive and IVIG-resistant groups. Since all continuous variables were non-normally distributed, they were presented as median (interquartile range) and compared using the Mann-Whitney U test. P < 0.05 was considered statistically significant. The variables that were statistically significant in the univariate analysis were included in the logistic regression analysis to further screen out independent risk factors for IVIG resistance, and the selected risk factors were incorporated into the machine learning models to establish the IVIG-resistant KD prediction model. Sensitivity, specificity, and area under the curve (AUC) were used to evaluate the prediction performance of the models.
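To make the three evaluation indicators concrete, the sketch below computes sensitivity, specificity, and AUC in plain Python. It is an illustration under stated assumptions, not the study's code: the function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical. The AUC is computed through its rank interpretation, that is, the probability that a randomly chosen IVIG-resistant patient receives a higher score than a randomly chosen IVIG-responsive patient, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = IVIG-resistant, 0 = IVIG-responsive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the three indicators are reported together.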

Machine Learning Algorithm Prediction Model Construction
We used the computer-generated random number method to divide 1,398 KD patients into a training set (979 cases) and a test set (419 cases) at a ratio of 7:3. The training set was used for model training, and the test set was used to verify the generalization ability of the models. Using the "univariate analysis + logistic regression analysis" method to screen variables from the training set, we built a logistic regression nomogram, support vector machine (SVM), XGBoost, and LightGBM machine learning algorithm prediction models. The rms package of R language (R version 3.6.3) was used to build a logistic regression nomogram.
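The 7:3 random split can be sketched as follows. This is a minimal illustration, assuming a simple shuffle of patient identifiers; the function name, the seed, and the rounding rule are assumptions, and the text does not state whether the split was stratified by IVIG response.

```python
import random


def split_train_test(ids, train_fraction=0.7, seed=0):
    """Shuffle identifiers with a fixed seed and split them at train_fraction."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


patient_ids = list(range(1398))  # placeholder identifiers, not real records
train_ids, test_ids = split_train_test(patient_ids)
```

With 1,398 patients and a training fraction of 0.7, rounding yields the 979 and 419 cases reported above.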
The Scikit-learn package was adopted in the Python 3.6.5 environment to implement the SVM and XGBoost prediction models. SVM training process: the penalty term coefficient C was set to 0.5, a linear kernel function was used, and the hyperplane that maximizes the separation of the two groups was found to obtain the best prediction model. XGBoost model training process: the learning rate was set to 0.2, and the depth of the tree was set to 3. The Python LightGBM package was used to build the LightGBM prediction model: the learning rate was set to 0.02, and the maximum depth of the tree was 6 (the parameter training process is provided as an attachment). Because of the severe imbalance between the IVIG-responsive and IVIG-resistant groups, we used the class_weight parameter to adjust the weights of the positive and negative samples in the classifier (19), increasing the importance of the minority class and improving the classification performance of the model.
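The effect of class weighting can be illustrated with a short calculation. The weights actually used in the study are not stated here, so the sketch below assumes the common "balanced" heuristic, in which each class receives the weight n_samples / (n_classes × class count); this is the formula Scikit-learn documents for class_weight="balanced". The function name is hypothetical, and the class counts are the training-set counts reported in the results.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = len(labels)
    classes = sorted(set(labels))
    return {c: n_samples / (len(classes) * labels.count(c)) for c in classes}


# Training-set composition: 111 IVIG-resistant (1), 868 IVIG-responsive (0).
train_labels = [1] * 111 + [0] * 868
weights = balanced_class_weights(train_labels)
```

Under this assumption the minority (IVIG-resistant) class receives a weight of 979/222 and the majority class 979/1736, so that both classes contribute the same total weight to the loss.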

RESULTS
Analysis of Risk Factors for IVIG Resistance in the Training Set
Among the 979 patients in the training set, 111 were in the IVIG-resistant group and 868 were in the IVIG-responsive group. According to univariate analysis (Table 3), 14 variables differed significantly between the two groups (P < 0.05): days of illness at the initial treatment, lymphocyte count, PLT, HB, percentage of neutrophils, NLR, TBIL, globulin, albumin, ALT, GGT, AST, PCT, and serum sodium. There was no significant difference in the remaining 17 variables (P > 0.05). The 14 statistically significant variables were used as independent variables, and the occurrence of IVIG resistance was used as the dependent variable (Yes = 1, No = 0) in a logistic regression analysis (α_in = 0.05, α_out = 0.10). The results showed that four variables, TBIL, PCT, ALT, and PLT, were statistically significant (P < 0.05) and were independent risk factors for IVIG resistance in KD patients (Table 4).
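Because the odds ratios from logistic regression are expressed per one-unit change in each variable, values close to 1 (such as those for ALT and PLT) can still correspond to sizeable effects over clinically realistic ranges. The sketch below rescales the reported odds ratios to a larger change, using the fact that log-odds are linear in the predictor under the logistic model. The function name and the 100-unit change are illustrative assumptions, and the measurement units of each variable are not restated here.

```python
import math

# Per-unit odds ratios reported for the four independent risk factors.
odds_ratios = {"TBIL": 1.115, "PCT": 1.511, "ALT": 1.013, "PLT": 0.998}


def odds_ratio_for_change(or_per_unit, delta):
    """Odds ratio for a change of `delta` units, given the per-unit odds ratio.

    Under a logistic model, log-odds change by log(OR) per unit, so the
    odds ratio for `delta` units is exp(delta * log(OR)) = OR ** delta.
    """
    return math.exp(delta * math.log(or_per_unit))


# Illustrative only: odds ratio associated with a 100-unit higher PLT.
plt_or_100 = odds_ratio_for_change(odds_ratios["PLT"], 100)
```

Under these assumptions, a 100-unit higher platelet count corresponds to an odds ratio of roughly 0.82, which is consistent with lower platelet counts being associated with IVIG resistance.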

Logistic Regression Nomogram and Machine Learning Model Construction
The four variables screened by logistic regression analysis (TBIL, PCT, ALT, and PLT) were used to construct the logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models. The logistic regression nomogram is shown in Figure 1, and Figure 2 shows its consistency analysis, which indicates that the model has good stability. Figure 3 shows the ROC curves and AUC results of the four prediction models, and Table 5 lists their evaluation indicators. The LightGBM model had the highest sensitivity (0.702) and AUC (0.874). The model with the highest specificity was the XGBoost model (specificity = 0.967). Considering sensitivity, specificity, and AUC together, the LightGBM model achieved the best overall predictive performance.

Comparison With the Previous Scoring Systems
Compared with the previous IVIG resistance scoring systems, the AUC value of the LightGBM model (AUC = 0.874) was higher than those of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575). The Formosa scoring system had the highest sensitivity (sensitivity = 0.762) but low specificity (specificity = 0.393). The scoring system with the highest specificity was Egami (specificity = 0.931). Considering sensitivity, specificity, and AUC together, the LightGBM model had the best predictive performance (Table 6).

DISCUSSION
The main complication of KD is coronary artery lesions, which have gradually replaced rheumatic fever as the main cause of acquired heart disease in children. Currently, the treatment of KD mainly depends on high-dose IVIG; however, IVIG-resistant KD does not respond to IVIG, and additional treatment given after the initial use of IVIG cannot quickly and effectively reduce vascular inflammation. Therefore, there is an urgent need for a prediction model for IVIG-resistant KD with high predictive ability for the specific population of the Chongqing area. Here, we reviewed 1,398 KD patients in 7 medical institutions affiliated with Chongqing Medical University. Logistic regression, which is highly interpretable, was used to screen variables, and four risk factors for IVIG resistance were identified: TBIL, PCT, ALT, and PLT. Several machine learning algorithms were then applied to build more complex prediction models. These models performed well in sensitivity, specificity and AUC and appear to be superior to previous models when applied to the Chongqing KD population.
In the past decade, logistic regression has been the first choice for building IVIG resistance prediction models owing to its simplicity and strong interpretability. The Kobayashi score, Egami score, Formosa score and most other predictive scores were based on logistic regression. When the classification boundary is linear, the logistic regression model predicts well (20), but it often performs poorly on high-dimensional, large-volume data. With the development of artificial intelligence, an increasing number of machine learning algorithms have become available, including the traditional K-nearest neighbors, decision tree, and SVM algorithms and the emerging XGBoost and LightGBM models. These machine learning models offer excellent performance in processing high-dimensional and large-volume data. An increasing number of scholars are applying machine learning algorithms to clinical research (21–23). In this study, logistic regression nomograms and SVM, XGBoost, and LightGBM algorithms were used to construct IVIG resistance prediction models. Among the constructed models, the LightGBM model had the best comprehensive predictive performance, with a sensitivity of 0.702, a specificity of 0.903, and an AUC value of 0.874. The LightGBM algorithm is a gradient boosting framework based on the decision tree algorithm released by Microsoft Research Asia in 2017. It uses a leaf-wise tree growth algorithm with depth restrictions instead of the level-wise algorithm used by XGBoost. With the same number of splits, leaf-wise growth can reduce more errors, so the LightGBM algorithm can achieve better accuracy than other models (24).
In terms of variable screening, TBIL was identified in our study as a risk factor for IVIG resistance. This is consistent with the findings of Sano et al. (8), who used total bilirubin ≥0.9 mg/dL as a predictor of IVIG resistance. The increase in TBIL in KD patients in the IVIG-resistant group may be related to acute hepatic vascular inflammation, which leads to hepatic vascular congestion and liver cell damage. Several large-scale cross-sectional studies have revealed a strong correlation between the presence of cardiovascular disease and the concentration of serum total bilirubin. Schwertner et al. (25) observed for the first time that there was a significant negative correlation between serum total bilirubin and the prevalence of coronary ischemic disease. Subsequently, studies by Hopkins (26) and Breimer (27) successively confirmed this conclusion. Egami (6) proposed that ALT ≥80 IU/L is an important risk factor for IVIG resistance. Liu (28) also showed that KD patients with higher ALT levels are more likely to develop IVIG resistance. This is consistent with our results. PCT is a common serum marker of inflammation, and it increases in severe bacterial infections. Recent research has shown that PCT concentrations below 0.25 ng/ml may help distinguish KD from sepsis, and PCT concentrations of 0.25–0.50 ng/ml may help predict IVIG resistance (29). PLTs are the first responders to vascular injury and endothelial rupture, but studies have shown that PLTs are also inflammatory effector cells, with activities ranging from acute inflammation to adaptive immunity (30). There are a large number of receptors on the surface of PLTs, and these receptors often interact with other cells (WBCs and endothelial cells). In vitro experiments showed that human neutrophils partially rely on platelets to enhance fibrin deposition in the bloodstream (31). In this study, the lower platelet counts in KD patients in the IVIG-resistant group may be related to the continuous consumption of platelets due to coronary artery lesions.
In summary, this study used logistic regression, chosen for its interpretability, to screen independent risk variables and constructed logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models for IVIG resistance. The new models exhibited better prediction performance than the previous models and could, in theory, be widely applied, but this study also has some limitations. First, this study was a retrospective analysis, and the results need to be further verified by prospective studies. Second, some data items were missing, which might have introduced bias into the statistical analysis. In future work, we will conduct prospective studies and collect more data to improve the screening of risk factors for IVIG resistance and the model construction and to further evaluate the effectiveness of the new models.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
JL and XH designed the study and analyzed the data. JL, JZ, HH, YW, ZZ, and YM acquired the data. JL and XH drafted the manuscript. JZ, HH, YW, ZZ, and YM read and revised the manuscript. All authors contributed to the study data and the writing and approved the final version of the manuscript. All financial expenditure for this study came from the above programs.