An integrated machine learning predictive scheme for longitudinal laboratory data to evaluate the factors determining renal function changes in patients with different chronic kidney disease stages

Background and objectives Chronic kidney disease (CKD) is a global health concern. This study aims to identify key factors associated with renal function changes using the proposed machine learning and important variable selection (ML&IVS) scheme on longitudinal laboratory data. The goal is to predict changes in the estimated glomerular filtration rate (eGFR) in a cohort of patients with CKD stages 3–5. Design A retrospective cohort study. Setting and participants A total of 710 outpatients who presented with stable nondialysis-dependent CKD stages 3–5 at the Shin-Kong Wu Ho-Su Memorial Hospital Medical Center from 2016 to 2021. Methods This study analyzed trimonthly laboratory data including 47 indicators. The proposed scheme used stochastic gradient boosting, multivariate adaptive regression splines, random forest, eXtreme gradient boosting, and light gradient boosting machine algorithms to evaluate the important factors for predicting the results of the fourth eGFR examination, especially in patients with CKD stage 3 and those with CKD stages 4–5, with or without diabetes mellitus (DM). Main outcome measurement Subsequent eGFR level after three consecutive laboratory data assessments. Results Our ML&IVS scheme demonstrated superior predictive capabilities and identified significant factors contributing to renal function changes in various CKD groups. The latest levels of eGFR, blood urea nitrogen (BUN), proteinuria, sodium, and systolic blood pressure as well as mean levels of eGFR, BUN, proteinuria, and triglyceride were the top 10 significantly important factors for predicting the subsequent eGFR level in patients with CKD stages 3–5. In individuals with DM, the latest levels of BUN and proteinuria, mean levels of phosphate and proteinuria, and variations in diastolic blood pressure levels emerged as important factors for predicting the decline of renal function. In individuals without DM, all phosphate patterns and latest albumin levels were found to be key factors in the advanced CKD group. Moreover, proteinuria was identified as an important factor in the CKD stage 3 group without DM and CKD stages 4–5 group with DM. Conclusion The proposed scheme highlighted factors associated with renal function changes in different CKD conditions, offering valuable insights to physicians for raising awareness about renal function changes.


Introduction
Chronic kidney disease (CKD), characterized by decreased glomerular filtration rate, is a global health concern with a high prevalence rate (10-15%); moreover, this disease is highly associated with morbidity and mortality, leading to financial and medical burdens (1)(2)(3).The gradual loss of kidney function in patients with CKD may lead to end-stage kidney disease (ESKD) that requires kidney replacement therapy.According to the latest annual report of the United States Renal Data System, the average annual increase in the ESKD prevalence worldwide from 2003 to 2016 ranged from 0.1 to 109 per million population (4), thus placing a greater burden on the health insurance system of many countries.The ESKD prevalence in Taiwan is high (4), with a substantial increase observed between 2010 and 2018 (5).
Timely intervention for delaying the progression of CKD to ESKD may not only improve the quality of life of patients but also reduce the associated morbidity and mortality (6, 7).As the exacerbation of renal function in patients with CKD is usually silent, it is clinically important to develop an accurate prediction model for the risk of CKD progression.Such a model is expected to facilitate physicians in making personalized treatment decisions, thereby improving the overall prognosis.Various statistical models have been developed to predict the risk of ESKD based on variables such as age, sex, blood pressure, comorbidities, laboratory data, and most commonly, the estimated glomerular filtration rate (eGFR) and proteinuria level (8).Among them, the most popular statistical model is the four-variable kidney failure risk equation (KFRE) based on age, sex, eGFR, and urine albumin-to-creatine ratio (9).Further, an eight-variable equation based on KFRE (which further included serum albumin, bicarbonate, calcium, and phosphate levels) was proposed to provide a more accurate prediction (10).The traditional statistical methods are often based on predefined hypotheses and assumptions.Researchers formulate hypotheses and test them using statistical methods.Moreover, these methods often focus on drawing inferences and conclusions about a population based on a sample.These methods aim to provide insights into causal relationships and generalizability (11).However, the traditional statistical methods have several limitations in effectively dealing with the challenges posed by big data and complex data structures.
Machine learning (ML) methods excel in analyzing unstructured data and complex patterns, whereas traditional statistical methods often require human intervention and expertise in model selection, hypothesis formulation, and result interpretation (11).In the big data era, several applications of ML, which is a subset of artificial intelligence (AI), have emerged in health informatics (12), allowing computers to perform a specific task without direct instruction (13).In contrast to theorydriven formula that requires a predefined hypothesis based on prior knowledge, ML models typically follow a data-driven approach that allows the model to learn from experience alone (14).Therefore, compared with the traditional statistical methods, ML models may demonstrate better performance in predicting a determined outcome, as they have no strict assumptions when modeling (15)(16)(17)(18)(19)(20)(21).The utilization of ML algorithms in CKD is a promising research topic that aims at assisting healthcare professionals in diagnosing and managing patients with CKD via computer-aided decision support systems for the early identification of critical events such as ESKD or eGFR doubling (22)(23)(24)(25).
To date, only a few studies have used ML methods in CKD populations for identifying metabolomic signatures of pediatric CKD etiology (26), developing a lifestyle scoring system for CKD prevention (27), using retinal images and clinical metadata to predict eGFR and CKD stage (28), and predicting incident ESKD in patients with diabetes mellitus (DM) and CKD (29).However, previous ML prediction models have primarily integrated baseline laboratory data and clinical information, and many studies have focused on predicting ESKD rather than eGFR changes (25).Furthermore, studies on CKD-related risk factor screening or analysis have predominantly relied on a single model without considering hybrid approaches, especially when employing ML methods (26,27,29,30).According to a previous study, the mean annual eGFR decline in healthy individuals was estimated to be 0.97 ± 0.02 mL/min/1.73m 2 (31).However, even for individuals with the same underlying comorbidities or extent of kidney function impairment, the eGFR decline could be highly variable.Early identification and management of patients with CKD based on longitudinal biochemical data are essential.
Therefore, this study aimed to identify significant factors that influence the prediction of eGFR changes using the ML and important variable selection (ML&IVS) scheme in a CKD cohort with longitudinal laboratory data.In the context of longitudinal data analysis, it is crucial to prioritize the examination of the rate and variation of biochemical data rather than solely relying on baseline data.This study employed five effective ML algorithms, namely, stochastic gradient boosting (SGB), multivariate adaptive regression splines (MARS), random forest (RF), eXtreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM).Using these algorithms, we developed an integrated multistage ML-based predictive scheme for all four subgroups according to eGFR and presence of DM to predict eGFR changes and subsequently evaluate and integrate relatively important risk factors.

Dataset
This retrospective study included 710 patients with nondialysisdependent CKD who were recruited from outpatient nephrology clinics of the Shin-Kong Wu Ho-Su Memorial Hospital Medical Center for a prospective cohort study from 2016 to 2021.The inclusion criteria were as follows: patients aged ≥20 years, those who sustained (≥3 months) a decrease in eGFR of ≤60 mL/min/1.73m 2 based on the four-variable modification of diet in renal disease study equation (32), and those who were regularly followed up in our CKD multidisciplinary care program (33).Patients who did not visit the nephrological outpatient department for ≥4 months and had incomplete data were excluded.In our CKD multidisciplinary care program, patients were regularly followed up in the nephrological outpatient department every 3 months.
This study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the Institutional Review Board of the Shin-Kong Wu Ho-Su Memorial Hospital, Taipei, Taiwan (IRB no.20200901R).Informed consent was waived because our study was based on a medical chart review.Patient information was anonymized and de-identified before analysis.

Definition of longitudinal variables
This study aimed to predict eGFR changes and the corresponding relationship between risk factors in the fourth examination of each patient; thus, the results of the first three examinations of each patient were utilized as independent variables.As the results of the first three examinations were collected from each patient, the independent variables used in this study could be regarded as longitudinal predictor variables.
The definitions of the longitudinal variables used in this study are presented in Table 1.As shown, V i,t represents the t th time examination result of i th variable [e.g., V 1,2 is the systolic blood pressure (SBP) result of the second examination].A total of 17 variables were utilized in this study.Moreover, as all patients had previous records of their first three examinations, the variables used in this study could be further defined as Equation (1).
where 1 ≤ i ≤ 17; 1 ≤ t ≤ 3.Then, this study utilized three approaches for generating more predictor variables that can be used to construct the ML predictive models.These three approaches, namely, "closest, " "mean, " and "standard deviation, " could provide various data from the independent variables.The "closest" approach generated the predictor variable using the result of the latest examination, which is the third examination result (V i,3 ) in this study.The predictor variable generated using the "closest" approach (V i C) was defined as Equation (2).For example, V 1 C is the SBP result of the third examination (V 1,3 ) and can be written as SBP(C).
The "mean" approach generated the predictor variable by calculating the mean of the results of the first three examinations ).The predictor variable generated using the "mean" approach (V i M) was defined as Equation (3).For demonstration, V 1 M was constructed by calculating the mean of the results of the first ( can also be written as SBP(M).
Similar to the concept of the "mean" approach, the "standard deviation" approach generated the predictor variable (V i S) by calculating the standard deviation of the results of the first three examinations (V i,1 , V i,2 , V i,3 ), as shown in Equation (4).For example, V 1 S is the standard deviation of the results of the first three SBP examinations ( ) and can be written as SBP(S).
Frontiers in Medicine 03 frontiersin.org The above mentioned 3 approaches were utilized for all 17 independent variables to generate the overall predictor variables used in this study.In addition, as height and weight do not change drastically between examinations, this study only used the "closest" approach for these variables.Thus, 47 predictor variables were generated and utilized for constructing the ML predictive models.The descriptive statistical data of 47 predictor variables and the target variable (eGFR of the fourth examination) are presented in Table 2.

Proposed scheme
This study proposed an integrated multistage predictive scheme known as ML&IVS based on five ML methods, including SGB, MARS, RF, XGBoost, and LightGBM, for predicting eGFR changes and subsequently identifying important risk factors.The five ML algorithms have been successfully used in various medical/healthcare applications (16,19,21,23,24,(34)(35)(36)(37).The overall process of ML&IVS scheme is shown in Figure 1.
In this process, first, 4-year health examination data of patients with CKD were collected; the dataset is described in Section "2.1.Dataset."Second, 47 longitudinal risk variables were defined, and the patients were categorized into five subgroups: Group 1 (patients with CKD stage 3 and DM), Group 2 (patients with CKD stage 3 without DM), Group 3 (patients with CKD stages 4-5 and DM), Group 4 (patients with CKD stages 4 and 5 without DM), and Group 5 (overall patients).The definitions of variables are presented in Section "2.2.Definition of longitudinal variables."Third, RF, SGB, MARS, XGBoost, and LightGBM were used to construct predictive models for all five groups.As multiple linear regression (MLR) is a commonly used statistical method in medical/healthcare applications, it was used in this study as a bench method.RF is an ensemble learning decision tree algorithm that combines bootstrap resampling and bagging (38).The RF principle entails random generation of many different and unpruned classification and regression tree (CART) decision trees, in which the decrease in Gini impurity is regarded as the splitting criterion, and all generated trees are combined into a forest.Then, all trees in the forest are averaged or voted to generate output probabilities and construct the final robust model.SGB is a treebased gradient boosting learning algorithm that combines both bagging and boosting techniques to minimize the loss function and thereby solve the overfitting problem of the traditional decision trees (39,40).SGB comprises many stochastic weak learners of decision trees that are sequentially generated through multiple iterations, in which each tree focuses on correcting or explaining the errors of the tree generated in the previous iteration.In other words, the residual of the previous iteration tree is used as the input for the newly generated tree.This iterative process is repeated until a convergence condition or stopping criterion is reached for the maximum number of iterations.Finally, the cumulative results of several trees are used to construct the final robust model.MARS is a nonlinear spline regression method and nonparametric form of the regression analysis algorithm (41).MARS uses multiple piecewise linear segments (splines) with different gradients.It considers each sample as a knot and divides it into several sections for the successive linear regression analysis of data within each section.For

Predictor variables
Abb. Name Frontiers in Medicine 04 frontiersin.orgProposed machine learning predictive and important variable selection scheme (ML&IVS).
determining knots, a forward algorithm is used to select all possible basic functions and their corresponding knots, and a backward algorithm is used to eliminate all basic functions to generate the best combinations of existing knots.
XGBoost is a gradient boosting technology based on an SGB-optimized extension (42).It trains many weak models sequentially to ensemble them using the gradient boosting method of outputs, which can achieve a promising prediction performance.In XGBoost, Taylor binomial expansion is used to approximate the objective function and arbitrary differentiable loss functions, thereby accelerating the convergence process of model construction.Then, XGBoost employs a regularized boosting technique to penalize the complexity of the model and correct overfitting, thus increasing the accuracy of the model.LightGBM is a decision tree-based distributed gradient boosting framework that uses the decision tree technique with improved histograms.
To improve the accuracy of tree models, LightGBM uses gradientbased one-sided sampling and a unique feature bundling algorithm to fit a loss function with negative gradients and learn the residual approximation of decision trees in iterations (43).
Fourth, after constructing effective ML models for all five groups, the relative importance values of each risk factor can be obtained via ML algorithms.In this study, the most important risk factor had an importance value of 100, whereas the least important risk factor had an importance value of 0. Values can be repeated, i.e., two or more variables can have similar variable importance.Because different ML algorithms have different calculus architectures and selection features, the five ML algorithms in this study may generate different variable importance values for a single risk factor.Within the same group, a single, robust, and complete variable importance value can be generated for each risk factor to facilitate subsequent comparison of variable rankings and identification of important risk factors.A single pooled value of variable importance was generated based on the average value of variable importance derived from the five ML models.
Fifth, the important variables among groups were compared, and their differences were discussed.Sixth, the conclusions of this study were presented based on the abovementioned findings.
For constructing each model, nested cross-validation (Nest-CV) was utilized.Nest-CV is a variation of the CV family.Under the structure of Nest-CV, two loops are required (inner and outer loops).The inner loop is used for hyperparameter tuning (which is equal to k-fold CV), whereas the outer loop is used for model evaluation, with the optimal hyperparameter set in the inner loop.Nest-CV is commonly utilized, and several studies have demonstrated that Nest-CV can overcome the problem of overfitting effectively (44)(45)(46)(47).
During model construction under Nest-CV structure, the outer loop first randomly splits the dataset into several folds (10 folds in this study).Then, for each iteration, one fold of the data from the outer loop is used for testing, whereas the remaining folds are used for training.Next, the training data from the inner loop are used for hyperparameter tuning with 10-fold CV.In the 10-fold CV approach, the training dataset was further randomly divided into 10 folds, wherein 9 folds were used to construct the model with a different set of hyperparameters, and 1 fold was used for validation.All 10 folds were used for validation at least once, and all possible combinations of the hyperparameters were investigated using a grid search.The hyperparameter set during validation with the lowest root mean square error (RMSE) was regarded as the most optimized set.After determining the set and training the model using the set, the test data from the outer loop were used for evaluation, which involves the completion of one iteration.Entire Nest-CV was completed when each fold of the data from the outer loop was used for testing at least once.After constructing the ML models, the variable importance values can be extracted from these models.As 10-fold Nest-CV is utilized, the extracted variable importance ranking from each ML model will have 10 scores for each variable.Therefore, to obtain the final corresponding variable importance values, the approach of averaging the importance scores for each variable was employed.

Empirical study
The dataset was divided into five subgroups, i.e., Group 1 (patients with CKD stage 3 and DM), Group 2 (patients with CKD stage 3 without DM), Group 3 (patients with CKD stages 4-5 and DM), Group 4 (patients with CKD stages 4 and 5 without DM), and Group 5 (overall patients).All groups followed the same modeling process for predicting eGFR in the fourth examination.The performance of the methods used in all five groups is presented in Table 4.
As shown in Table 4, the error of the metrics and their corresponding standard deviation (SD) are presented; percentage values after n in the first column indicate the corresponding proportion of the group.In Group 1, ML methods showed better performance than the MLR method.Among these ML methods, SGB showed the best performance with the following values: SMAPE, 0.097; MAPE, 0.098; RAE, 0.672; and RMSE, 4.922.RF had similar RRSE as SGB but lower SD than SGB.
In Group 4, RF showed the best performance with the following values: SMAPE, 0.159; MAPE, 0.178; RAE, 0.406; RRSE, 0.464; and RMSE, 3.333.Finally, in Group 5, the CKD stage 3-5 dataset was considered the full dataset without subgrouping.ML methods showed better performance than the MLR method.Among them, MARS showed the best performance with the following values: MAPE, 0.147; RAE, 0.350; RRSE, 0.397; and RMSE, 5.281.Overall, the performance of the five ML methods used in each subgroup had low SD, indicating that the ML usage in this study was reasonable and robust.Furthermore, different ML methods have different mechanisms for generating data regarding various risk factors, and each method visualizes data with a distinct perspective.In other words, ML methods can generate valuable information for supporting decision making with various perspectives; thus, the information generated from all ML methods can be considered.To accurately rank the risk factors using ML methods in different subgroups, this study revealed the top 10 risk factors in each subgroup (Figure 2).
As shown in Figure 2, in Group 1, risk factors were ranked by the five ML methods using the importance scores.Then, we calculated the average importance score of each risk factor to determine the top 10 important risk factors.Thus, in Group 1, the rank 1 risk factor was V 17 C, rank 2 was V 17 M, and rank 3 was V 17 S. Similarly, in Groups 2-5, the top 10 important risk factors were calculated.Further, to easily compare the important risk factors between subgroups, the results from Figure 2 were further displayed in Table 5 and the distribution plots of the top 10 important features of each subgroups are shown in Supplementary Figures 1-5.
Table 5 presents the differences in the top 10 important risk factors among all 5 groups.eGFR(C) and eGFR (M) were the first (rank 1) and second (rank 2) important risk factors, respectively, in all groups.Interestingly, BUN(C) was the third important risk factor in all groups, except for Group 1.In general, the ranking of risk factors in each subgroup was different.For example, UA(C) only appeared in Group 1 and K(S) only appeared in Group 2. Variable importance score generated from the five algorithms for each risk factor in the five groups.

Discussion
This study utilized longitudinal electronic health records to identify the risk factors associated with the prediction of eGFR changes in different CKD groups.This analysis considered the current values, mean values, and variation of biochemical data.Our proposed ML&IVS scheme yielded valuable results that can aid clinicians in effectively managing CKD progression and providing preventive measures in the CKD multidisciplinary care program.In addition to identifying important factors in CKD progression, our ML scheme, with some necessary modifications, can be seamlessly integrated into an electronic system, enabling the early identification of CKD progression.To the best of our knowledge, this study represents the first attempt to utilize ML methods to predict short-term eGFR changes in patients with CKD.The innovative application of ML in this context holds great potential for advancing the field and improving patient care.
ML has demonstrated promising performance in the nephrological field, including kidney function prediction via ultrasonography (54), acute kidney injury prediction in critical care (55,56), specific pattern identification on renal pathology slides (57, 58), optimal dialysis prescription (59, 60), calculation of further eGFRs (61), mortality risk prediction in patients undergoing dialysis (62), and ESKD prediction based on clinical data (63)(64)(65).In this study, five ML methods were adopted to obtain the 10 most important factors for predicting eGFR changes in different CKD groups.The latest eGFR and mean eGFR were the two most important factors among all subgroups.Moreover, the SD of eGFR was important in all subgroups.Under the assumption of the stochastic process, the latest value is always important to predict the next value.The mean value represents long-term trends, whereas the SD indicates the severity of fluctuation in the short term.
In this study, we employed the widely used Pearson correlation method to identify the top 10 factors with the highest correlation coefficients within each group (Supplementary Table 1).However, these high correlation coefficients appeared to exhibit less variability between groups.This raised concerns about their ability to provide meaningful distinction between the groups.Nevertheless, promising results were obtained through our proposed ML&IVS scheme.We revealed that the crucial factors identified via ML&IVS scheme displayed greater differences between groups than those identified solely through the Pearson correlation method.This finding indicated the effectiveness of our scheme in providing more informative and discriminative features that can potentially improve the understanding and characterization of different groups in this study.
Numerous studies have attempted to identify the risk factors associated with clinical CKD progression (66), and most of them have adopted traditional statistical methods under a theory-driven assumption.In this study, we used ML methods to determine important laboratory factors for short-term eGFR prediction.In all patients with CKD stages 3-5, the latest and mean phosphate levels, SBP variation, latest albumin levels, hemoglobin variation, and mean uric acid levels were important factors in addition to the indicators of current kidney function (eGFR and BUN).Such findings were consistent with those of previous studies reporting that serum phosphate (67), blood pressure (68), lower serum albumin (69), hemoglobin (70), and uric acid (71) levels were associated with CKD progression.
DM is the leading cause of CKD, and approximately 20-30% of patients with type 2 DM have moderate-to-severe CKD (eGFR of <60 mL/min/1.73m 2 ).Such kidney damage is associated with the accumulation of uremic toxins, inflammatory factors, and oxidative stress (72).Therefore, patients with DM have different risk profiles of CKD progression compared with those without DM (73); moreover, proteinuria is usually observed in patients with DM (74) and is well-known to be robustly associated with CKD progression (75).However, in our study, proteinuria was an important factor only in patients with moderate CKD without DM and those with advanced CKD with DM.Such discrepancy may be due to our prediction of short-term eGFR changes.The serum albumin level might be an alternative to proteinuria severity for predicting CKD stage 3 with DM.
The role of uric acid in CKD progression remains controversial.Our results revealed that serum uric acid is an important factor for predicting CKD stage 3 with/without DM.High uric acid levels may cause glomerular injury, tubulointerstitial fibrosis, atherosclerosis, and vascular injuries (76).Some observational studies have revealed that serum uric acid level is associated with renal function impairment (77)(78)(79).Moreover, a recent study reported that even uric acid levels under therapeutic criteria may increase the risk of CKD stage transition (74).However, three randomized controlled trials have reported that lowering uric acid levels is not beneficial in preserving renal function (80)(81)(82).This discrepancy is attributed to hyperuricemia resulting from renal function impairment.Therefore, asymptotic hyperuricemia does not require medical treatment but needs lifestyle modification.
Phosphorus is an important mineral for maintaining cell structure and energy.It is mainly found intracellularly (70%).Approximately 29% of phosphorus resides in the bones and <1% circulates in the serum (83).In CKD, kidneys cannot excrete phosphorus, resulting in hyperphosphatemia and consequent renal osteodystrophy (84), which was more significant in advanced CKD.A meta-analysis reported an independent association between serum phosphorus level and kidney failure in patients with nondialysis-dependent CKD (67), revealing that higher phosphorus levels might lead to a steeper decline in renal function.Moreover, fibroblast growth factor 23 (FGF23) was secreted by osteocytes in bones, and its levels increased at an early stage (starting at an eGFR of <90 mL/min/1.73m 2 ).FGF23 participates in serum phosphate hemostasis by decreasing phosphorus absorption from the alimentary tract (85).Therefore, hyperphosphatemia may increase FGF23 levels, leading to anemia, cardiovascular disease, and eventually death (86,87), resulting in a poor renal progression.
This study has some limitations.The dataset used in this study was from a single medical center, which may limit the generalizability of our findings.Therefore, federated learning, which refers to collaborative ML without centralized training data, is necessary to include more data from multiple centers in future studies (88).In addition, the prescribed medications were not collected.Some medicines may have affected the renal function progression.Finally, this cohort consisted of approximately 700 patients, which might have affected the model performance.Next, this study has some advantages.This study used longitudinal data to predict eGFR changes, which can provide a causal inference.The proposed ML&IVS scheme, which integrated the results of risk factor identification and information from five well-known ML methods, could provide more robust and useful information to support our results.

Conclusion
When longitudinal data are used to predict eGFR changes in patients with CKD with or without DM, the proposed ML&IVS scheme can effectively integrate the most significant risk factors from each model, resulting in more robust and comprehensive identification of important risk factors for predicting eGFR changes.It is crucial to increase awareness regarding eGFR changes, particularly in government health-promotion initiatives within the multidisciplinary CKD care program.This study provides valuable insights for initiating further discussions and follow-up research in this field.These findings contribute to our understanding of the factors influencing eGFR changes and can guide future investigations on the early identification and management of CKD progression.

TABLE 1
Definition of longitudinal variables.

TABLE 2
Descriptive statistical data of 47 predictor variables and the target variable.

TABLE 3
Equations of performance metrics.
n i=1 (y i − ŷi ) 2where ŷi and yi represent predicted and actual values, respectively; n, number of instances.

TABLE 4
Performance of the MLR and five ML methods in all five groups.

TABLE 5
Ranking of the top 10 important variables of the five groups.