Development of a Risk Model for Predicting Microalbuminuria in the Chinese Population Using Machine Learning Algorithms

Objective: Microalbuminuria (MAU) occurs due to universal endothelial damage, which is strongly associated with kidney disease, stroke, myocardial infarction, and coronary artery disease. Screening patients at high risk for MAU may aid in the early identification of individuals with an increased risk of cardiovascular events and mortality. Hence, the present study aimed to establish a risk model for MAU by applying machine learning algorithms.
Methods: This cross-sectional study included 3,294 participants ranging in age from 16 to 93 years. R software was used to analyze missing values and to perform multiple imputation. The observed population was divided into a training set and a validation set according to a ratio of 7:3. The first risk model was constructed using the prepared data, following which variables with P < 0.1 were extracted to build the second risk model. The second-stage model was then analyzed using a chi-square test, in which a P ≥ 0.05 was considered to indicate no difference in the fit of the models. Variables with P < 0.05 in the second-stage model were considered important features related to the prevalence of MAU. A confusion matrix and calibration curve were used to evaluate the validity and reliability of the model. A series of risk prediction scores were established based on machine learning algorithms.
Results: Systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting blood glucose (FBG), triglyceride (TG) levels, sex, age, and smoking were identified as predictors of MAU prevalence. Verification using a chi-square test, confusion matrix, and calibration curve indicated that the risk of MAU could be predicted based on the risk score.
Conclusion: Based on the ability of our machine learning algorithm to establish an effective risk score, we propose that comprehensive assessments of SBP, DBP, FBG, TG, gender, age, and smoking should be included in the screening process for MAU.


INTRODUCTION
Microalbuminuria (MAU) is defined as a urinary albumin excretion of 20-200 mg/L in a spot urine test or 30-300 mg in a 24-h urine collection test (1). The presence of MAU represents an early manifestation of general endothelial damage, which can occur secondary to diabetes, hypertension, and coronary heart disease (2,3). Research has demonstrated that MAU is closely associated with stroke, myocardial infarction, coronary artery disease, and all-cause mortality (4). Several studies have also indicated that MAU is predictive of vascular disease, diastolic dysfunction, congestive heart failure, and hypertension (5)(6)(7). Hence, clinical screening and early identification of MAU remain especially important.
Advancements in proteomics technology such as protein separation, biological mass spectrometry, and bioinformatics have decreased the difficulty of examining proteome expression (8). Despite these advancements, there are still many drawbacks in the detection of urine albumin (8). The gold standard in chronic kidney disease (CKD) screening is the 24-h urine collection test; however, this method is difficult to implement on a large scale due to its inconvenience (2).
Therefore, in the present study, we aimed to establish and validate a risk model for early prediction of MAU using machine learning algorithms rather than the results of 24-h urine microalbumin tests. Application of risk scores derived using such a model would be more convenient for the monitoring and follow-up of patients at higher risk for MAU.

MATERIALS AND METHODS
Study Population
This cross-sectional study was performed between June 2011 and January 2012 and included participants randomly selected using a clustered sampling technique (9), with probabilities proportionate to the size of the population in each cluster. All participants were from Ningde City in Fujian province in southeast China. Overall, 3,294 Chinese participants (age: 16-93 years) who had no cognitive dysfunction and were not pregnant took part in the survey. MAU was defined as a urinary albumin excretion of 20-200 mg/L and was assessed using a spot urine test (1,4). The exclusion criteria were as follows: history of type-1 diabetes mellitus (DM), history of kidney disease or urinary albumin excretion ≥200 mg/L, and pregnancy. The study was performed in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Fujian Provincial Hospital (approval No. K2009-12-020), and written informed consent was obtained from each participant. All investigators, who were unaware of the study's aims and of the characteristics of the participants, received special training before the investigation. Figure 1 shows a flowchart describing participant selection.
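The albumin thresholds used for case definition and exclusion can be written as a small classification rule. The following Python sketch is illustrative only: the function name and category labels are assumptions, and a value of exactly 200 mg/L is treated as an exclusion because the stated exclusion criterion (≥200 mg/L) overlaps the upper bound of the MAU range.

```python
def classify_spot_albumin(albumin_mg_per_l: float) -> str:
    """Classify a spot-urine albumin concentration given in mg/L.

    Thresholds follow the study definition: 20-200 mg/L is MAU and
    >=200 mg/L is an exclusion criterion. Treating exactly 200 mg/L as
    excluded is an assumption, since the two stated ranges overlap there.
    """
    if albumin_mg_per_l < 0:
        raise ValueError("albumin concentration cannot be negative")
    if albumin_mg_per_l < 20:
        return "normal"
    if albumin_mg_per_l < 200:
        return "MAU"
    return "excluded"
```

For example, `classify_spot_albumin(35.0)` falls in the MAU category under this rule.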

Data Collection
All participants were required to complete a standard self-reported questionnaire comprising 10 questions addressing age, sex, personal and family medical history, smoking and drinking habits, and related items.
Weight was measured to the nearest 0.1 kg, and height and waist circumference (WC) to the nearest 0.1 cm, by experienced nurses, with participants wearing light clothing and no shoes. WC was measured at the middle point between the costal margin and iliac crest. Systolic and diastolic blood pressures (SBP and DBP) were both measured twice using a standard OMRON auto-electronic sphygmomanometer, and the mean of the two readings was used for analysis.
Blood samples were collected after an 8- to 12-h overnight fast and were stored at −20 °C until analysis. Participants were provided with oral and written instructions on the collection of urine samples and advised to postpone urine collection in case of urinary tract infection, fever, or menstruation, and to avoid heavy exercise as much as possible during the collection period. The blood samples were evaluated at the Laboratory of Ningde Municipal Hospital. Each blood sample was independently assessed by two qualified examiners. Blood glucose levels were determined using the glucose oxidase method (Sclavo, Siena, Italy). The automatic colorimetric method (Hitachi, Boehringer Mannheim) was used to determine total cholesterol (TC), total triglyceride (TG), and high-density lipoprotein cholesterol (HDL-C) levels. Low-density lipoprotein cholesterol (LDL-C) levels were calculated using the Friedewald formula.
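For reference, the Friedewald formula estimates LDL-C as TC minus HDL-C minus TG/5 when concentrations are in mg/dL, or TG/2.2 when they are in mmol/L, and is generally considered unreliable at TG levels of 400 mg/dL (about 4.5 mmol/L) or above. The Python sketch below illustrates the calculation; the function name is hypothetical, and because the text does not state which unit was used, the unit is left as a parameter.

```python
def friedewald_ldl(tc: float, hdl: float, tg: float, unit: str = "mmol/L") -> float:
    """Estimate LDL-C with the Friedewald formula: TC - HDL-C - TG / k.

    k is 2.2 for mmol/L and 5 for mg/dL. The unit used in the study is
    not stated in the text, so it is an explicit argument here.
    """
    divisors = {"mmol/L": 2.2, "mg/dL": 5.0}
    tg_limits = {"mmol/L": 4.5, "mg/dL": 400.0}
    if unit not in divisors:
        raise ValueError("unit must be 'mmol/L' or 'mg/dL'")
    if tg >= tg_limits[unit]:
        # The estimate is generally not used at very high TG levels.
        raise ValueError("Friedewald estimate is unreliable at this TG level")
    return tc - hdl - tg / divisors[unit]
```

With hypothetical inputs of TC 200, HDL-C 50, and TG 150 mg/dL, the estimate is 200 − 50 − 150/5 = 120 mg/dL.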

Statistical Analysis
All calculations were performed using R software (version 3.6.3; GUI 1.70, El Capitan build 7735).
The "VIM" package for R software was used to analyze missing values and visualize the data. The "mice" package was used to perform multiple imputation on missing values (m = 5, method = "pmm", maxit = 100, seed = 1234). The imputed data and their distribution in the original dataset were analyzed and visualized using the "lattice" package.
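To illustrate the idea behind predictive mean matching (the "pmm" method named above), the following Python sketch imputes one variable from a single predictor. It is a deliberately simplified illustration, not the procedure implemented in "mice": it uses one least-squares fit instead of drawing regression parameters, handles only one predictor, and the function name and arguments are hypothetical.

```python
import random


def pmm_impute(x, y, k=5, seed=1234):
    """Fill None entries of y by predictive mean matching on predictor x.

    Simplified sketch: a single least-squares fit replaces the parameter
    draws that a full multiple-imputation routine would perform.
    """
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    # Each donor pairs its predicted value with its actually observed value.
    donors = [(predict(xi), yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = predict(xi)
        # Borrow an observed value from one of the k closest donors.
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed
```

Because every imputed value is copied from an observed case, imputations stay within the range of the real data. Calling the function with several different seeds would yield several completed datasets, analogous to m = 5 in the settings above.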
The observed population was divided into a training set and a validation set according to a ratio of 7:3. The "glm" function was used to fit the first risk model, a logistic regression, to the prepared data. Then, variables with P < 0.1 were extracted to build the second risk model, also using "glm". A chi-square test comparing the second-stage model with the first was performed using the "anova" function, and a P ≥ 0.05 was considered to indicate no difference in fit between the models. A confusion matrix was used to verify the accuracy of the model, and a calibration curve was constructed using the "calibrate" package. The closer the calibration curve lay to the line y = x, the better the calibration of the model was considered to be. Variables with P < 0.05 in the second-stage model were regarded as important features related to the prevalence of MAU. Graphical representations of the results were drawn using the "forestplot" package, and the risk score was established using a nomogram.
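A chi-square comparison of two nested logistic models is usually a likelihood-ratio test: the difference in residual deviance is referred to a chi-square distribution whose degrees of freedom equal the number of dropped variables. The Python sketch below shows that calculation under this assumption; the function names are hypothetical, and the deviance values in the test are invented for illustration and are not results from the study.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1 or x < 0:
        raise ValueError("df must be >= 1 and x must be >= 0")
    z = x / 2.0
    # Closed-form starting values of the regularized upper gamma function.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(deviance_reduced: float, deviance_full: float, df_diff: int):
    """Return (statistic, p-value) for comparing two nested models."""
    statistic = max(deviance_reduced - deviance_full, 0.0)
    return statistic, chi2_sf(statistic, df_diff)
```

Under the criterion stated above, a p-value of at least 0.05 from `likelihood_ratio_test` would be read as no detectable loss of fit when moving from the first model to the reduced second-stage model.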

RESULTS
Participant Characteristics
The enrolled participants were categorized based on urinary albumin levels, gender, presence of hypertension/diabetes, and smoking and drinking habits. The study population comprised 3,294 participants [men: 1,294 (39.3%); women: 2,000 (60.7%)]. The characteristics of the participants are shown in Tables 1 and 2. A visual depiction of the distribution of these characteristics is shown in Figure 2.

Analysis of Missing Values
Twenty-eight observation indices were analyzed for missing values. The insulinogenic index had the most missing values, accounting for 10% (n = 330), followed by HOMA-IR and HOMA-β, each of which accounted for <10%. An analysis of trends in the distribution of missing values indicated that they were randomly distributed, conforming to the missing-at-random (MAR) assumption (Figures 3A,B). The "mice" package was used to perform multiple imputation on data with missing values (m = 5, method = "pmm", maxit = 100, seed = 1234). The imputed data and their distribution in the original dataset are shown in Figure 3C.

Risk Model for MAU
The observed population was divided into a training set and a validation set according to a ratio of 7:3 (training set: 2,305; validation set: 989). The 2,305 cases in the training set were used to build the first predictive model, which is described in Table 3. Significant factors in this model (P < 0.10) included SBP, DBP, fasting blood glucose (Bg_0 min), TC, TG level, HDL level, gender, age, and smoking. Logistic model fitting was performed again after extracting the variables with P < 0.10. The second-stage model was then evaluated using a chi-square test, confusion matrix, and a calibration curve. The specificity of the model in the validation set reached 0.9, with an accuracy of 0.63. The positive and negative predictive values were 0.55 and 0.65, respectively (Figure 4A). The calibration curve remained close to the line y = x, indicating good calibration in both the training and validation sets (Figures 4B,C). Based on a P < 0.05, important features related to the prevalence of MAU in the second-stage model included mean SBP, mean DBP, FBG, TC, TG level, HDL, gender, age, and smoking (Figure 5A).
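The evaluation measures reported above follow directly from a confusion matrix, and calibration can be inspected by comparing predicted probabilities with observed event rates. The Python sketch below illustrates both; the function names are hypothetical, and the simple equal-width binning is an assumption that may differ from the method used to draw the curves in Figure 4.

```python
def confusion_metrics(y_true, y_pred):
    """Compute standard metrics from binary labels (1 = MAU, 0 = no MAU)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def calibration_bins(y_true, prob, n_bins=5):
    """Return (mean predicted probability, observed event rate) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(prob, y_true):
        index = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[index].append((p, t))
    return [
        (sum(p for p, _ in b) / len(b), sum(t for _, t in b) / len(b))
        for b in bins
        if b
    ]
```

In a well-calibrated model, the two numbers in each pair returned by `calibration_bins` are close to each other, which corresponds to points lying near the line y = x.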

Development of an MAU Risk Score
Given their significant relationship with MAU based on our analysis of the second-stage model, the following variables were used to develop the risk prediction system: mean SBP, mean DBP, FBG, TC, TG, HDL, gender, age, and smoking. Figure 5B shows how total risk scores for MAU are calculated.
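As background for reading such a score, a logistic model links the predictors to the probability of MAU as sketched below. The symbols are generic assumptions rather than values from this study: p is the predicted probability of MAU, x_j are the predictors listed above, β_0 is the intercept, and β_j are the fitted coefficients.

```latex
\[
\ln\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j ,
\qquad
p = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_j\right)} ,
\qquad
\mathrm{OR}_j = e^{\beta_j} .
\]
```

A nomogram typically rescales each term β_j x_j onto a common points axis, so that the sum of points maps back to the linear predictor and hence to a predicted probability.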

DISCUSSION
MAU is an early marker of diabetic kidney disease (DKD) (2), cardiovascular disease, and renal risk (1). Accounting for ∼50% of end-stage kidney disease (ESKD) cases in the developed world (16), DKD has a major effect on global healthcare costs and resources (2). Estimates indicate that the prevalence of MAU among patients with type-2 DM in the Asia-Pacific region ranges from 17.0 to 18.2%, while severe albuminuria and reduced estimated glomerular filtration rate (eGFR) are observed (17,18). These statistics highlight the importance of screening, early detection, and prevention efforts to reduce the overall impact of MAU. Given that diabetic glomerulopathy can only be diagnosed definitively via a kidney biopsy, few studies to date have investigated methods for predicting MAU (3), making it difficult to perform a detailed analysis of MAU risk (19). DKD may be present long before the patient develops traditional indications for a kidney biopsy (20). Careful screening and prediction using the risk score developed in our study may allow for early detection of MAU without the need for a kidney biopsy.
In contrast to previous findings, DM was not identified as an independent factor influencing MAU risk in the current study. This inconsistency may be related to the low proportion of patients with DM among our participants (11.9%). In the surveyed population, elevated FBG and PBG were more prevalent than DM, suggesting that diabetes had not been identified in some participants. However, elevated FBG was associated with higher odds of MAU [odds ratio (OR): 1.11, 1.05-1.19], indicating that elevated blood glucose was still an independent risk factor for MAU.
Increased intraglomerular capillary pressure, which is related to systemic blood pressure as well as pre- and postglomerular resistance, is the most important determinant of MAU (21,22). Previous studies have reported that blood pressure is closely associated with albuminuria in patients with hypertension and in the general population (23)(24)(25). A study conducted among the Japanese population demonstrated that SBP exhibited an independent positive correlation with MAU (21). Another Japanese study indicated that both systolic hypertension and hyperglycemia were independent risk factors for MAU, in accordance with our findings (21). Saadi et al. have also observed that SBP and DBP are significantly higher in patients with MAU than in the general population (26).
One study conducted in China reported that, for each 10 mg increment in 24-h urinary microalbumin excretion within the normal range, the odds of significantly elevated TG levels increased by 41% (24). Our analysis indicated that, when compared with the normal TG range, abnormally elevated TG levels were associated with a 1.10-fold increase in the odds of MAU (OR: 1.10, 1.02-1.20).
Although Ge et al. (24) observed no significant difference between genders in 24-h urinary microalbumin excretion in a study of Chinese adults, our results differ. Our analysis identified gender as an important feature influencing the prevalence of MAU (OR: 1.47, 1.17-1.86). In Japan, the albumin/creatinine ratio is higher in women and older adults than in men and younger individuals, respectively, but this is not true for the albumin concentration (21). Our findings also indicated that, for each 10-year increment in age, the odds of MAU increased significantly by 9% (OR: 1.09, 1.03-1.18).
Several previous studies have reported that MAU is related to smoking (27)(28)(29) and obesity (30)(31)(32), while others have noted the influence of race and region on MAU prevalence (33)(34)(35). Our study suggests that smoking is indeed an important feature affecting the prevalence of MAU (OR: 1.35, 1.02-1.77), while no such relationship was observed for obesity. However, despite appropriate calibration of the model, the low prevalence of obesity among our participants may have influenced our results.
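The odds ratios quoted in this Discussion can be converted back to log-odds coefficients, which is the scale on which a logistic model combines predictors. The Python sketch below does this for the five reported estimates. It assumes that the quoted ranges are 95% Wald confidence intervals, which the text does not state explicitly; the dictionary and function names are hypothetical.

```python
import math

# Odds ratios with lower and upper bounds, as quoted in the Discussion.
REPORTED_OR = {
    "FBG": (1.11, 1.05, 1.19),
    "TG": (1.10, 1.02, 1.20),
    "gender": (1.47, 1.17, 1.86),
    "age (per 10 years)": (1.09, 1.03, 1.18),
    "smoking": (1.35, 1.02, 1.77),
}


def log_odds_summary(odds_ratio: float, lower: float, upper: float, z: float = 1.96):
    """Return (beta, approximate standard error) on the log-odds scale.

    Assumes a Wald interval of the form exp(beta +/- z * se); z = 1.96
    corresponds to a 95% interval, which is an assumption here.
    """
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, se


if __name__ == "__main__":
    for name, (or_value, lower, upper) in REPORTED_OR.items():
        beta, se = log_odds_summary(or_value, lower, upper)
        print(f"{name}: beta = {beta:.3f}, approx. SE = {se:.3f}")
```

Because coefficients add on the log-odds scale, odds ratios multiply: under this model, a participant with two independent risk factors has odds equal to the baseline odds times the product of the two odds ratios.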
To our knowledge, the current study is the first to establish a risk score for MAU using a large sample of patients, to establish such a model using multiple imputation to account for missing data, and to utilize chi-square and logistic fitting for double-verification of model quality. Nonetheless, the study also had some potential limitations, including relatively limited variations in race and region. Furthermore, this was a single-center and cross-sectional study, necessitating verification of our model in multicenter studies with long-term follow-up periods.
In conclusion, based on our analysis using machine learning algorithms, we propose that comprehensive assessments of SBP, DBP, FBG, TG, gender, age, and smoking be included in the screening process for MAU. The risk score established in the present study may allow clinicians and patients to initiate early interventions that can delay or prevent the development of MAU.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Fujian Provincial Hospital Ethics Committee. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
WL and SS performed the statistical analysis and wrote the first draft of the manuscript. HH, NW, and JW interpreted the data; reviewed, edited, and critically revised the manuscript; and approved the final version of the manuscript. JW and GC designed the study. GC had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.