Machine learning in predicting T-score in the Oxford classification system of IgA nephropathy

Background Immunoglobulin A nephropathy (IgAN) is one of the leading causes of end-stage kidney disease (ESKD). Many studies have shown the significance of pathological manifestations in predicting the outcome of patients with IgAN, especially T-score of Oxford classification. Evaluating prognosis may be hampered in patients without renal biopsy. Methods A baseline dataset of 690 patients with IgAN and an independent follow-up dataset of 1,168 patients were used as training and testing sets to develop the pathology T-score prediction (T pre) model based on the stacking algorithm, respectively. The 5-year ESKD prediction models using clinical variables (base model), clinical variables and real pathological T-score (base model plus T bio), and clinical variables and T pre (base model plus T pre) were developed separately in 1,168 patients with regular follow-up to evaluate whether T pre could assist in predicting ESKD. In addition, an external validation set consisting of 355 patients was used to evaluate the performance of the 5-year ESKD prediction model using T pre. Results The features selected by AUCRF for the T pre model included age, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, and uric acid. The AUC of the T pre was 0.82 (95% CI: 0.80–0.85) in an independent testing set. For the 5-year ESKD prediction model, the AUC of the base model was 0.86 (95% CI: 0.75–0.97). When the T bio was added to the base model, there was an increase in AUC [from 0.86 (95% CI: 0.75–0.97) to 0.92 (95% CI: 0.85–0.98); P = 0.03]. There was no difference in AUC between the base model plus T pre and the base model plus T bio [0.90 (95% CI: 0.82–0.99) vs. 0.92 (95% CI: 0.85–0.98), P = 0.52]. The AUC of the 5-year ESKD prediction model using T pre was 0.93 (95% CI: 0.87–0.99) in the external validation set. 
Conclusion A pathology T-score prediction (T pre) model using routine clinical characteristics was constructed, which could predict the pathological severity and assist clinicians to predict the prognosis of IgAN patients lacking kidney pathology scores.


Introduction
Immunoglobulin A (IgA) nephropathy (IgAN) is one of the most common forms of glomerulonephritis worldwide. The clinical manifestations are heterogeneous, ranging from asymptomatic proteinuria or microscopic hematuria to rapid deterioration in kidney function (1). Approximately 20%-30% of patients with IgAN have been reported to progress to kidney failure within 20 years (2). Early identification of patients with IgAN at high risk of end-stage kidney disease (ESKD) therefore allows early intervention to delay disease progression. Many researchers have searched for risk factors for ESKD in patients with IgAN. Generally accepted risk factors for the progression of IgAN include decreased glomerular filtration rate (GFR), 24-h proteinuria >1 g/day, hypertension, and renal pathological manifestations (3)(4)(5)(6)(7)(8)(9). These risk factors have been used to build various scoring models for predicting the prognosis of IgAN based on traditional statistical methods (4)(10)(11)(12)(13)(14). However, these scoring models were built on small samples and used different pathological scoring criteria, which may limit their accuracy and generalizability. Moreover, interactions between characteristics and their effect on ESKD, non-linear relationships among predictors, and the effects of therapeutic regimens complicate the interpretation of the data.
Machine learning, as a branch discipline of artificial intelligence, has obvious advantages in processing high-dimensional and sparse data. Machine learning algorithms can learn the relationship between input features and target outcomes as well as the relationship between features through a large amount of training data. Several studies have successfully constructed ESKD prediction models for patients with IgAN through machine learning algorithms (15)(16)(17)(18)(19)(20). By comparing the performance of traditional statistical methods and different machine learning algorithms in predicting ESKD or halving of estimated glomerular filtration rate from baseline, Chen et al. showed that the XGBoost algorithm performed best (16). XGBoost, as a machine learning algorithm, assembles the weak prediction models to construct a prediction model (16,21). Several studies have tried to construct event prediction models for a specific clinical outcome based on the XGBoost algorithm (22,23). However, no matter whether it was a traditional prediction formula or a machine learning-based predictive model in IgAN, pathology scores showed consistently significant weighting among many parameters (15,16,19,24). In 2009, the Oxford classification, an international consensus, was proposed to classify IgA nephropathy based on histopathological features to predict its prognosis and guide clinical treatment. The revised Oxford classification in 2017 divided IgAN into five categories, namely, "(1) mesangial hypercellularity (M); (2) endocapillary hypercellularity (E); (3) segmental glomerulosclerosis (S); (4) tubular atrophy/interstitial fibrosis (T); (5) cellular/ fibrocellular crescents (C)" (25), which were shown to be the independent predictors in predicting renal outcome (24,26). 
Since 2009, over 20 validation studies have examined the predictive value of the MEST scores in retrospective cohorts of patients with IgAN. They provided consistent evidence that mesangial hypercellularity (M), segmental glomerulosclerosis (S), and tubular atrophy/interstitial fibrosis (T) each had prognostic value in univariate analysis (26), with the T lesion suggested to be the strongest predictor of renal survival. Hernan et al. summarized the results of these studies and found that M was of independent prognostic value in 5 out of 19, E in 4 out of 19, S in 7 out of 19, and T in 13 out of 19 (26). The C-score was adopted in the revised classification system in 2017, and three of the five prognostic studies on IgA nephropathy showed that the C-score was associated with poor prognosis (26)(27)(28). In the constructed IgAN prognosis prediction models, the T lesions carried greater weight in predicting prognosis than many other clinical and pathological parameters (14,16). For example, in the prognosis prediction model constructed by Chen et al., three indexes were integrated to predict ESKD, namely, T, global sclerosis, and urine protein, among which the T-score ranked first in importance (16). However, the T-score is derived from kidney biopsy, an invasive procedure that patients sometimes refuse and that cannot be repeated routinely in clinical practice to detect disease progression. Hence, it is of great significance to explore whether pathological T lesions can be predicted from the patient's concurrent clinical variables.
The purposes of our study are 1) to construct a pathology T-score prediction (T pre) model based on the patient's concurrent clinical variables, which may predict whether a pathological T lesion is present, and 2) to evaluate whether the predicted T can be used to assist in predicting ESKD.

Study participants
This study used two independent datasets. Dataset 1, a baseline dataset without follow-up data, comprised 690 patients with IgAN who underwent kidney biopsy in our center but returned to their local hospitals for follow-up. Dataset 2, a follow-up dataset (PKU-IgAN cohort), included 1,808 patients with IgAN who were registered and followed long term in the Peking University First Hospital IgAN database from 1997 to 2020 (29). All patients with IgAN were diagnosed based on the histologic and immunofluorescence study of the renal biopsy, and those with <8 glomeruli per biopsy section were excluded (29). After excluding 243 patients without blood lipid data, 28 patients younger than 16 years of age at presentation, and 14 patients who presented with acute kidney failure, 1,523 patients in dataset 2 were finally enrolled in this study: 1,168 patients with Oxford MEST-C scores and 355 patients lacking Oxford MEST-C scores.
Finally, a total of 690 patients in dataset 1 and 1,168 patients with Oxford MEST-C scores in dataset 2 were enrolled in our study as the modeling group, and 355 patients without Oxford MEST-C scores in dataset 2 were enrolled in this study as the external validation group (Figure 1).
All clinical characteristics were collected at the time of the renal biopsy. The estimated glomerular filtration rate (eGFR) was calculated using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula (30). Renal biopsies were categorized according to established criteria for the Oxford MEST-C scoring system (24,26,31). Mean arterial pressure (MAP, mm Hg) was defined as diastolic pressure plus a third of the pulse pressure. End-stage kidney disease (ESKD) was defined as eGFR <15 ml/min/1.73 m², dialysis, or kidney transplantation. Our study was approved by the Ethics Committee of Peking University First Hospital (IRB number 2020Y197). Written informed consent was provided by all participants.
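The two definitions above (MAP and ESKD) can be written as small helper functions. This is only an illustrative sketch; the function and argument names are hypothetical and do not come from the study's code.

```python
def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """MAP (mm Hg) = diastolic pressure + one third of the pulse pressure."""
    pulse_pressure = sbp - dbp
    return dbp + pulse_pressure / 3.0


def is_eskd(egfr: float, on_dialysis: bool = False, transplanted: bool = False) -> bool:
    """ESKD as defined in the text: eGFR < 15 ml/min/1.73 m2, dialysis, or transplantation."""
    return egfr < 15.0 or on_dialysis or transplanted


if __name__ == "__main__":
    # Hypothetical patient: blood pressure 120/80 mm Hg, eGFR 60
    print(round(mean_arterial_pressure(120, 80), 2))
    print(is_eskd(60.0))
```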

Pathology T-score prediction model
FIGURE 1 The flowchart of this study. WSVM, weighted support vector machine; WRF, weighted random forest; WLR, weighted logistic regression; AKI, acute kidney injury.
The pathology T-score prediction (T pre) model, constructed by the stacking algorithm, was used to predict whether IgAN patients would have T lesions (yes or no). The stacking algorithm is an ensemble machine learning algorithm that combines several models to predict new observations. It uses the predictions of a collection of first-level models as input for training a second-level model, which aims to find the best combination of the first-level predictions. Stacking can harness the capabilities of a range of well-performing models so that a better output prediction model can be achieved (32). In our study, we combined three machine learning algorithms, namely, support vector machine (SVM), random forest (RF), and logistic regression, as first-level models and then used logistic regression as the second-level model to output the final probability of the binary T-score (with or without tubular atrophy/interstitial fibrosis, T pre).
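To make the second-level step of stacking concrete, the following minimal sketch fits a logistic regression on the probabilities produced by three first-level models. It is not the authors' implementation: the first-level probabilities are made-up numbers standing in for SVM, RF, and logistic regression outputs, the function names are hypothetical, and the meta-learner is fitted with plain gradient descent. In practice the first-level probabilities would normally be out-of-fold predictions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_logistic(level1_probs, labels, lr=0.5, epochs=2000):
    """Fit the second-level logistic regression by batch gradient descent.

    level1_probs: rows of [p_svm, p_rf, p_lr] from the first-level models.
    labels: 1 if a T lesion is present, 0 otherwise.
    """
    n_models = len(level1_probs[0])
    n = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(level1_probs, labels):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            err = p - y  # gradient of the log-loss with respect to the logit
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_t_pre(weights, bias, row) -> float:
    """Final stacked probability of a T lesion for one patient."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


if __name__ == "__main__":
    # Hypothetical first-level probabilities for six patients
    probs = [[0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.2, 0.3, 0.1],
             [0.1, 0.4, 0.2], [0.7, 0.9, 0.6], [0.3, 0.2, 0.4]]
    labels = [1, 1, 0, 0, 1, 0]
    w, b = fit_meta_logistic(probs, labels)
    print(predict_t_pre(w, b, [0.85, 0.75, 0.8]))
```

The second-level model only sees the first-level probabilities, so its learned weights express how much each first-level model is trusted.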
The input variables used in this model were chosen by AUCRF (33), a method using the random forest to find the optimal set for prediction. Variables entered into the AUCRF included age, sex, body mass index, systolic arterial pressure, diastolic arterial pressure, mean arterial pressure, hypertension, eGFR, proteinuria, microhematuria, history of gross hematuria, serum IgA, serum uric acid, serum triglycerides, total cholesterol, high-density lipoprotein, and low-density lipoprotein.

Five-year ESKD prediction model
Several studies have demonstrated the value of tubular atrophy/interstitial fibrosis (T) in predicting ESKD in patients with IgAN (16,19,24,34,35). To evaluate whether the predicted T-score could help predict ESKD and how effective it was, we constructed a 5-year ESKD prediction model based on the XGBoost algorithm. To illustrate the significance of tubular atrophy/interstitial fibrosis in predicting ESKD, we first constructed a 5-year ESKD prediction model with only clinical variables as input variables (base model). Then, the 5-year ESKD prediction model using clinical variables and the real pathological T lesions score (T bio, T0 was assigned 0, T1 and T2 were assigned 1) was also developed (base model plus T bio) to evaluate the additive value of tubular atrophy/interstitial fibrosis (T) in predicting ESKD. Finally, to evaluate whether the value of T pre in predicting ESKD was consistent with that of the real pathological T lesions (T bio), we trained the base model plus T bio on the training set, replaced the T bio of the testing set with the corresponding T pre predicted by the pathology T-score prediction model, and then evaluated model performance on the testing set (base model plus T pre). For the base model plus T pre, the model was trained with the real pathological T-score (T bio) so that it could learn the true value of T for predicting ESKD.
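The coding of T bio and the test-time substitution of T pre described above can be sketched as follows. The record layout and the names `binarize_t`, `swap_in_t_pre`, and the `"T"` key are hypothetical and chosen only for illustration.

```python
def binarize_t(t_score: int) -> int:
    """Code the Oxford T-score as in the text: T0 -> 0; T1 and T2 -> 1."""
    if t_score not in (0, 1, 2):
        raise ValueError("Oxford T-score must be 0, 1, or 2")
    return 0 if t_score == 0 else 1


def swap_in_t_pre(test_rows, t_pre_values, key="T"):
    """Return copies of the test records with biopsy-based T replaced by predicted T.

    The model itself is still trained on T bio; only the testing-set input changes.
    """
    if len(test_rows) != len(t_pre_values):
        raise ValueError("need one predicted T per test record")
    return [{**row, key: t_pre} for row, t_pre in zip(test_rows, t_pre_values)]


if __name__ == "__main__":
    # Hypothetical testing-set records with biopsy-based T (T bio)
    test_rows = [{"egfr": 45.0, "T": binarize_t(2)}, {"egfr": 95.0, "T": binarize_t(0)}]
    print(swap_in_t_pre(test_rows, [1, 0]))
```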
XGBoost is an ensemble of decision trees whose advantages include the ability to capture higher-order interactions and complex nonlinear relationships between the model features and the outcome (21). It has been shown to achieve impressive performance in predicting renal failure risk and to provide explanations for variables by ranking their importance (16,34). To evaluate whether the predicted T could be used in ESKD prediction models based on different algorithms, we also applied other machine learning algorithms to our dataset, including RF, penalized regression, artificial neural network (ANN), and SVM.
Characteristics selected by the Cox proportional hazards model were collected at the time of the renal biopsy at enrollment [age, sex, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, serum uric acid, serum triglycerides, total cholesterol, low-density lipoprotein, and history of previous use of renin-angiotensin system (RAS) inhibitors and immunosuppressants as well as pathological T lesions], whereas the binary outcome (ESKD within 5 years after diagnostic kidney biopsy, yes or no) represented the output data. For these variables, we imputed missing values with the mean for continuous characteristics and the mode for categorical characteristics. Because information on serum triglycerides, total cholesterol, and low-density lipoprotein was missing in some cases, 243 patients without blood lipid data were excluded to avoid inaccuracy due to missing-value imputation (Figure 1).
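The imputation rule above (mean for continuous characteristics, mode for categorical ones) can be illustrated with the Python standard library. The column names and the `impute_missing` helper are hypothetical; missing values are assumed to be stored as `None`.

```python
from statistics import mean, mode


def impute_missing(rows, continuous, categorical):
    """Return copies of the records with None replaced by the column mean or mode."""
    fill = {}
    for col in continuous:
        fill[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


if __name__ == "__main__":
    # Hypothetical records with one missing value per column
    rows = [
        {"uric_acid": 300.0, "sex": "M"},
        {"uric_acid": None, "sex": "M"},
        {"uric_acid": 400.0, "sex": None},
        {"uric_acid": 350.0, "sex": "F"},
    ]
    for r in impute_missing(rows, ["uric_acid"], ["sex"]):
        print(r)
```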
To confirm that the T pre can be used in the ESKD prediction model at multiple levels, we also constructed a lifetime ESKD prediction model based on XGBoost. The process and approach were the same as building the 5-year ESKD prediction model. The primary outcome was time-to-event ESKD. The survival time for the kidney without ESKD event was calculated from the kidney biopsy to the last follow-up.
XGBoost was allowed to run at most 110 boosting rounds, and the maximum depth of each tree was constrained to 5. To avoid overfitting, we further set the L2 regularization term on weights to 1 and stopped training if performance did not improve for 15 rounds. Finally, the optimal prediction model parameters and architectures were selected by five-fold cross-validation.
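For readers who want to reproduce a comparable setup, the stated settings map onto XGBoost's parameter names roughly as below. The study reports using R; this sketch uses the Python naming of the same library. The objective and evaluation metric are assumptions based on the binary 5-year outcome and the reported AUC, not values given in the text.

```python
# Booster parameters; the values for max_depth and lambda come from the text.
xgb_params = {
    "objective": "binary:logistic",  # assumption: binary 5-year ESKD outcome
    "eval_metric": "auc",            # assumption: AUC used to monitor training
    "max_depth": 5,                  # maximum depth of each tree
    "lambda": 1,                     # L2 regularization term on weights
}

# Training-loop settings stated in the text.
cv_settings = {
    "num_boost_round": 110,       # at most 110 boosting rounds
    "early_stopping_rounds": 15,  # stop when the metric has not improved for 15 rounds
    "nfold": 5,                   # five-fold cross-validation
}

# With the xgboost package installed and a DMatrix named dtrain (hypothetical),
# these would typically be passed as:
#   import xgboost as xgb
#   results = xgb.cv(xgb_params, dtrain, **cv_settings)
```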
The patients of dataset 2 without Oxford MEST-C scores combined with the corresponding T pre were used as an additional external validation set to evaluate the performance of the ESKD prediction model using T pre .

Statistical analysis
The sociodemographic and clinical variables were expressed as the mean ± standard deviation for variables with approximately symmetrical distributions and as the median (interquartile range, 25th-75th percentile) for variables with skewed distributions. All categorical variables are expressed as frequencies and percentages. Univariate analyses based on the Cox proportional hazards model (36) were conducted to evaluate the association between the baseline clinical characteristics and ESKD. Clinical characteristics that were associated with ESKD in univariate analysis (P < 0.05) or were clinically relevant were used as input features of the 5-year ESKD prediction model.
For predicting 5-year ESKD status (yes or no) and T-score (0 or 1), the performance of the models was assessed by calculating the accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC). For predicting lifetime ESKD risk, we quantified the performance of the model by the concordance statistic (C-statistic), which generalizes the AUC to time-to-event survival data (37).
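As a reminder of what these metrics measure, the sketch below computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count one half), together with accuracy, sensitivity, and specificity at a fixed cutoff. The 0.5 cutoff and the function names are illustrative assumptions; the study does not state its cutoff.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative cases."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_metrics(scores, labels, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a given probability cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2]  # hypothetical predicted probabilities
    labels = [1, 0, 1, 0]
    print(auc(scores, labels))
    print(confusion_metrics(scores, labels))
```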
The C-statistic compares the rank of the predicted probabilities with the rank of the observed survival times. The calibration of the models was assessed by the Hosmer-Lemeshow test and calibration scatter plots; a P-value >0.05 indicated no significant difference between the probabilities predicted by the model and the observed outcome frequencies during a given time period. SPSS version 26.0 software and R 3.6.3 were used for the statistical analysis. All P-values were two-tailed, and P <0.05 was considered statistically significant.
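The rank comparison behind the C-statistic can be sketched as follows. This is a Harrell-style concordance index written for illustration; the study does not specify which estimator was used, and the function name is hypothetical. A pair of patients is comparable when the one with the shorter follow-up time actually had the event; the pair is concordant when that patient also has the higher predicted risk.

```python
def c_statistic(times, events, risks):
    """Harrell-style concordance index for right-censored data.

    times: follow-up time per patient.
    events: 1 if ESKD occurred, 0 if censored.
    risks: predicted risk (higher means earlier expected ESKD).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical follow-up data for four patients
    print(c_statistic([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1]))
```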

Characteristics of the study participants
The clinical characteristics of 690 patients with IgAN in dataset 1 are shown in Table 1. The mean age of these patients was 32.38 ± 11.32 years at the time of renal biopsy. The male-to-female ratio was 1.2:1. The mean arterial pressure was 94.44 ± 14.02 mm Hg. The median eGFR was 84.66 (interquartile range, 63.32-107.50) ml/min per 1.73 m², and the median daily proteinuria was 1.38 (interquartile range, 0.66-2.89) g/day.

Performance of the pathology T-score prediction model
Feature reduction was conducted using the AUCRF algorithm, which selects the optimal random forest model with the fewest predictive variables for predicting the presence or absence of T lesions. Clinical variables with a probability of selection higher than 0.7 in repeated cross-validation of the optimal random forest model (optimal AUC = 0.82) were selected. Finally, the features selected by AUCRF for the T prediction model included age, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, and uric acid (Figure 2). The 690 IgAN patients with Oxford MEST-C scores in dataset 1 served as the training set for developing the pathology T-score prediction model. The 1,168 IgAN patients with Oxford MEST-C scores in dataset 2 served as the testing set; they were used only for reporting the performance of the model and not for development or fine-tuning. A predictive model with an AUC higher than 0.75 was considered to have good discriminating ability. The pathology T prediction model achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.82 (95% CI: 0.80-0.85) in the testing set (Figure 3A), with a sensitivity of 0.74 and a specificity of 0.77, which supports its clinical utility.

Performance of the 5-year ESKD prediction model
The unadjusted Cox regression analysis suggested that sex, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, uric acid, triglycerides, and tubular atrophy/interstitial fibrosis (T) were risk factors for developing ESKD (Table 2). A study supported elevated serum IgA as a causal factor in IgA nephropathy through Mendelian randomization (38). Some studies have suggested the association between the poor prognosis of renal disease and dyslipidemia. Higher triglycerides and cholesterol levels have been proven to be independent risk factors for the progression of kidney disease (39). Hence, clinical variables (age, sex, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, uric acid, triglycerides, total cholesterol, low-density lipoprotein, history of previous use of RAS inhibitors and immunosuppressants) and the pathology T lesions (T bio , T0 was assigned 0, T1 and T2 were assigned 1) were used as the input variables of the 5-year ESKD prediction model.
The 1,168 follow-up IgAN patients with Oxford MEST-C scores in dataset 2 were randomly divided into training and testing sets at a ratio of 8:2, giving 936 patients in the training set and 232 in the testing set. The training set was used to perform five-fold cross-validation to select the optimal prediction model, and the testing set was used to assess its performance.
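A random 8:2 split followed by five-fold cross-validation on the training portion can be sketched as below. The helper names and the seed are hypothetical. This simple rounding rule does not reproduce the exact 936/232 counts reported above, which depend on the authors' own splitting procedure.

```python
import random


def split_train_test(ids, test_fraction=0.2, seed=0):
    """Shuffle patient identifiers and hold out a fraction as the testing set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(train_ids, k=5):
    """Yield (fit_ids, validation_ids) pairs for k-fold cross-validation."""
    folds = [train_ids[i::k] for i in range(k)]
    for i in range(k):
        fit_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit_ids, folds[i]


if __name__ == "__main__":
    train_ids, test_ids = split_train_test(range(1168))
    print(len(train_ids), len(test_ids))
    for fit_ids, val_ids in k_folds(train_ids):
        print(len(fit_ids), len(val_ids))
```

Keeping the testing set out of the cross-validation loop means model selection never sees the data used for the final performance estimate.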
The AUC of the 5-year ESKD prediction model using only the above clinical variables as input variables (base model) was 0.86 (95% CI: 0.75-0.97) in the testing set (Figure 3B). To test whether T bio could improve the predictive performance of the 5-year ESKD prediction model, we added T bio to the base model. An increase in AUC [from 0.86 (95% CI: 0.75-0.97) to 0.92 (95% CI: 0.85-0.98); P = 0.03] showed a better discriminating ability, which indicated that T was important for judging the prognosis of patients with IgAN (Figure 3B). To test whether T pre had a similar effect on judging the prognosis of IgAN patients, after training the 5-year ESKD prediction model with the training set, we replaced the T bio in the testing set with the corresponding T pre and assessed discrimination. The AUC was 0.90 (95% CI: 0.82-0.99) in the testing set (Figure 3B). The performance of the base model plus T pre did not differ from that of the base model plus T bio [AUC for the base model plus T pre 0.90 (95% CI: 0.82-0.99) vs. AUC for the base model plus T bio 0.92 (95% CI: 0.85-0.98), P = 0.52, Table 3], which showed that the value of T pre in predicting ESKD was comparable to that of T bio. The calibration of the three prediction models is shown in Figures 4A-C. The P-values for the Hosmer-Lemeshow test of the base model, the base model plus T bio, and the base model plus T pre were 0.42, 0.79, and 0.92, respectively, which indicated that these models had good calibration. These results suggested the importance of T in predicting ESKD, and T pre can be used to assist clinicians in assessing the prognosis of patients without pathology reports. Table 4 shows the performance of the 5-year ESKD prediction model based on different machine learning algorithms in the testing set using T pre. All models had good prediction performance, which indicated that T pre could be used in ESKD predictive models built on different algorithms. For the lifetime ESKD prediction model based on XGBoost using only clinical variables (base model), the C-statistic was 0.82 (95% CI: 0.80-0.84) in the testing set. The discriminating ability of the base model plus T pre was also comparable to that of the base model plus T bio.
FIGURE 2 Variables selected by AUCRF for the pathology T-score prediction model. The importance scores of the clinical variables with a probability of selection higher than 0.7 in repeated cross-validation of the optimal random forest model to predict the presence or absence of T lesions.
FIGURE 3 Receiver operating characteristic curves of the prediction models. The receiver operating characteristic curves for (A) the pathology T-score prediction (T pre) model and (B) the 5-year ESKD prediction model. The base model was the 5-year ESKD prediction model based on the XGBoost algorithm with only clinical variables as input variables. The base model + T bio was the 5-year ESKD prediction model based on XGBoost using clinical variables and the real pathological T lesions score (T bio, T0 was assigned 0, and T1 and T2 were assigned 1). For the base model + T pre, the base model plus T bio was trained using clinical variables and T bio, and the T bio of the testing set was replaced by the corresponding T pre predicted by the pathology T-score prediction model. The clinical variables used for the 5-year ESKD prediction model included age, sex, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, uric acid, triglycerides, total cholesterol, low-density lipoprotein, and history of previous use of renin-angiotensin system (RAS) inhibitors and immunosuppressants. AUC, area under the curve.

External validation of the ESKD prediction model using T pre
The 355 patients without MEST-C scores in dataset 2 were included as the external validation population for evaluating the performance of the 5-year ESKD prediction model. Because these patients did not have MEST-C scores, the T pre predicted by the pathology T-score prediction model was used in the 5-year ESKD prediction model. The AUC of the 5-year ESKD prediction model using T pre based on XGBoost was 0.93 (95% CI: 0.87-0.99). The AUCs of the other machine learning algorithms are listed in Table 5.
In the lifetime ESKD prediction model using T pre , the C-statistic was 0.92 (95% CI: 0.90-0.94). We have shown here that both models have a good performance in the external validation set, indicating the reliability of T pre for assisting in evaluating the prognosis of IgAN.

Discussion
We developed a pathology T-score prediction (T pre) model that uses clinical variables to predict whether a patient with IgAN currently has tubulointerstitial lesions when the patient has not undergone a renal biopsy or does not want to repeat the renal biopsy for progression assessment. We further constructed the 5-year/lifetime ESKD prediction model based on the XGBoost algorithm to confirm the importance of T in predicting ESKD, and T pre can replace the real pathological T lesions for assisting clinicians in evaluating the prognosis of IgAN patients without pathology reports. In addition, the ESKD prediction models built on different machine learning algorithms had good discriminating ability when using clinical variables and T pre, which indicated the reliability and universality of T pre for assisting in evaluating the prognosis of IgAN.
Table note: The clinical variables include age, sex, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, uric acid, triglycerides, total cholesterol, low-density lipoprotein, and history of previous use of renin-angiotensin system (RAS) inhibitors and immunosuppressants. T bio, the real pathological T-score quantified as either 0 (absent) or 1 (T1 or T2); T pre, the pathological T-score predicted by the baseline pathology T-score prediction (T pre) model.
For developing the pathology T-score (T pre) prediction model, we first used the AUCRF algorithm to select the clinical variables that may be associated with the tubulointerstitial lesions. Feature selection before training the predictive model can mitigate the curse of dimensionality, reduce training time, prevent overfitting, enhance model generalization ability, and improve the understanding of features and feature values; it also sets an upper limit on what a machine learning task can achieve. The AUCRF is based on the RF algorithm and performs feature reduction by optimizing the area under the ROC curve (AUC) of the random forest (33). It was found that age, systolic arterial pressure, diastolic arterial pressure, proteinuria, eGFR, serum IgA, and uric acid may be the clinical characteristics associated with tubular atrophy/interstitial fibrosis. Mechanism studies are needed to explore the inherent causality of these correlations and predictive capability. There have been reports indicating the association between reduced initial eGFR, higher initial MAP, proteinuria, and tubular atrophy/interstitial fibrosis (31). Next, we used the stacking algorithm to construct the pathology T-score prediction (T pre) model based on the clinical characteristics selected by the AUCRF. A single learner may overfit or underfit; to obtain a learner with excellent generalization performance, multiple individual learners can be trained and combined into a strong learner through a combination strategy.
This method of integrating multiple individual learners is called ensemble learning. Stacking is one of the methods of ensemble learning. The advantage of integration is that different models can learn different features of the data, and the results after fusion tend to perform better (40). As our results showed, when we used an independent dataset as the testing set, the AUC of the pathological T-score prediction (T pre ) model reached 0.82, which indicates the good discriminating ability of this T pre prediction model.
A host of studies have indicated that pathological T lesions play an important role in predicting prognosis (14,35,41). At the same time, most current ESKD prediction models based on different methods or algorithms all include pathology T-score (14,16,19).
Nevertheless, a renal puncture is invasive, which may cause a series of complications and has a host of contraindications, such as severe hypertension, coagulation disorders, solitary kidney, and so on (42). Furthermore, the number of patients at high risk of renal puncture may increase in the near future because of the aging of the population and the increased use of anticoagulant medication (43). For the patients who lack the report of kidney biopsy or do not want to undergo repeat renal puncture for disease progression assessment and evaluation of the effect of drug therapy, the clinician could not assess the prognosis of these patients with IgAN by using the established ESKD prediction model. The pathology T-score prediction (T pre ) model we developed may solve this problem. We also constructed a 5-year/lifetime ESKD prediction model based on XGBoost to assess whether the value of T pre in predicting ESKD of patients with IgAN was consistent with real pathological T-score. The performance of the base model plus T pre was similar to the base model plus T bio , which showed that the T pre can replace the real pathological T-score for prognostic prediction.
As far as we know, this study is the first to construct a pathology T-score prediction model in IgA nephropathy. At the same time, it is also the first study to use a machine learning algorithm to identify clinical variables that may influence the development of tubular atrophy/interstitial fibrosis, which may be useful for assessing the prognosis and targeted medication guidance. However, there is a limitation in our study. The model has been developed and tested in a single-center cohort of patients with IgAN; therefore, multicenter prospective cohort and ethnic-based cohort studies are necessary, which will further confirm the reliability of the pathology T-score prediction model, expand the scope of application of the model, and provide possibilities for clinical application.
In conclusion, our pathology T-score prediction (T pre ) model is a reliable tool for predicting the presence or absence of pathological T lesions. At the same time, it can also be used to assist clinicians in predicting the prognosis of patients with IgAN. A prospective multicenter cohort study is necessary to explore the potential value and robustness of this T prediction tool in the management of IgA nephropathy.

Data availability statement
The data presented in the study are deposited in the GitHub repository (https://github.com/zhangd17-web/IGAN_MI). The characteristics used in the basic model include age, gender, SBP, DBP, eGFR, IgA, UTP, UA, TG, TCHO, LDL, history of corticosteroids/cytotoxic drugs, and renin-angiotensin system blockers.

Author contributions
Research idea and study design: HZ, X-JZ, LW, DZ, and LX. Data acquisition: LX, SS, X-JZ, and HZ. Data analysis/interpretation: LX, DZ, LW, HW, and X-JZ. Statistical analysis: LX and DZ. Supervision or mentorship: X-JZ, HZ, HW, LW, RC, GC, LL, SS, XZ, SH, LD, and JL. Each author contributed important intellectual content during manuscript drafting or revision and agrees to be personally accountable for the individual's own contributions and to ensure that questions pertaining to the accuracy or integrity of any portion of the work, even one in which the author was not directly involved, are appropriately investigated and resolved, with documentation in the literature if appropriate.