Comparison of Biological Age Prediction Models Using Clinical Biomarkers Commonly Measured in Clinical Practice Settings: AI Techniques Vs. Traditional Statistical Methods

In this work, we analyzed health check-up data from more than 111,000 subjects, using only records in which all 35 variables were entered. To predict biological age (BA), we applied traditional statistical methods alongside four widely used AI techniques (RF, XGB, SVR, and DNN) and compared their predictive power. On average, AI models produced a linear relationship about 1.6 times stronger than that of the statistical models. In addition, regression analysis of the predicted BA and chronological age (CA) revealed similar differences in both the correlation coefficients (linear model: 0.831, polynomial model: 0.996, XGB model: 0.66, RF model: 0.927, SVR model: 0.787, DNN model: 0.998) and the R² values. Through this work, we confirmed that AI techniques such as the DNN model outperformed traditional statistical methods in predicting biological age.


INTRODUCTION
Chronological age (CA) is a commonly used indicator of aging. However, life expectancy varies considerably among individuals with equal or similar CAs due to diversity in genotypes, living habits, and environments. An individual aged 50 may have the physical function of a 60-year-old, and many people look older or younger than others of the same CA (even among twins). It is therefore well known that CA is not an optimal indicator of aging progress (Jia et al., 2017).
Accordingly, there is an increasingly acknowledged need to obtain various aging-related biomarkers and translate them into statistical models capable of reflecting the overall aging status of an individual. Because statistical models lack standardized measures of aging, various models have been devised based on cognitive age, physical health age, biological age (BA), work ability index, and vulnerability index, combining physical, physiological, and biochemical parameters using mathematical methods (Jia et al., 2017).
Among these, BA is a commonly used age estimate on an individual basis. A well-known line of work in the aging literature is BA estimation based on biomarkers identified as highly correlated with age (Nakamura et al., 1998; Ingram et al., 2001; Jackson et al., 2003; Bae et al., 2008; Jee et al., 2012; Bae et al., 2013; KYE HWA LEE, 2013; Sebastiani et al., 2017; Horvath and Raj, 2018; Le Goallec and Patel, 2019). The underlying hypothesis of such BA estimation studies is that BA measured in relatively healthy adults reflects their actual health status better than CA does (Thompson and Voss, 2009; Li et al., 2018; Liu et al., 2019).
To predict BA, previous studies have applied traditional statistical methods to clinical biomarkers, such as multiple linear regression (MLR), principal component analysis (PCA), Hochschild's method, and Klemera and Doubal's method (KDM). In particular, MLR and PCA have been the most popular statistical methods for BA prediction using clinical biomarkers (Jia et al., 2017; Li et al., 2018; Liu et al., 2019; Liu, 2021).

Subjects
This study was conducted on 116,829 subjects aged 20 or older, comprising 80,373 men and 36,456 women, who received routine health check-ups from 2015 through 2017 at university medical centers and community hospitals in Korea. We obtained permission from subjects who visited the hospitals for their annual health check-up to use their data, excluding any identifiable items (e.g., name, resident ID). Informed consent for this use was obtained from all participants. Subjects found during health check-ups to have severe diseases such as cancer, malignant hypertension, uncontrolled diabetes, or heart, lung, liver, pancreas, or renal failure were excluded in order to capture changes in the actual BA of each subject during the normal aging process.

Clinical Biomarkers
A routine health check-up included anthropometric measurements, cardiovascular and respiratory function tests, and laboratory tests (blood and urine). Height, weight, lean body mass, and body fat were measured using InBody (Biospace, Korea), a direct segmental multi-frequency bioelectrical impedance device.
Waist circumference was measured in an upright position at the narrowest area between the inferior part of the lowest rib and the iliac crest. Hip circumference was measured at the level of the widest circumference over the greater trochanters.
Blood pressure was measured manually using a sphygmomanometer after 5 min of rest in a sitting position. Both forced vital capacity and forced expiratory volume in 1 s were measured twice with an electronic spirometer in a standing position, and the better record was taken. Blood and urine samples were collected in the morning after an overnight fast of more than 10 h.

Project Pipeline
The project pipeline for this work is as follows. We used 116,829 samples and normalized the raw data of 36 markers, composing a dataset of 35 biomarkers excluding age. Linear, 2nd polynomial, XGB, RF, SVR, and DNN regression models were then trained on the final dataset to predict BA. Additionally, we applied the Permutation Feature Importance (PFI) function to each trained model to compute the feature importance (Figure 1).

Statistical Methods and AI Techniques Used in BA Prediction Models
K-fold cross-validation with five folds was used to train each model. Of the total 116,829 samples, 105,146 were used as the training dataset, while 11,683 samples (10%) served as the test dataset. The training and validation procedures were sequentially repeated five times.
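One possible reading of this split can be sketched in plain Python. The sketch assumes the 10% test set is held out first and the five folds are then formed from the remaining samples; the function name `holdout_then_kfold`, the seed, and the interleaved fold assignment are illustrative assumptions, not details given in the text.

```python
import random

def holdout_then_kfold(n_samples, test_fraction=0.10, n_folds=5, seed=0):
    """Return (test_idx, folds); folds is a list of (train_idx, val_idx) pairs."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Hold out the test set first (assumed order of operations).
    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]
    train_pool = indices[n_test:]

    folds = []
    for k in range(n_folds):
        # Every n_folds-th sample of the shuffled pool goes to validation fold k.
        val_idx = train_pool[k::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in train_pool if i not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds

test_idx, folds = holdout_then_kfold(116_829)
print(len(test_idx), len(folds))
```

With 116,829 samples this reproduces the counts quoted above: 11,683 test samples and a training pool of 105,146 that is rotated through five train/validation splits.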

Linear Regression Model
Linear regression analysis is an important statistical method for the analysis of medical data as it enables the identification and characterization of relationships among multiple factors (Mamoshina et al., 2018). In particular, polynomial regression analysis analyzes the relationship between two or more independent variables and one dependent variable, expressed as a polynomial as shown in Eq. 1 (Schneider et al., 2010).
In polynomial linear regression analysis, least-squares estimation is used to produce a matrix expression as presented in Eq. 2, allowing calculation of the dependent variable Y from the independent variable X (Schneider et al., 2010).
In this paper, a polynomial regression model using 30 independent variables was applied, and Ŷ was inferred by calculating the regression coefficient β̂ from the training dataset. To evaluate the performance of this model, R² was computed between the estimated BA Ŷ and the CA Y.
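Eqs. 1 and 2 are not reproduced in this text. For orientation, the standard matrix form of multiple linear regression and its least-squares solution is given below; it is assumed, not confirmed, to correspond to those equations, and the symbols n (number of subjects) and p (number of independent variables) are introduced here for illustration only.

```latex
% Assumed standard form of the regression model (Eq. 1 is not shown in the source).
Y = X\beta + \varepsilon,
\qquad Y \in \mathbb{R}^{n},\; X \in \mathbb{R}^{n \times (p+1)},\; \beta \in \mathbb{R}^{p+1}

% Least-squares estimate and fitted values (assumed counterpart of Eq. 2).
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} Y,
\qquad \hat{Y} = X\hat{\beta}
```

Here the first column of X is a column of ones for the intercept, and Ŷ plays the role of the estimated BA.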

2nd Polynomial Regression Model
Nonlinear regression analysis was applied. We calculated BA by adding quadratic terms to the independent variable X of Eq. 1 (polynomial linear regression) and applying the result to Eq. 2. R² between the estimated BA and CA was then calculated to evaluate the performance of this model.
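The addition of quadratic terms can be illustrated with a small sketch. It assumes that only the square of each variable is appended; whether pairwise interaction terms were also included is not stated in the text, and `add_quadratic_terms` is a hypothetical helper name.

```python
def add_quadratic_terms(rows):
    """Append the square of every feature: [x1..xp] -> [x1..xp, x1^2..xp^2]."""
    return [list(row) + [x * x for x in row] for row in rows]

X = [[1.0, 2.0], [3.0, -4.0]]
X_quadratic = add_quadratic_terms(X)
print(X_quadratic)  # [[1.0, 2.0, 1.0, 4.0], [3.0, -4.0, 9.0, 16.0]]
```

The expanded matrix is then fitted by the same least-squares procedure as the linear model, so the model remains linear in its coefficients while being nonlinear in the biomarkers.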

XGB Regression Model
XGB, a scalable end-to-end tree boosting algorithm, improves on the performance of the Gradient Boosting Machine (GBM) through a novel sparsity-aware algorithm and a weighted quantile sketch. The XGB algorithm also effectively reduces overfitting (Chen and Guestrin, 2016). It can be useful in developing biomarkers because it can calculate the feature importance, which helps determine the usefulness of each variable. The efficiency of the booster can be improved by setting the tree booster parameters. In this study, the parameters were set as shown in Table 1 below, and BA was derived from Eqs 1, 2. Based on these results, R² was calculated for performance evaluation.

RF Regression Model
RF regression analysis is an ensemble algorithm that operates by constructing a multitude of decision trees. RF applies bagging randomly to construct sub-trees, resulting in a reduction of variance, bias, and noise, thereby making up for the shortfalls of single decision trees (Breiman, 2001). It consists of numerous randomized sub-trees, each trained independently. Prediction is made by voting on the results of the sub-trees to produce an optimal result. As with XGB, RF can also compute the feature importance, making it possible to measure how useful each variable is. As the RF model requires parameters for constructing sub-trees, we set them as shown in Table 2 below. R² was calculated for the model's performance evaluation.

SVR Model
Support Vector Regression (SVR) is a regression method that uses the same principles as the Support Vector Machine (SVM) for classification. SVR constructs a hyperplane or a set of hyperplanes in a high-dimensional space that can be used for linear or nonlinear regression. As the SVR model requires parameters for constructing hyperplanes, we set them as shown in Table 3 below. R² was calculated for the model's performance evaluation.

DNN Regression Model
Deep learning solves problems of neural network algorithms and applies a composition of multilayer neural networks (Ravì et al., 2017). This study applied the DNN regression technique, which is widely used for classification and regression analysis, as described by Ravì et al. (2017), who summarized the strengths and weaknesses of each deep learning architecture. As shown in Figure 2, a DNN model uses the backpropagation algorithm, in which the difference between the output and the correct answer (the error) is calculated and used to adjust the weight values (LeCun et al., 2015). Similar to the statistical technique, DNN regression defines the mean squared error (MSE) loss function as shown in Eq. 3, and aims to obtain highly accurate results by calculating weights that minimize the MSE. As in the abovementioned polynomial regression analysis, the loss function is expressed in matrix form, and gradient descent (GD) is used to find the values that minimize the error. The MSE is calculated by comparing the outputs with the correct answers, and the weights are updated according to the gradient (error derivative) during the backpropagation process (Figure 2). The core operation of DNN regression is to find the weight values that gradually minimize the MSE by repeating the above process.
We built the neural network for DNN regression by setting the parameters as shown in Table 4 below. R² was calculated using the test dataset for the model's performance evaluation.
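The weight-update loop described above can be illustrated on the smallest possible case. The sketch below fits a single linear unit, not a multilayer network, by gradient descent on the MSE; the learning rate, epoch count, toy data, and function names are assumptions chosen for illustration and are not the settings of Table 4.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_by_gradient_descent(x, y, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        pred = [w * xi + b for xi in x]
        # Derivatives of mean((pred - y)^2) with respect to w and b.
        grad_w = 2.0 / n * sum((p - yi) * xi for p, yi, xi in zip(pred, y, x))
        grad_b = 2.0 / n * sum(p - yi for p, yi in zip(pred, y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1
w, b = fit_by_gradient_descent(x, y)
final_loss = mse(y, [w * xi + b for xi in x])
print(round(w, 4), round(b, 4))
```

In a DNN the same update is applied to every weight in every layer, with backpropagation supplying the gradients through the hidden layers and their activation functions.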

Statistical Analysis
In this work, we used Python version 3.9.0, with statistical significance set at p < 0.05. To validate BA prediction accuracy, R² and RMSE (root mean squared error), important measures in regression analysis, were calculated and compared between models. Additionally, we compared PFI scores to measure the effects of clinical biomarkers on BA prediction. PFI analysis assigns a score to each input feature based on how useful it is in predicting the target variable. Although XGB and RF provide their own built-in feature importance index, a deep learning model does not, so the model-specific indices cannot be compared directly across models. Therefore, we used a Python package called eli5 to provide a common index for this paper, with R² as the measure used to assess the PFI score.
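The idea behind permutation-based scoring with R² as the measure can be sketched in plain Python. This is not the eli5 implementation used in the study; `permutation_importance`, `r2_score`, `toy_model`, the number of repeats, and the synthetic data are hypothetical and serve only to show the principle.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions against targets."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in R^2 when one feature column is shuffled; one score per feature."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        scores.append(sum(drops) / n_repeats)
    return scores

# Toy stand-in for a trained model: it depends on feature 0 and ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

Because the score only requires a prediction function, the same procedure applies unchanged to linear, tree-based, and neural network models, which is what makes it usable as a common index.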

Ethical Permission
This study was approved by the Korea Institute of Bioethics Policy (KoNIBP) Electromagnetic Concern Committee (e-IRB), as it was judged to be exempt from examination (confirmation of exemption from examination). All methods were performed in accordance with the relevant ethical guidelines and regulations.

Characteristics of Study Subjects and Correlation Between Age and Biomarkers
The data obtained from routine health check-ups (from 2015 through 2017) included 116,829 subjects, consisting of 80,373 males and 36,456 females. Mean age was 45.51 ± 10.13 years, and the details of data analysis are presented in Table 5.
To investigate the correlation between CA and each clinical biomarker, we performed Pearson correlation analysis between age and the biomarkers. The statistical significance of clinical biomarkers was set at p < 0.05. The variables exhibiting the strongest correlation in this study were LBM (r = −0.822, p < 0.001), HT (r = −0.730, p < 0.001), and FVC (r = −0.695, p < 0.001), in that order (Table 5).
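For reference, the Pearson coefficient used in this analysis can be computed as in the sketch below. The age and marker values are made-up illustration data, not study data, and the p-value calculation is omitted.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

age = [25, 35, 45, 55, 65]
marker = [60.0, 57.0, 55.0, 50.0, 48.0]  # hypothetical biomarker that declines with age
r = pearson_r(age, marker)
print(round(r, 3))
```

A negative r, as reported for LBM, HT, and FVC, means the biomarker tends to decrease as age increases.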

Diagnosis of BA by Regression Models
Machine learning with hyperparameter optimization of the regression models was performed to evaluate the adequacy of the calculated BA. Our evaluation consists of an inner loop, which optimizes the hyperparameters of the regression models, and an outer cross-validation step, as shown in Figure 3. Except for the linear regression and the 2nd-order nonlinear regression models, we obtained the optimal hyperparameters from the estimated R² by iterating the inner loop for the ensemble models (XGB and RF) and the nonlinear models (SVR and DNN) over the hyperparameter ranges in Tables 1-4. A part of the hyperparameter sets is given in Table 6. We adopted fivefold cross-validation to find the optimal hyperparameters among the sets within the given range. For outer validation, we prepared test subjects by randomly sampling 10% of all data. For in-depth verification, we iterated the outer loop 20 times and calculated the mean R² of each hyperparameter set to choose the optimal set for each regression model. The results of the inner loop are shown in Figure 4. According to these results, the 5th hyperparameter set was chosen as optimal for the XGB and RF models, and the 4th hyperparameter set for the SVR and DNN models. We applied the optimal hyperparameters derived through the inner loop to each model and measured the mean R² on the test subjects to evaluate the regression models. The result of outer validation is shown in Figure 5.
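The selection rule at the end of the inner loop, choosing the hyperparameter set with the highest mean R² across folds, can be sketched as follows. The set labels and per-fold scores are hypothetical placeholders, not results from the study, and `select_best_set` is an illustrative name.

```python
def select_best_set(cv_scores):
    """cv_scores maps a hyperparameter-set label to its per-fold R^2 values."""
    means = {name: sum(vals) / len(vals) for name, vals in cv_scores.items()}
    best = max(means, key=means.get)
    return best, means

# Hypothetical per-fold R^2 values for three candidate sets (fivefold cross-validation).
cv_scores = {
    "set_1": [0.61, 0.63, 0.60, 0.62, 0.64],
    "set_2": [0.70, 0.69, 0.71, 0.72, 0.68],
    "set_3": [0.66, 0.67, 0.65, 0.66, 0.66],
}
best, means = select_best_set(cv_scores)
print(best)  # set_2
```

The chosen set is then refitted and scored on the held-out test subjects, which keeps the data used for tuning separate from the data used for the final evaluation.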

Comparison of Six Age Prediction Models Used in AI
The coefficient of determination (R²) is a measure of how close the estimated linear model is to the observed data. The stronger the correlation between the dependent variable and the independent variable, the closer it is to 1. The results obtained after training the six models are shown in Table 7. In general, machine-learning models obtained higher values than statistical models, with the DNN regression model showing the highest coefficient of determination, 0.99804. This indicates a strong correlation between the 35 independent variables and the dependent variable, age, in the DNN regression model. Consistent with the R² results, the DNN regression model also yielded the lowest RMSE between CA and the estimated BA, 0.4479.
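The two metrics compared here can be computed as in the following sketch. The four CA and BA values are invented for illustration and are not study data.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (years)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

ca = [30.0, 40.0, 50.0, 60.0]  # chronological age (hypothetical)
ba = [32.0, 38.0, 51.0, 59.0]  # predicted biological age (hypothetical)
r2 = r2_score(ca, ba)
error = rmse(ca, ba)
print(round(r2, 2), round(error, 3))  # 0.98 1.581
```

For these values the squared residuals sum to 10 and the total sum of squares is 500, giving R² = 0.98 and RMSE = √2.5 ≈ 1.581 years.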

The Regression Between CA and BA
A simple linear regression analysis was performed to examine the linear relationship between the predicted BA and CA. We applied a conventional statistical method, instead of a machine learning technique, to investigate the linear relationship in each model. As shown in Table 8, the linear regression analysis revealed that AI models yielded higher correlation coefficients than statistical models. In particular, the correlation coefficient in the DNN regression model was about 0.99817, a stronger linear relationship than in any other model. The linear relationships were clearly confirmed by both the regression analysis and the distribution of BA and CA, as illustrated in Figure 6. In addition, AI models resulted in smaller RMSE values than traditional regression analysis models, implying better BA prediction accuracy.

Comparison of PFI Scores Between Six Age Prediction Models
Figure 7 shows the PFI scores of the six BA prediction models. In the traditional models, only a small number of variables affected BA prediction, whereas almost all variables had an effect in the AI models. For instance, CCR and CR had the greatest effect on BA prediction in the 2nd polynomial model. By contrast, all features in the DNN model affected BA prediction, with a mean PFI score of 0.39, higher than that of any other model. Figure 8 shows the top 10 features that had the greatest impact on BA prediction. Regarding the variables that the models have in common, WT is shared by four models; SEX, BMI, HT, and AST are shared by five models; and CCR, CR, and WAIST are shared across all six models. The 2nd polynomial model, which achieved the better result among the statistical methods, the XGB model from the decision-tree-based ensemble series, and the DNN model from deep learning share the three features common to all six models (CCR, CR, and WAIST) plus one more feature, AST.

DISCUSSION
In this work, we applied traditional regression methods (linear, polynomial), ensemble methods (RF, XGB), and nonlinear methods (SVR, DNN) for BA prediction. We also analyzed which regression model is appropriate for BA prediction by estimating R² values.
According to the results of our experiments, a nonlinear or an ensemble regression model is more suitable than a linear regression model. To explore the characteristics of the biomarkers we used, we checked the mean of each biomarker by age, as shown in Table 9. This confirmed that more biomarkers have nonlinear characteristics than linear characteristics. It is therefore reasonable that a nonlinear model is suitable as a regression model for BA. Among the nonlinear models, the DNN model seems to be the most robust for BA prediction. This is attributable to the nonlinear transformation that occurs as the data pass through the activation functions of several hidden layers. Specifically, performance varied greatly depending on the activation function chosen. Choosing the activation function carefully is therefore important for building a good DNN regression model.
Regarding the R² results, although comparison is limited by differences in the data and machine learning models applied, this study appears to have outperformed previous studies. In this work, the DNN model produced an R² value of 0.998, the highest among the six models. Among previous studies employing various biomarkers, R² was 0.6 in Peters et al. (2015), 0.75 in Pyrkov et al. (2018), 0.82 in Wang et al. (2021), 0.83 in Zhong et al. (2020), 0.89 in Hannum et al. (2013), 0.92 in Sagers et al. (2020), and 0.93 in Horvath (2013). As in our study, nonlinear models performed well. The superior performance of this study seems to come from differences in the types and numbers of biomarkers used in the prediction models and in the AI models applied. In addition, the data used in this work contain all the variables without omission, which contributed to higher R² values than in previous studies. Combining the results of this study and previous studies, AI models including deep learning appear to have better predictive power than traditional statistical models. Accordingly, AI methods are expected to play a more prominent role in studies on aging.
This study could make up for the main disadvantage of the DNN regression model, its lack of explainability, by comparing the effects of each variable on BA prediction using PFI. Recently, numerous studies on explainable AI (XAI) have been carried out for DNN regression models using feature importance. If such research continues to progress, we can expect to see the realization of XAI services that are interpretable and explainable.
In this work, we analyzed health check-up data from more than 111,000 subjects, using only records in which all 35 variables were entered. To compare BA prediction accuracy, we implemented both AI techniques and traditional statistical methods. This study is noteworthy in that it is the first to make such an attempt.
The key achievements of this study are as follows. First, this study compared and analyzed both traditional statistical methods and popular AI techniques for predicting BA, finding that AI models (especially the DNN regression model) outperformed statistical models in prediction accuracy. Second, the BA prediction accuracy of the DNN model in this study was better than that of similar studies conducted before. Third, we compared and analyzed the effects of biomarkers on BA prediction accuracy using a new technique, the PFI score.
To conclude, this work confirmed that AI techniques such as the DNN model outperformed traditional statistical methods in predicting BA. If technical development continues in areas such as explainable AI (XAI), AI techniques will be more widely applied across medical and health management fields.

TABLE 9 | Characteristics of mean of the biomarkers by age.

Characteristics Biomarkers
Linear (