Predicting the Risk of Hypertension Based on Several Easy-to-Collect Risk Factors: A Machine Learning Method

Hypertension is a widespread chronic disease. Risk prediction of hypertension is an intervention that contributes to the early prevention and management of hypertension. The implementation of such intervention requires an effective and easy-to-implement hypertension risk prediction model. This study evaluated and compared the performance of four machine learning algorithms on predicting the risk of hypertension based on easy-to-collect risk factors. A dataset of 29,700 samples collected through a physical examination was used for model training and testing. Firstly, we identified easy-to-collect risk factors of hypertension, through univariate logistic regression analysis. Then, based on the selected features, 10-fold cross-validation was utilized to optimize four models, random forest (RF), CatBoost, MLP neural network and logistic regression (LR), to find the best hyper-parameters on the training set. Finally, the performance of models was evaluated by AUC, accuracy, sensitivity and specificity on the test set. The experimental results showed that the RF model outperformed the other three models, and achieved an AUC of 0.92, an accuracy of 0.82, a sensitivity of 0.83 and a specificity of 0.81. In addition, Body Mass Index (BMI), age, family history and waist circumference (WC) are the four primary risk factors of hypertension. These findings reveal that it is feasible to use machine learning algorithms, especially RF, to predict hypertension risk without clinical or genetic data. The technique can provide a non-invasive and economical way for the prevention and management of hypertension in a large population.


INTRODUCTION
The expert system can learn medical knowledge and expert experience, and finally simulate expert diagnosis and treatment ideas and draw conclusions, which can help diagnose and analyze human diseases (1). The management of human diseases urgently needs an expert system to assist in real-time diagnosis and personalized prevention or treatment guidance. With the development of artificial intelligence (AI) and fuzzy logic, the effectiveness of expert systems in the medical field has been widely reported (2)(3)(4)(5)(6). As a core technology of AI, machine learning is the foundation of expert systems (7). Supervised machine learning algorithms have been used in traditional disease risk prediction models to improve the accuracy of classification (8).
Hypertension is a widespread cardiovascular disease (9,10) and has been ranked as the leading risk factor for death and the third leading risk factor for economic burden (11). Moreover, hypertension usually arises and progresses without symptoms. A sustained rise in blood pressure in hypertension patients often leads to complications such as arteriosclerosis, myocardial infarction, and stroke (12,13). Fortunately, previous studies have consistently indicated that early lifestyle modification can prevent and control the development of hypertension (14). Therefore, it is critical to assess an individual's risk of hypertension and to screen for hypertension early. A hypertension risk prediction model can identify high-risk groups and detect hypertension patients at an early stage. With early warning from lifestyle risk factor indicators, individuals can give up unhealthy behaviors to prevent and control hypertension (15)(16)(17)(18). Identifying lifestyle risk factors and detecting hypertension early therefore play an important role in the prevention and management of hypertension.
Existing hypertension risk prediction approaches can be roughly classified into cohort-based and cross-sectional data-based approaches. The former focus on obtaining the absolute risk of hypertension and require long-term longitudinal data, which limits their application. In contrast, the latter employ features extracted from cross-sectional data to evaluate the current risk of hypertension and to screen for hypertension, which is also of great value in the prevention and management of hypertension. The most closely related recent work has focused on predicting current risk by analyzing clinical indicators or genetic information. Ture et al. (19) constructed different hypertension prediction models based on clinical indicators and found that neural networks outperformed decision trees and traditional statistical algorithms. Held (20) generated a hypertension prediction model using LR based on age, sex, smoking, the age * sex term and genetic information. Wang et al. (21) used the K-means algorithm to obtain balanced experimental data and then utilized a neural network to establish a hypertension prediction model. A Swedish hypertension risk model (22) employed LR to study heart rate, memory and metabolic characteristics and their association with the prevalence of hypertension. In a multi-ethnic study, Lopez-Martinez et al. (23) utilized LR to build a hypertension prediction model based on the classification values of each risk factor, and the performance of the model was better than random guessing.
Extensive research efforts have been dedicated to hypertension risk prediction. However, it remains difficult to apply these models in a large population because the predictors are complex to collect and the predictive ability of the models is unsatisfactory. Firstly, the predictors of current models all contain biochemical indicators or genetic information, which require complex measurement and are unavailable in some settings, such as rural areas or some community health service centers. Secondly, compared with models based on biochemical indicators or genetic information, a model based on lifestyle risk factors can effectively identify the risk level of hypertension and contribute to targeted intervention. Thirdly, it is urgent to develop an effective model that uses only easy-to-collect risk factors (no biochemical or genetic information) and improves predictive performance. Furthermore, the poor interpretability of previous prediction models limits their practical application.
The objectives of this study were to evaluate and compare the performance of four machine learning algorithms in predicting the risk of hypertension from easy-to-collect information, and to choose the best algorithm for developing a hypertension risk prediction model based on that information. The four machine learning algorithms used in this study were RF, CatBoost, MLP neural network and LR.

Material
The data set used to construct the model in this paper comes from the physical examination center of a hospital in Beijing, China. A total of 29,750 cases with complete data were collected, comprising 10,650 hypertension cases and 19,100 normal cases. Most of the normal cases are between 18 and 70 years old, and most hypertension cases are between 20 and 75 years old. To ensure a similar age distribution between hypertension and normal cases, the sample was further restricted to ages between 20 and 70 years. For the selected 10,625 hypertension cases and 19,080 normal controls, we cleaned the data by eliminating subjects with significant outliers (values deviating by 3 times the interquartile range or more). After applying the inclusion and exclusion criteria and cleaning the data, 10,623 hypertension cases and 19,077 normal controls were finally included in this study. We promise to keep the patients' information strictly confidential. In accordance with the Helsinki Declaration, the study was approved by the Ethics Committee of the Hefei Institute of Physical Science, Chinese Academy of Sciences (No. Y-2018-29).
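As an illustration of the outlier rule described above, the following minimal sketch drops values that lie 3 interquartile ranges or more outside the quartiles. The fence convention (measuring the distance from the first and third quartiles), the function name and the sample values are assumptions made for illustration; the exact procedure applied in this study is not specified beyond the 3 times interquartile range criterion.

```python
import numpy as np

def iqr_inlier_mask(values, k=3.0):
    """Return True for values strictly inside (Q1 - k*IQR, Q3 + k*IQR)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # A value counts as an outlier when it deviates by k*IQR or more.
    return (values > q1 - k * iqr) & (values < q3 + k * iqr)

# Hypothetical waist circumference readings (cm); 250 is an implausible entry.
wc = np.array([78, 82, 85, 88, 90, 93, 95, 250])
keep = iqr_inlier_mask(wc)
cleaned = wc[keep]
```

In practice the same mask would be computed for each continuous variable and a subject removed if any of its values is flagged.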
The anthropometric information and blood pressure of the subjects were measured by professional medical workers using standard measurement methods. According to the diagnostic criteria of the "Chinese Hypertension Prevention Guide," hypertension is defined as having been diagnosed with hypertension in a hospital, or having an average systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg in this physical examination. The anthropometric indicators mainly include height, weight and WC. During the measurement, the subjects were required to wear light clothes and maintain a correct standing posture. BMI is calculated according to the standard formula [BMI = weight/height² (kg/m²)]. WC is measured by passing a tape once around the subject at 1 cm above the navel (24). Professional medical staff used standardized epidemiological questionnaires to interview subjects and collect basic demographic and lifestyle information. Family history refers to the hypertension status of one's parents. Smoke is defined as smoking every day for more than 6 months (25). Drink refers to drinking at least once a week for more than 6 months (26). Occupation refers to one's occupation type. Physical activity represents physical activity status; frequent physical activity refers to 30 min of moderate-intensity exercise performed at least three times a week (27). A healthy diet is defined as a total score of the healthy eating index (HEI) > 51 (28). Psychological pressure refers to a total score of the Perceived Stress Scale (PSS) ≥ 43 (29). Table 1 shows the details of these variables.
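The two definitions above that involve arithmetic, BMI and the hypertension label, can be sketched as follows. The function and argument names are hypothetical and chosen only for illustration; the thresholds are the ones stated in the text.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2: weight divided by height squared."""
    return weight_kg / height_m ** 2

def is_hypertensive(sbp_mmhg, dbp_mmhg, previously_diagnosed=False):
    """Label a subject as hypertensive by prior diagnosis or measured pressure.

    sbp_mmhg and dbp_mmhg are the averaged systolic and diastolic
    readings from the physical examination.
    """
    return previously_diagnosed or sbp_mmhg >= 140 or dbp_mmhg >= 90

example_bmi = bmi(70.0, 1.75)            # a 70 kg, 1.75 m subject
example_label = is_hypertensive(138, 92)  # diastolic reading meets the cutoff
```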
To verify the performance of the four machine learning algorithms on our data, we randomly divided the dataset into a training set and a test set according to ratio 4:1. There is no significant difference in each variable between the training set and the test set. The main characteristics of the training set and the test set are shown in Table 2.

Feature Selection
The variables used to construct the hypertension risk prediction model must meet the following two conditions: (1) it is an easy-to-collect variable, including basic demographic information, anthropometric information, or lifestyle information; (2) it is statistically significantly associated with hypertension in univariate logistic regression analysis (p < 0.05) (30).
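The second condition can be illustrated with a minimal sketch of univariate screening: each candidate variable is fitted alone in a logistic regression, and it is retained if the Wald test on its coefficient gives p < 0.05. The Newton-Raphson fit, the Wald test, the function name and the synthetic data are assumptions made for illustration; the statistical software and test actually used in this study are not stated.

```python
import math
import numpy as np

def univariate_logit_pvalue(x, y, n_iter=25):
    """Fit y ~ intercept + x by Newton-Raphson; return (coef, Wald p-value)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept plus one predictor
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    # Covariance of the estimates from the information matrix at the optimum.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    z = beta[1] / math.sqrt(cov[1, 1])
    # Two-sided p-value under the standard normal distribution.
    return beta[1], math.erfc(abs(z) / math.sqrt(2.0))

# Synthetic data: "age" drives the outcome, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 70, n)
noise = rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-0.08 * (age - 45)))
label = (rng.random(n) < prob).astype(int)

candidates = {"age": age, "noise": noise}
p_values = {name: univariate_logit_pvalue(col, label)[1]
            for name, col in candidates.items()}
selected = [name for name, p in p_values.items() if p < 0.05]
```

Categorical candidates such as occupation would need dummy coding before this kind of screening, which the sketch does not cover.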

Machine Learning Algorithms
In this study, we used four machine learning techniques to develop four models based on easy-to-collect variables to predict the risk of hypertension: RF, CatBoost, MLP neural network and LR.

RF
RF is an ensemble machine learning method with decision trees as the base classifier (31). Each decision tree is built on a different sub-dataset and a different subset of features, so the trees differ from one another and are independent. The final classification is obtained by voting across the decision trees. This approach reduces the variance of individual decision trees (32). Thus, RF can analyze classification characteristics with complex interactions, and it is very robust to noisy data and data with missing values. Meanwhile, the learning speed of RF is also very fast.

CatBoost
CatBoost is a new ensemble algorithm based on decision tree gradient boosting (33). CatBoost uses combined categorical features, which can take advantage of the connections between features and greatly enrich the feature dimension. Therefore, CatBoost is intrinsically more efficient and has better predictive performance compared with the traditional boosting algorithm in the case of categorical features.

MLP Neural Network
As a non-linear mapping model, the MLP neural network is flexible and effective in modeling complex relationships between inputs and outputs (34). It includes an input layer, a hidden layer, and an output layer. Each layer is fully connected to the previous layer. The MLP neural network is trained with the error backpropagation algorithm (35). In each iteration, it compares the training outputs with the expected results and uses the error to adjust the weights and thresholds, so that the outputs gradually become consistent with the expected results. The process terminates when the error becomes sufficiently small.

LR
LR is a generalized linear regression analysis algorithm that can explore the relationship between a categorical dependent variable and several independent variables (30) and connect the values of the independent variables with the probability of the event defined by the dependent variable. The LR algorithm assumes that the predicted value is the linear addition of all products of independent variables and corresponding coefficients.
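The assumption described above can be written out explicitly. In the standard formulation of LR, with symbols introduced here for illustration only (p is the probability of hypertension, x_1 to x_k are the independent variables and β_0 to β_k are the coefficients), the linear combination is linked to the probability through the logit function:

```latex
\ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i ,
\qquad
p = P(y = 1 \mid x_1,\dots,x_k)
  = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{k} \beta_i x_i\right)} .
```

Each coefficient β_i is therefore the change in the log-odds of hypertension per unit increase in x_i, with the other variables held fixed.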

Hyper-Parameters Tuning and Model Development
To evaluate the performance of the four machine learning models, we randomly divided the data into a training set and a test set at a ratio of 4:1. For each of the four techniques, the training set was used to tune the hyper-parameters and construct the model, and the test set was used to evaluate the performance of the model. In each round of 10-fold cross-validation, the training set was divided into a training subset and a validation set at a ratio of 9:1, and the cross-validated AUC was used as the evaluation indicator for selecting the hyper-parameters. After the optimal hyper-parameters were determined, the training set was used to fit and generate the final model. All the algorithms were implemented in Python 2.7.
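The tuning procedure can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the data are synthetic, the hyper-parameter grid is hypothetical, the folds are stratified, and the sketch is written for Python 3 rather than Python 2.7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the 10 selected features (about 36% positive cases).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.64, 0.36], random_state=0)

# 4:1 random split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical search grid; the tuned values of the study are in Table 5.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

# 10-fold cross-validation on the training set, scored by AUC.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# With the default refit=True, the best setting is refitted on the
# whole training set, giving the final model.
final_model = search.best_estimator_
```

The test set is left untouched during the search, so it can be used afterwards for an unbiased evaluation of `final_model`.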

Evaluation Metrics
The performance of the predictive model is evaluated by the receiver operating characteristic (ROC) curve, accuracy, sensitivity, specificity, and Youden index. Accuracy refers to the ratio of correctly classified samples to the total number of samples. Sensitivity refers to the proportion of positive samples that are predicted to be positive. Specificity refers to the proportion of negative samples that are predicted to be negative. The Youden index equals sensitivity + specificity - 1. The classification confusion matrix (36) is shown in Table 3. The value of AUC is equal to the probability that the prediction value is greater for a randomly given positive sample than a randomly given negative sample (37). The calculation formula of AUC is as follows: $\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\, d(\mathrm{FPR})$. In the above formula, TPR stands for true positive rate and FPR stands for false positive rate.
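For concreteness, the threshold-dependent metrics can be computed from the four cells of a confusion matrix as in the sketch below. The function name and the counts are hypothetical and are not taken from the results of this study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute metrics from confusion-matrix counts.

    tp/fn: positive samples predicted positive/negative.
    fp/tn: negative samples predicted positive/negative.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    youden = sensitivity + specificity - 1
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "youden": youden}

# Hypothetical counts: 50 positive and 100 negative samples.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=80)
```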

Feature Importance
One weakness of machine learning methods is that the learning process is a black box operation, and the results are poorly interpretable. In this study, we calculated the importance of each feature to improve the interpretability of the model. To calculate the importance of a feature, we repeated the testing process 10 times. In each testing process, we successively permuted the values of each feature in the test set and calculated the corresponding decrease in the AUC. The importance of a feature is measured by the average decrease in the AUC of the test set. The larger the value means the greater the contribution of the feature to the model, that is the greater the importance of the feature.
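The permutation procedure described above can be sketched as follows. The function name, the synthetic data and the model settings are assumptions made for illustration; only the logic (permute one feature of the test set, measure the drop in AUC, average over 10 repetitions) follows the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def permutation_importance_auc(model, X_test, y_test, n_repeats=10, seed=0):
    """Mean decrease in test-set AUC when each feature is permuted."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    drops = np.zeros((n_repeats, X_test.shape[1]))
    for r in range(n_repeats):
        for j in range(X_test.shape[1]):
            X_perm = X_test.copy()
            # Shuffle one column to break its link with the outcome.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            auc = roc_auc_score(y_test, model.predict_proba(X_perm)[:, 1])
            drops[r, j] = baseline - auc
    return drops.mean(axis=0)

# Synthetic data: with shuffle=False the first two columns are informative
# and the remaining three are pure noise.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

importance = permutation_importance_auc(model, X_test, y_test)
ranking = np.argsort(importance)[::-1]  # most important feature first
```

Because the importance is measured on held-out data, it reflects how much the fitted model relies on each feature for prediction rather than how often the feature is used inside the trees.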

Selected Features
To select the input features for the prediction model, a univariate logistic regression analysis was performed separately for each of the 11 easy-to-collect hypertension risk factors on the training set. According to the variable inclusion criterion of statistical significance (p-value < 0.05), psychological pressure was excluded (p = 0.097). Finally, 10 variables (age, gender, BMI, WC, family history, occupation, smoke, drink, healthy diet and physical activity) were selected as the input features of the model. Table 4 presents the results of the univariate logistic regression analysis.

Model Hyper-Parameters
Based on the 10 selected easy-to-collect risk factors, the training set was used to determine the optimal hyper-parameters for RF, CatBoost, MLP neural network and LR, respectively. The hyperparameters of each model under optimal performance are shown in Table 5. Default values were set for other unlisted parameters in the four machine learning algorithms.

Model Performance
As shown in Table 6, the RF model achieved the best AUC on the test set (0.92), followed by the CatBoost model (0.87), the MLP neural network (0.78), and the LR model (0.77). The ROC curves of the four models on the test set are shown in Figure 1. The RF model clearly outperforms the other three models. Because the dataset is imbalanced, we adjusted the classification threshold of each model to maximize the Youden index, and then obtained the classification confusion matrix at that threshold. The resulting accuracy, sensitivity, specificity and Youden index on the test set are shown in Table 6. Based on the above results, the RF model performs the best on most evaluation metrics, including AUC, accuracy and sensitivity.
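The threshold adjustment can be illustrated with the sketch below, which scans the operating points of the ROC curve and keeps the one with the largest Youden index (TPR - FPR). The function name and the toy scores are hypothetical; the sketch relies on `roc_curve` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Return the score threshold maximizing TPR - FPR, and that maximum."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    j = tpr - fpr
    best = int(np.argmax(j))
    return thresholds[best], j[best]

# Toy example: 3 positive and 5 negative samples with predicted risks.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])
threshold, j_max = youden_threshold(y_true, scores)

# Samples with a score at or above the threshold are classified as positive.
predicted = (scores >= threshold).astype(int)
```

In this toy example the selected threshold is 0.4, which captures all three positive samples at the cost of one false positive (sensitivity 1.0, specificity 0.8).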

Feature Importance
We calculated the importance of each feature in the RF model, which achieved the best performance. The order of importance of each feature is shown in Figure 3. The top 4 features in order of importance were BMI, age, family history, and WC. Later, smoke, drink, gender, occupation, healthy diet, and physical activity were the features ranked 5 to 10 in order of importance.

Principal Findings
In this study, four machine learning algorithms were evaluated and compared for hypertension risk prediction based on easy-to-collect risk factors. The risk factors included 4 basic demographic indicators (gender, age, occupation and family history of hypertension), 2 anthropometric indexes (BMI and WC), and 4 lifestyle indicators (healthy diet, smoke, drink and physical activity). The results indicated that the three non-linear models performed better than the LR model (AUC: 0.77), which suggests a non-linear relationship between the independent variables and the dependent variable. Among the three non-linear machine learning models, the RF model outperformed the CatBoost and MLP neural network models, with an AUC of 0.92 and an accuracy of 0.82. RF is a bagging ensemble algorithm based on multiple decision trees, in which random selection of samples and features is introduced into the training process. In the RF algorithm, there is no dependence between weak learners, so they can be trained in parallel. All these attributes contribute to the excellent performance of RF in many classification studies (38,39). CatBoost is a boosting-based ensemble algorithm that is particularly good at handling categorical variables. The MLP neural network algorithm is more often used for unstructured data and data with complex structures. Our sample data are structured and contain categorical features, which suits the RF and CatBoost algorithms. As expected, these two models performed well on our data. In addition, the RF model  . Therefore, we believe RF is more reliable than CatBoost in terms of our data. Our results are consistent with a study of classification performance (38), in which the RF algorithm performed the best among the 179 classification algorithms on 121 UCI datasets.
Compared with statistical methods, the performance of the model built by the machine learning algorithm is better, but the disadvantage is the poor interpretability of the model. The process of machine learning to build a model is to learn the potential rules of input and output of training data, so it can fit complex non-linear relationships, and then get a trained model and predict new input data. However, the rule of the training data is unknown, so the process of machine learning is often called a black-box operation. To increase the transparency of the model and provide health education to residents in practical applications, we measure the effect of each feature on the performance of the RF model by calculating the average value of AUC reduction caused by permuting the values of each feature and explore the causal relationship between independent variables and dependent variable.
BMI, age, family history, and WC were the top four important features. Among them, BMI and age were the top two features. Wiewiora et al. (40) and Szpalski et al. (41) showed that obesity increased cardiac output and the resistance of peripheral blood vessels, resulting in increased blood pressure. Among the many indicators of obesity, BMI was most closely related to hypertension (42). Mariunas et al. (43,44) showed that the elasticity of blood vessels decreased with age, so that blood pressure would rise to meet the blood demand of the whole body. Therefore, age was an important risk factor for hypertension. The risk factors that followed were family history of hypertension and WC. A large population study (45) showed that people with a family history of hypertension had a 1.79 times higher risk of hypertension than those without a family history of hypertension in China. The blood pressure level and prevalence of hypertension in those whose parents both had hypertension were significantly higher than in those with only one hypertensive parent. The prevalence of hypertension in those whose parents both had hypertension was about twice that of those without a family history of hypertension. These results are consistent with this study. WC was an important indicator of central obesity. Previous studies (46,47) indicated that the risk of hypertension in centrally obese patients was much higher than that of the normal population. Therefore, WC was also an important risk factor for hypertension. Our results showed that WC remained an important predictor of hypertension even after the effect of BMI was taken into account. This was consistent with previous studies (42,48). Smoke and drink were the next two important risk factors. Then, in order of importance, the risk factors were gender, occupation, healthy diet, and physical activity.
Limited by the complexity of modern machine learning algorithms, we still cannot intuitively understand the relationship between independent and dependent variables in the model. However, the importance ranking of features indicated that the underlying rules in the data set learned by the RF algorithm were consistent with the findings of previous studies, which suggested that older and obese people had the highest risk of hypertension and that other unhealthy lifestyles would increase the risk of hypertension.
The RF model constructed in this study has made significant progress compared to similar previous hypertension prediction models. As shown in Table 7, we reviewed previous studies of hypertension prediction models. Ture et al. (19) built a hypertension prediction model based on lipoprotein (a), triglyceride, uric acid, total cholesterol and other biochemical indicators using a neural network; the AUC of the model was 0.81. LR was used in the Swedish hypertension risk prediction model based on age, gender, BMI, heart rate, glycolipid parameters and other memory elements; the model got an AUC of 0.66 (22). Wang et al. (21) used a neural network to build a hypertension prediction model. The model with 10 hidden layers had the best performance, with an AUC of 0.77. A hypertension prediction model based on genetic information, which was constructed using LR, achieved an accuracy of 0.77 (20). Lopez-Martinez et al. (23) utilized LR to construct a hypertension prediction model based on independent risk factors. The prediction model got an AUC of 0.73 and outperformed random guessing. Most models showed fair agreement with the final diagnosis, with AUC values between 0.7 and 0.8. In this study, the RF model achieved a higher AUC (0.92) than the previous models. Studies have indicated that different ethnic populations have different characteristics of hypertension (49,50), which likely contributes to the different AUCs of different models. Nevertheless, this study revealed a superior ability of the RF algorithm in distinguishing high-risk and low-risk populations of hypertension.
On the other hand, the input variables of the previous hypertension prediction models all contained biochemical indicators or genetic information. Lipoprotein, triglyceride, uric acid and total cholesterol were required in the research of Ture et al. (19). Although Wang et al. (21) and Lopez-Martinez et al. (23) used questionnaires to obtain predictors, information about dyslipidemia and diabetes was required in the former (21) and information on kidney disease and diabetes was needed in the latter (23). Genetic information was required when using Held's model (20). Glucose and lipid parameters were the input variables of the model in Fava's research (22). The acquisition of these variables requires biochemical testing, which makes these models unavailable for residents who cannot undergo biochemical testing in time. Thus, these models are not suitable or practical for hypertension prediction in a large population, which limits their application in the prevention and management of hypertension. Unlike previous prediction models constructed in other populations, the input variables of the RF model in this study are non-invasive and can be easily collected, which facilitates the application of the model.

Limitations of This Study
This study still has several limitations. Firstly, the data set used for model construction in the study was derived from cross-sectional data of physical examination. Although the model cannot predict the absolute risk of hypertension, it can distinguish high-risk and low-risk groups of hypertension. Secondly, the data used in the study were collected from a local hospital, which means it can only represent the characteristics of hypertension among residents in this specific area. Therefore, the generalization of the RF model established in this study to other regions needs further research and confirmation. Lastly, we did not evaluate the effect of all possible lifestyle variables because they were not included in the health examination. Therefore, occupation was the only new risk factor for hypertension identified in this study. Further research needs to incorporate more lifestyle information.

CONCLUSIONS
In this study, we evaluated and compared four machine learning algorithms in predicting hypertension risk based on easy-to-collect risk factors. The dataset consisted of health checkup information collected through physical examinations in a hospital in Beijing. Results showed that the RF model outperformed the other three machine learning methods, achieving an AUC of 0.92, an accuracy of 0.82, a sensitivity of 0.83, and a specificity of 0.81. The results revealed that the RF model could distinguish high-risk and low-risk populations of hypertension based on easy-to-collect variables. Thus, the RF model has great application value in the prevention and management of hypertension.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of the Hefei Institute of Physical Science, Chinese Academy of Sciences. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.