Prediction of the risk of cytopenia in hospitalized HIV/AIDS patients using machine learning methods based on electronic medical records

Background Cytopenia is a frequent complication among HIV-infected patients who require hospitalization, and it can negatively affect their treatment outcomes. By leveraging machine learning techniques and electronic medical records, a predictive model can be developed to evaluate the risk of cytopenia during hospitalization in HIV patients. Such a model is crucial for designing more individualized and evidence-based treatment strategies for HIV patients. Method The present study was conducted on HIV patients who were admitted to Guangxi Chest Hospital between June 2016 and October 2021. We extracted a total of 66 clinical features from the electronic medical records and used them to train five machine learning prediction models (artificial neural network [ANN], adaptive boosting [AdaBoost], k-nearest neighbour [KNN], support vector machine [SVM], and decision tree [DT]). The models were tested using 20% of the data. Model performance was evaluated using indicators such as the area under the receiver operating characteristic curve (AUC). The best predictive model was interpreted using Shapley additive explanations (SHAP). Result The ANN model had the best predictive power. According to the SHAP interpretation of the ANN model, hypoproteinemia and cancer were the most important predictive features of cytopenia in hospitalized HIV patients. Meanwhile, a lower hemoglobin-to-red cell distribution width ratio (HGB/RDW), lower low-density lipoprotein cholesterol (LDL-C) levels, lower CD4+ T cell counts, and lower creatinine clearance (Ccr) levels increased the risk of cytopenia in hospitalized HIV patients. Conclusion The present study constructed a risk prediction model for cytopenia in HIV patients during hospitalization using machine learning and electronic medical record information. The prediction model is important for the rational management of hospitalized HIV patients and for setting personalized treatment plans.


Background
The human immunodeficiency virus (HIV) not only damages the function of the immune system, but also has a negative impact on the body's hematopoietic system (1). Cytopenia is one of the common complications of HIV infection (2), and its common types are anemia, thrombocytopenia and leukopenia. Among HIV patients, anemia is an independent factor in both the acceleration of disease progression and the decline in quality of life (3). The reported prevalence of anemia ranges from 1.3 to 95% (4). There are currently relatively few reports on the prevalence of leukopenia and thrombocytopenia and their associated factors. The most common type of leukopenia is neutropenia. Neutropenia affects 5 to 30% of patients in the early stages of HIV infection, whereas in patients with late-stage HIV infection its prevalence can reach 57 to 76% (5)(6)(7). The prevalence of thrombocytopenia among HIV patients ranges from 4.1 to 40% (8). Cytopenia may negatively affect treatment outcomes and accelerate disease progression in patients with HIV (9). The causes of cytopenia in HIV patients are complicated. Factors reported to correlate with the occurrence of cytopenias in HIV patients include the direct effects of HIV infection, the effects of drug therapy, and opportunistic infections (OIs) (10)(11)(12). CD4+ T cell counts, a marker of acquired immunodeficiency syndrome (AIDS) progression, have also been shown to correlate with cytopenia (13).
Machine learning has been widely applied in medicine in recent years, for example in cancer diagnosis (14), medical imaging (15) and death prediction (16). Numerous machine learning algorithms have demonstrated their potential for application to large-scale biomedical and patient datasets. Machine learning can balance the bias and variance of a model. It can be applied to datasets containing numerous multidimensional variables to identify high-dimensional, non-linear relationships between clinical features for the purpose of data-driven outcome prediction. This approach overcomes certain limitations of current risk prediction analysis methods. Machine learning models built on medical big data from electronic medical records can support doctors in clinical diagnosis and management.
Cytopenia continues to be a significant concern in many resource-limited countries. The severity of cytopenia and its associated factors can impact the effectiveness of highly active antiretroviral therapy (HAART). However, this issue has not received enough attention in many developing countries. Most reports on the prevalence and associated factors of cytopenias come from regions with high AIDS prevalence and from developed countries. These data may differ considerably from those of other regions in terms of patient characteristics, cytopenic status, and HAART (17)(18)(19)(20). To develop more appropriate treatment plans for HIV patients, it is essential to understand the profile of cytopenias and the relevant factors (21). However, there have been few reports on cytopenias among HIV patients in China. The main aim of the present study was therefore to construct a predictive model that accurately predicts whether cytopenia will occur during hospitalization in people with HIV. Gaining insight into the risk factors that contribute to cytopenia in patients with HIV and developing an accurate predictive model for cytopenia could facilitate early intervention and prevent its progression in this patient population. For clinicians, the model could be used to screen HIV patients who may experience cytopenia in the future, and thus to choose a more appropriate treatment approach.

Data collection
This study was carried out at Guangxi Chest Hospital, which is located in Liuzhou, Guangxi, and is the regional designated hospital for the treatment of serious infectious diseases. The study was carried out between June 2016 and October 2021 and enrolled a total of 6,220 hospitalized HIV-infected patients. HIV patients with cytopenia were identified through the hospital's electronic medical record system. People with HIV who did not have cytopenia on admission were included as study participants. The diagnostic criteria for anemia followed those of the World Health Organization (WHO). Anemia was defined as a hemoglobin level < 110 g/L (women) or < 120 g/L (men), and was graded as severe (hemoglobin < 60 g/L), moderate (hemoglobin 60-89 g/L) or mild (hemoglobin 90-119 g/L for men or 90-109 g/L for women). Unlike anemia, leukopenia and thrombocytopenia do not have universally accepted cut-off values, so we defined them using criteria that have been used in other studies (7,22). The criterion for leukopenia was a total leukocyte count < 4.0 × 10³/μL. A platelet count < 150 × 10³/μL was considered thrombocytopenia. The classification criteria for mild, moderate and severe thrombocytopenia were 100-150 × 10³/μL, 50-100 × 10³/μL and less than 50 × 10³/μL, respectively, following Gunda et al. (9). If a patient had multiple hospital admissions, the data from the most recent admission were included as a priority. The results of laboratory tests on the first blood sample collected after admission were included in the study. Observation stopped when the patient was discharged from hospital or died during hospitalization. Patients younger than 18 years, patients who had received radiation therapy within the past 45 days, and pregnant women were excluded from the study, because the underlying conditions of these patients may themselves induce or exacerbate cytopenias.
All of the patients were confirmed to be HIV-positive by enzyme-linked immunosorbent assay and immunoblot detection laboratory tests, and the diagnosis was consistent with national HIV diagnostic criteria.
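As an illustration only, the cut-offs given above can be written as a small set of rules. The function names are hypothetical and not taken from the study. Two points are assumptions of this sketch: boundary platelet values (exactly 50 or 100 × 10³/μL) are assigned to the milder grade because the stated ranges overlap, and a patient is counted as cytopenic if any one of the three conditions is present.

```python
from typing import Optional


def anemia_grade(hgb_g_per_l: float, sex: str) -> Optional[str]:
    """Grade anemia from hemoglobin (g/L); returns None when no anemia."""
    cutoff = 120.0 if sex == "male" else 110.0  # study cut-offs for men / women
    if hgb_g_per_l >= cutoff:
        return None
    if hgb_g_per_l < 60:
        return "severe"
    if hgb_g_per_l < 90:
        return "moderate"
    return "mild"


def thrombocytopenia_grade(platelets: float) -> Optional[str]:
    """Grade thrombocytopenia from platelet count (10^3/uL)."""
    if platelets >= 150:
        return None
    if platelets < 50:
        return "severe"
    if platelets < 100:  # assumption: exactly 100 counts as mild
        return "moderate"
    return "mild"


def has_leukopenia(leukocytes: float) -> bool:
    """Leukopenia: total leukocytes below 4.0 x 10^3/uL."""
    return leukocytes < 4.0


def has_cytopenia(hgb: float, sex: str, leukocytes: float, platelets: float) -> bool:
    """Assumption: cytopenia means at least one of the three conditions."""
    return (
        anemia_grade(hgb, sex) is not None
        or has_leukopenia(leukocytes)
        or thrombocytopenia_grade(platelets) is not None
    )
```

For example, a man with hemoglobin 115 g/L is graded as mildly anemic, while a woman with the same value is not anemic under these cut-offs.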

Data preprocessing
We extracted sociodemographic and clinical information, as well as blood examination records, from the electronic medical record system of Guangxi Chest Hospital to construct a structured dataset for the study participants. The structured dataset included 66 variables: 13 clinical comorbidity/co-infection variables (tuberculosis, pneumocystis, candida infection, cryptococcus, herpesvirus, cytomegalovirus, pneumonia, electrolyte disorders, hepatitis (B or C), hypoproteinemia, diabetes, hypertension, and cancer); 6 demographic indicators (gender, age, ethnicity, marital status, actual days in hospital and residence); and 47 laboratory indicators (CD8+ T cell count, CD4+ T cell count and levels of ALP, ALT, AST, CEA, etc.). Variables with more than 15% missing data were excluded. Missing values in the structured dataset were imputed using random forest (23). All of the above data processing steps were performed with the numpy, pandas and sklearn packages in Python 3.8.6.
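The two preprocessing steps could look roughly like the sketch below. The study does not state which random forest imputation routine was used, so the use of sklearn's `IterativeImputer` with a `RandomForestRegressor` estimator is an assumption, as are the function name `drop_and_impute`, the synthetic columns, and the restriction to numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def drop_and_impute(df: pd.DataFrame, max_missing: float = 0.15, seed: int = 0) -> pd.DataFrame:
    """Drop columns with more than `max_missing` missing values, then impute the rest."""
    kept = df.loc[:, df.isna().mean() <= max_missing]
    # Assumption: random-forest-based iterative imputation on numeric columns only.
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=seed),
        max_iter=5,
        random_state=seed,
    )
    filled = imputer.fit_transform(kept)
    return pd.DataFrame(filled, columns=kept.columns, index=kept.index)


# Synthetic stand-in data: column "a" has 5% missing, "c" has 50% missing.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "c": rng.normal(size=100),
})
demo.loc[:4, "a"] = np.nan
demo.loc[:49, "c"] = np.nan

clean = drop_and_impute(demo)  # "c" is dropped, gaps in "a" are filled
```

Categorical variables would need encoding before this step; that detail is not described in the text and is left out here.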

Model construction and evaluation
Whether cytopenia had occurred in HIV patients by hospital discharge was used as the outcome of the prediction model. The data were divided randomly into two datasets using Scikit-learn, a Python package (24). 80% of the data were used for training the machine learning models and adjusting their parameters. 20% of the data were used to test the models and fine-tune the hyperparameters. We used five machine learning classifiers (artificial neural network (ANN), adaptive boosting (AdaBoost), k-nearest neighbour (KNN), support vector machine (SVM) and decision tree (DT)) to create five models for predicting the outcome. All five classification models were constructed with the sklearn package in Python 3.8.6.
The predictive ability of the prediction models was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC), specificity, accuracy, sensitivity, and F1 score. Each evaluation indicator ranges from 0 to 1, corresponding to the worst and best scores, respectively. Using these metrics together allowed a more comprehensive evaluation and comparison of the classification effectiveness of the different machine learning methods. The prediction model with the best performance evaluation indicators was selected as the final model. To explain the output of the best-performing predictive model, we used Shapley additive explanations (SHAP) to calculate the contribution of each feature to the predicted outcome (25).
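A minimal sketch of the 80/20 split and the five classifiers is given below. It runs on synthetic data from `make_classification`, not on the study data, and every hyperparameter shown (hidden layer size, iteration limit, default settings elsewhere) is an assumption, since the tuned values are not reported in the text. Feature scaling for the ANN, KNN and SVM is likewise an assumption of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the study dataset (binary outcome, 12 features).
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=0)

# Random 80/20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with standardization.
models = {
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "DT": DecisionTreeClassifier(random_state=0),
}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
test_accuracy = {name: model.score(X_test, y_test) for name, model in fitted.items()}
```

Fixing `random_state` makes the split and the stochastic models reproducible, which helps when comparing classifiers on the same held-out set.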
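For reference, the evaluation indicators named above can be computed directly from the confusion matrix and from pairwise score comparisons. The sketch below is pure Python and uses made-up labels and scores; the function names are hypothetical, and it assumes both classes are present so that no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F1 for binary labels (1 = cytopenia)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc_score(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: eight patients, predicted risk thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so sensitivity, specificity, accuracy and F1 all equal 0.75, and the AUC is 15/16 = 0.9375.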

Statistical analysis
The analysis of data was conducted with Python version 3.8.6 and SPSS version 24 statistical software package (SPSS Inc., Chicago, IL). The descriptive statistics such as percentage, mean, median, IQR, and standard deviation were used as appropriate. Student's t-test was used to compare normally distributed continuous variables, while the Mann-Whitney U test was used for non-normally distributed continuous variables. The chi-square test was utilized to compare categorical variables. The tests were two-sided, and statistical significance was defined as p values less than 0.05.
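The group comparisons described above could be reproduced in Python along the following lines. This is a sketch under stated assumptions: the text does not say how normality was assessed, so the Shapiro-Wilk check is an assumption, and the function names and the synthetic numbers are purely illustrative.

```python
import numpy as np
from scipy import stats


def compare_continuous(a, b, alpha=0.05):
    """Return (test_name, two-sided p value) for two independent samples."""
    # Assumption: Shapiro-Wilk decides whether both groups look normally distributed.
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    if normal:
        return "t-test", stats.ttest_ind(a, b)[1]
    return "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided")[1]


def compare_categorical(table):
    """Chi-square test on a contingency table; returns (p value, degrees of freedom)."""
    _, p_value, dof, _ = stats.chi2_contingency(table)
    return p_value, dof


# Synthetic illustration: a continuous marker in two groups, and a 2x2 table.
rng = np.random.default_rng(0)
group_a = rng.normal(200, 50, 80)
group_b = rng.normal(260, 50, 80)

test_name, p_continuous = compare_continuous(group_a, group_b)
p_categorical, dof = compare_categorical([[30, 70], [50, 50]])
```

A p value below 0.05 from either function would be read as a statistically significant difference between groups, matching the threshold stated in the text.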

Ethical statement
The Human Research Ethics Committee of Guangxi Medical University (ethical approval number: 20210172) and the Medical Ethics Committee of Chest Hospital (ethical approval number: 2022-011) approved this study. Informed consent was waived after review by the Chest Hospital Medical Ethics Committee. Patient information was de-identified, and confidentiality was maintained throughout the study.

Sociodemographic characteristics of study participants
In this research, the prevalence of cytopenia in hospitalized HIV patients was 19.3% (1,201/6,220). The study included 2,187 eligible people living with HIV. Figure 1 shows the selection process for the patients included in the present study. The study participants had a median age of 56 years (interquartile range (IQR): 45-66 years). The median length of hospital stay was 21 days (IQR: 12-33 days). Among the 2,187 study participants, 1,686 (77.1%) were male, 1,673 (76.5%) were from rural areas and 1,296 (59.3%) were married. Over half (55.0%) of the sample were from ethnic minority groups, with the Zhuang ethnic group comprising the majority (48.0%). The cytopenia and non-cytopenia groups differed significantly in demographic characteristics, including gender, ethnicity, marital status, and residence address (p < 0.05). The essential features of the study participants are listed in Table 1.
We compared the median levels of important indicators between the cytopenic and non-cytopenic groups of patients with HIV. The cytopenic group had lower CD4+ T cell counts, CD45+ T cell counts and CD3+ T cell counts, and lower levels of cholinesterase (CHE), creatinine clearance (Ccr), prealbumin (PA) and total cholesterol (CHOL). Some other laboratory indicators of interest also differed significantly, such as serum cystatin (Cys-C), triglycerides (TG) and chloride (Cl). Detailed characteristics of the laboratory indicators are shown in Table 3.
Figure 1. Flow diagram of the selection of participants included in the present study.

Feature selection, model construction and evaluation
Feature filtering of the data was performed with the sklearn and pandas packages in Python 3.8.6. We used recursive feature elimination (RFE) with random forest to select input features for a predictive model aimed at predicting the occurrence of cytopenia in HIV patients during hospitalization. Finally, 12 of the 66 variables were selected as predictors of the risk of cytopenia in HIV patients. Among the 12 included indicators, 9 were laboratory examination indicators: CD4+ T cell count, serum cystatin (Cys-C), standard bicarbonate (HCO3std), low-density lipoprotein cholesterol (LDL-C), creatinine clearance (Ccr), chloride (Cl), glutamyltransferase (GGT), monocyte-to-lymphocyte ratio (Mono/Lymph) and hemoglobin-to-RDW ratio (HGB/RDW). The remaining 3 were clinical comorbidities/co-infections: electrolyte disturbances, hypoproteinemia and cancer.
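A feature selection step of this kind can be sketched with sklearn's `RFE` wrapped around a random forest. The example uses synthetic data with 66 columns and keeps 12 of them to mirror the counts in the text; the number of trees, the `step` size and the random seed are assumptions, as the study does not report them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 66 candidate features, of which 12 are informative.
X, y = make_classification(
    n_samples=300, n_features=66, n_informative=12, random_state=0
)

# RFE repeatedly fits the forest and drops the least important features
# (3 per round here, an assumption) until 12 remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=12,
    step=3,
)
selector.fit(X, y)

# Column indices of the retained features.
selected = [i for i, keep in enumerate(selector.support_) if keep]
```

With a pandas DataFrame, `selector.support_` can be used as a boolean mask on the column index to recover the names of the retained variables.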
The prediction models for the development of cytopenia in HIV patients during the hospitalization were constructed based on 12 features from the feature selection results. Table 4 displayed the prediction performance of the prediction models generated by the five machine learning algorithms. The ANN model demonstrated the highest sensitivity and specificity and therefore exhibited superior predictive power compared to other models. Figure 2 showed the ROC curves for the five models, with the ANN model displaying the most favorable results.

Explanation of risk factor
To better understand how the features integrated into the ANN prediction model contribute to the prediction results, we computed the SHAP values for each individual feature. The ANN prediction model generates a predicted value for each sample. The SHAP value is a numerical score assigned to each feature in a given sample, indicating how strongly that feature affects the outcome and whether its influence is positive or negative. The importance matrix diagram for the ANN model is shown in Figure 3.
Figure 2. Evaluation of five machine learning algorithms based on the AUC of ROC curves.
Frontiers in Public Health 08 frontiersin.org
The importance matrix ranks the features that affect cytopenias in hospitalized HIV patients, from most to least important. The importance matrix ranking results for the ANN prediction model were hypoproteinemia, HGB/RDW, cancer, LDL-C, CD4+ T cell count, electrolyte disturbance, Cl, Ccr, HCO3std, Mono/Lymph, GGT and Cys-C. The SHAP summary plot shows how each variable affected the predicted outcome of the occurrence of cytopenia in hospitalized HIV patients (Figure 4). Each patient is represented by a point, and features are color-coded by attribute value, with red indicating higher values and blue indicating lower values. According to the SHAP summary plot, hypoproteinemia and cancer were the most significant features. In hospitalized patients with HIV, these two features were strongly and positively correlated with cytopenia. HIV patients presenting with these two clinical comorbidities were at significantly increased risk of developing cytopenia during hospitalization compared to HIV patients without them. HGB/RDW, LDL-C, CD4+ T cell count, Ccr, HCO3std and Mono/Lymph also had a significant effect on the occurrence of cytopenia in hospitalized patients with HIV: the risk of cytopenia increased as the values of these features decreased. In contrast, the higher the values of Cl, GGT and Cys-C, the greater the risk of cytopenia. Hospitalized HIV patients with electrolyte disorders were more likely to develop cytopenia.
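To make the idea of a per-feature contribution concrete, the sketch below computes exact Shapley values by enumerating all feature subsets for a toy three-feature model. It is not the procedure used in the study, which is not described beyond the SHAP reference. As a simplification, "absent" features are replaced by a single baseline value rather than averaged over a background dataset, and the function name, the toy linear model and its weights are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; cost grows as 2^n, so only for small n."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's values, the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                members = set(coalition)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


def toy_model(z):
    """Hypothetical linear risk score with made-up weights."""
    return 0.5 * z[0] - 0.2 * z[1] + 0.1 * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

For a linear model each contribution reduces to the weight times the distance from the baseline, so here the values are 0.5, -0.4 and 0.3, and they sum to the difference between the prediction for `x` and the prediction for the baseline. A positive value pushes the predicted risk up and a negative value pushes it down, which is how the signs in a SHAP summary plot are read.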

Discussion
This study conducted a retrospective analysis on a large sample and identified Candida infection, hypoproteinemia, tuberculosis and pneumonia as the most frequent complications among hospitalized patients with HIV. This finding was similar to previous reports (26). The current study employed machine learning techniques and clinical features readily obtainable from electronic medical records to develop a predictive model for cytopenia risk in HIV patients during hospitalization. We evaluated and compared the predictive capabilities of five distinct machine learning models. The results showed that the ANN model had the highest sensitivity and specificity. ANN models are widely used in clinical detection and pathology identification because of their good performance in recognition tasks (27,28). In comparison to other machine learning models, ANNs are able to process non-linear relationships effectively, which is important in many real-world problems (29). ANNs consist of multiple neurons and layers that enable them to learn and represent very complex relationships and give them better capabilities for extracting implicit patterns and features from data. The hidden layer structure of ANNs enables them to capture and represent complex relationships between input features, and thus to adapt better to different types of data (30). As far as the authors know, the current research is the first published study to use ANN models to predict the occurrence of cytopenia in HIV patients during hospitalization.
The combination of electronic medical records and machine learning has contributed to the development of complex prediction models (31,32). To enhance the transparency of the model's prediction process, we employed the SHAP method to compute the contribution of individual variables to the model's predicted outcome. The results showed that hypoproteinemia and cancer were important factors influencing the occurrence of cytopenia during hospitalization of HIV patients. We also identified HGB/RDW, LDL-C, CD4+ T cell count and Ccr as the variables with a greater impact.
Previous studies have demonstrated that hypoproteinemia is a potential predictor of disease progression and mortality among individuals with HIV (33). A cohort study in West Africa that investigated the nutritional status of HIV patients who received HAART for 1 year reported that low albumin was associated with anemia (34). Similarly, serum albumin levels have been reported to be independently associated with severe anemia and could influence mortality and the outcome of HAART in HIV patients (35). There is growing evidence that hypoproteinemia has a marked impact on cytopenia in HIV patients, particularly anemia. There are two possible reasons why people with cancer are more likely to develop anemia: cancer impairs the production of red blood cells and shortens their survival time (36). Furthermore, anti-cancer treatments may harm healthy blood cells. Our study found that hypoproteinemia and cancer were significant factors contributing to cytopenia in HIV patients during hospitalization.
After analyzing all features included in the model, we found that HGB/RDW, LDL-C, CD4+ T cell count, and Ccr had a great impact on predicting the risk of cytopenia during hospitalization in HIV patients. Specifically, lower levels of HGB/RDW, LDL-C, CD4+ T cell count, and Ccr were associated with an increased risk of cytopenia. HGB/RDW, a new composite biomarker, has gradually attracted widespread attention. Lower HGB/RDW levels have been shown to be associated with cancer development and poor prognosis (37,38). In the present study, HIV patients with lower levels of HGB/RDW had a higher risk of cytopenia during hospitalization. A lower HGB/RDW may reflect abnormal erythrocyte homeostasis and deformed erythrocytes, leading to disturbed blood flow in the microcirculation (39), which may have contributed to the increased susceptibility of people living with HIV to cytopenias during hospitalization. Hemoglobin and RDW are easily accessible laboratory examination indicators, but HGB/RDW is rarely a focus during HIV treatment. The results of the present study showed that HGB/RDW is strongly associated with the development of cytopenias in people living with HIV and deserves greater attention.
Low LDL-C is often associated with a long-term vegetarian diet (40), liver disease (41) and drug therapy (42). Low LDL-C has also been reported to be associated with chronic anemia (43). However, LDL-C has not been of particular concern in previous studies of risk factors associated with cytopenia in HIV patients. Although the mechanism by which lower LDL-C leads to cytopenia is not clear, possible explanations include erythrocyte fragility due to low cholesterol levels in the erythrocyte membrane (44), as well as LDL-related platelet activation and tissue factor expression (45). Further research is needed to establish this mechanism.
As with previous studies, our research found that low CD4+ T cell counts are a risk factor for cytopenia in HIV patients. CD4+ T cell counts are closely correlated with HIV disease progression, and lower counts are typically indicative of advanced disease (46). The primary explanation for the cytopenia that accompanies low CD4+ T cell counts in HIV patients is likely HIV-mediated hematopoietic suppression and direct T cell infection (10). Moreover, research has shown that improved CD4+ T cell counts after HAART treatment lead to a reduction in the prevalence of cytopenia in HIV patients (47-49), indicating that HIV-related cytopenia is caused by HIV infection and immunosuppression (50).
Ccr is a sensitive marker of glomerular damage and an early indicator of kidney impairment. A lower Ccr is associated with chronic kidney disease, and the common complications of chronic kidney disease include anemia (51). It has also been reported that high serum creatinine is a significant predictor of anemia in HIV patients (52). GGT is an important indicator of liver function, and an increased GGT level indicates impaired liver function, which can in turn cause cytopenia (53). Both Ccr and GGT reflect the organ function of HIV patients and their potential risk of cytopenia, but they have not been a focus of previous studies. The levels of Cl, HCO3std and electrolytes provide valuable information about the body's metabolism, and their disturbances may indicate metabolic issues in HIV patients who are at a high risk of developing cytopenia. Mono/Lymph has been shown to be a predictor of the risk of developing tuberculosis in people living with HIV (54), and tuberculosis is one of the factors associated with the development of cytopenia in people living with HIV (9). Although Cys-C levels may be considered clinically insignificant and are often overlooked, they are still important for predictive modeling purposes.
The present study used real-world data from electronic medical records, covering multiple clinical complications and clinical variables, to construct an ANN model for predicting the risk of cytopenia in HIV patients during hospitalization. We identified some risk factors associated with cytopenia in HIV patients that have not been a focus of previous studies. The predictive model can serve as a clinical screening tool to assess the risk of cytopenia in HIV patients during hospitalization, thus facilitating the development of more personalized and rational treatment plans. However, there were certain limitations in our study. Firstly, our study sample was predominantly limited to southern China and is thus not representative of the overall situation of individuals living with HIV throughout China. Secondly, the potential influence of medications and treatment regimens on the study outcomes was not taken into account during data collection. Thirdly, the stability of the prediction model was not validated using external data. Nevertheless, our model has been internally validated and demonstrates consistent and robust predictive ability for the outcome explored.

Conclusion
To sum up, this study utilized electronic medical records to gather demographic information, clinical complications, and laboratory test indicators of HIV patients. These clinical characteristics were then used to construct a predictive model to assess the risk of cytopenia in HIV patients. The predictive model has significant implications for improving the management of HIV patients and tailoring personalized treatment plans.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The Human Research Ethics Committee of Guangxi Medical University (ethical approval number: 20210172) and the Medical Ethics Committee of Chest Hospital (ethical approval number: 2022-011) approved this study. Informed consent was waived after review by the Chest Hospital Medical Ethics Committee. Patient information was de-identified, and confidentiality was maintained throughout the study. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

Author contributions
YH, JL, LH, JQ, and KZ designed the study and provided the correlative knowledge. YX, LS, YL, YJL, ZM, and KH collected and provided the data. LL, XW, and BX extracted data and cleaned data. BX and LL constructed the prediction model. KZ, LS, YL, and BX generated the figures and tables. YX, YH, HQ, XP, and BX wrote and edited the manuscript. All authors contributed to the article and approved the submitted version.

Funding