Development and application of survey-based artificial intelligence for clinical decision support in managing infectious diseases: A pilot study on a hospital in central Vietnam

Introduction In this study, we developed a simplified artificial intelligence to support the clinical decision-making of medical personnel in a resource-limited setting. Methods We selected seven infectious disease categories that impose a heavy disease burden in the central Vietnam region: mosquito-borne disease, acute gastroenteritis, respiratory tract infection, pulmonary tuberculosis, sepsis, central nervous system infection, and viral hepatitis. We developed a set of questionnaires to collect information on the current symptoms and history of patients suspected to have infectious diseases. We used data collected from 1,129 patients to develop and test a diagnostic model. We used the XGBoost, LightGBM, and CatBoost algorithms to build artificial intelligence models for clinical decision support. We validated the models with 4-fold cross-validation, then tested them on a separate test dataset and estimated the diagnostic accuracy of each model. Results We included 1,129 patients in the final analyses. The artificial intelligence developed with the CatBoost algorithm showed the best performance, with 87.61% accuracy and an F1-score of 87.71. The F1-score of the CatBoost model by disease entity ranged from 0.80 to 0.97. Diagnostic accuracy was the lowest for sepsis and the highest for central nervous system infection. Conclusion Simplified artificial intelligence could be helpful in clinical decision support in settings with limited resources.


Introduction
Although the burden of infectious diseases has been successfully reduced in some areas, new infectious diseases have emerged dramatically, and infectious diseases remain a significant public health challenge (1). Southeast Asia is one of the 'hot spots' for infectious diseases, and it has experienced a rapid surge of both established and emerging infectious diseases. This rapid increase in disease burden has multiple causes, including environmental factors (2,3), changes in biodiversity (4), and economic factors (5). Several infectious diseases such as dengue fever (6), malaria (7), and central nervous system infection (8) are imposing a heavy disease burden on Southeast Asia as their prevalence grows.
Artificial intelligence (AI) is increasingly being used in medical practice. In particular, AI has been shown to assist medical decision-making. For example, AIs for diagnosing respiratory diseases (9,10), cardiovascular diseases (11), and infectious diseases (12,13) have been developed and used for diagnostic assistance. For infectious disease management specifically, AIs have been developed to support clinical decision-making for emerging infectious diseases, including tuberculosis (14,15), vector-borne diseases (16,17), and COVID-19 (18,19). Such models have achieved high diagnostic accuracy, demonstrating their effectiveness in medical decision-making for infectious diseases (16,20,21).
Previously, regression-based classifiers were widely used to develop prediction models for clinical decision-making, as they are highly intuitive and are often among the models with the highest predictability for dichotomous outcomes (22-24). However, to apply a logistic regression model, the data must conform to statistical assumptions such as the absence of multicollinearity among independent variables and the independence of observations (25). Machine learning classifiers can avoid this issue and can be applied to a wider range of unstructured datasets, and they have therefore been used to develop and improve artificial intelligence models for disease (20,26). A number of machine learning methods, including artificial neural networks (27), XGBoost (28), and support vector machine methods (29,30), are being used for diagnostic assistance and clinical decision-making.
To date, however, little has been documented on applying AI for public health in resource-limited settings of low- and middle-income countries (LMICs) (31). Although there are vigorous activities on developing and using AI for public health in resource-limited settings (32-34), its application is still at an elementary level. Several difficulties in the application of AIs, such as challenges in building and maintaining expert systems (35), limitations in IT infrastructure (36), differences in socioeconomic contexts (36,37), and a lack of personnel to supervise development and application (38), hinder the effective use of AIs for public health in the resource-limited settings of LMICs. To increase the availability and accessibility of AI in LMICs, AI needs to be tailored to resource-limited settings and fit for local sociomedical contexts and infrastructure needs (31,39).
The objective of this research was to develop a simplified version of AI that could be applied effectively in resource-limited settings. We conducted a pilot study in collaboration with local authorities and healthcare institutions in Da Nang, Vietnam. We aimed to develop an AI for diagnosing infectious diseases that impose a heavy disease burden in the central region of Vietnam.

Developing questionnaire
We selected seven infectious disease categories that were most common in central Vietnam or imposed a heavy disease burden on central Vietnam: mosquito-borne diseases, acute gastroenteritis, respiratory tract infection including coronavirus disease (COVID-19), pulmonary tuberculosis, sepsis, central nervous system (CNS) infection, and viral hepatitis. Disease entities were classified according to ICD-10 diagnostic codes (Supplemental material 1). We used the Delphi method to develop questionnaires for patient assessment and history taking. A preventive medicine specialist initially created the questionnaire in English. It was then reviewed by a Korean preventive medicine specialist, a Korean public health specialist, two Vietnamese internal medicine specialists, a Vietnamese public health specialist, and an English-Vietnamese interpreter. After review and feedback, a professional translator translated the questionnaire set into Vietnamese. We have attached the final questionnaire as Supplemental material 2. Before data collection, we conducted two rounds of pilot testing to receive feedback and edit the questionnaires: once with 5 Vietnamese individuals residing in Korea and once with 10 Vietnamese individuals recruited from Da Nang.
After the pilot study, a Korean preventive medicine specialist developed an instruction manual for research personnel, which was used to train physicians and nurses at the Department of Tropical Medicine, Da Nang Hospital, before the survey was conducted. In addition, a survey application was developed and installed on the tablet PCs used for data collection.

Study participants
We recruited patients who had been diagnosed with any of the target disease entities. About 160 participants were recruited for each disease category to ensure the model's predictive power. Because outpatient visits decreased during the COVID-19 outbreak in Vietnam, we used a two-track recruitment strategy. Patients admitted to the Department of Tropical Medicine, Da Nang Hospital, and to Da Nang Lung Hospital from November 8th, 2021, to January 1st, 2022, were recruited and responded to the survey. Simultaneously, research personnel conducted a phone survey of patients who had been diagnosed with target diseases. In total, we recruited 1,131 patients, either prospectively or retrospectively. Two participants who were diagnosed with other conditions were excluded from the final analysis.

Data collection and management
All participants then underwent an interview based on the developed questionnaire and a physical examination. Data were collected using the application on a tablet PC and transmitted directly to the server. We measured systolic and diastolic blood pressure with a standardized sphygmomanometer after 5 min of rest. In addition, other vital signs such as pulse rate, respiratory rate, and body temperature were measured by physicians and nurses at the Department of Tropical Medicine, Da Nang Hospital. For retrospectively recruited participants, we collected data from the electronic medical records (EMRs) of Da Nang Hospital and conducted an additional telephone survey. After the survey, the physician gave each participant a final diagnosis, which was recorded with its corresponding ICD-10 code. Based on the ICD-10 codes and diagnoses, participants were categorized into seven disease entity subgroups.

Development of artificial intelligence (AI) for disease classification
Survey results were processed into a dataset with 211 independent variables. We used three different algorithms (XGBoost, LightGBM, CatBoost) to develop AI for prediction and compared the predictive accuracy of the three models. XGBoost, one of the Gradient Boosting Models (GBM), is an ensemble model of decision trees (40). By implementing parallel processing and CART (Classification and Regression Tree) model-based regression, XGBoost runs considerably faster than earlier gradient boosting models and efficiently handles the overfitting problem of GBM (40). LightGBM uses gradient-based one-side sampling (GOSS) and Exclusive Feature Bundling (EFB) and adopts a leaf-wise tree growth algorithm, unlike the level-wise growth algorithms commonly used in previous GBMs (41). Because of leaf-wise tree growth, LightGBM produces considerably fewer false predictions, and by efficiently reducing the number of data instances and features, it trains faster and uses less memory than conventional GBMs such as XGBoost (41). However, because of these characteristics, an insufficient sample size may cause overfitting in LightGBM. CatBoost builds the base model with the residual error of an independently sampled sub-dataset. The model is continuously updated by taking the residual error on the remaining dataset, which solves the problem of prediction shift (42). In addition, CatBoost creates clusters for each category during training, efficiently reflecting categorical features in the model (42). The implementation of the ordered boosting algorithm and ordered target statistics (TS) accelerates the training process, increases predictability, and reduces the possibility of overfitting (42). Moreover, the base parameters are already optimized in CatBoost, which minimizes the need for hyperparameter tuning (42).
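The shared idea behind the three algorithms, an ensemble of trees in which each new tree is fitted to the residual error of the current ensemble, can be sketched as follows. This is a generic gradient boosting illustration with one-split trees (stumps) and squared error, not the XGBoost, LightGBM, or CatBoost implementation and not the model used in this study. All function names, the learning rate, the number of rounds, and the toy data (body temperature and a symptom flag) are hypothetical.

```python
from typing import List, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]
# (base prediction, learning rate, list of stumps)
Model = Tuple[float, float, List[Stump]]


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    best_sse = float("inf")
    best_stump: Stump = (0, 0.0, 0.0, 0.0)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if sse < best_sse:
                best_sse, best_stump = sse, (j, threshold, lv, rv)
    return best_stump


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, threshold, lv, rv = stump
    return lv if row[j] <= threshold else rv


def fit_gbm(X: List[List[float]], y: List[float],
            n_rounds: int = 20, learning_rate: float = 0.5) -> Model:
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def gbm_predict(model: Model, row: List[float]) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical toy data: [body temperature, symptom flag] -> outcome (0 or 1)
X = [[36.8, 0], [37.0, 0], [38.9, 1], [39.4, 1], [38.7, 0], [36.9, 1]]
y = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
model = fit_gbm(X, y)
print([round(gbm_predict(model, row), 3) for row in X])
```

The production libraries add regularisation, sampling strategies, and categorical handling on top of this basic residual-fitting loop.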
We pre-processed the raw dataset into a numerical dataset. For numeric variables, we set missing values to "-1" and left all other responses unchanged before including them in the final dataset. For non-numeric features, we set the value to "1" if an answer existed and to "-1" if not. We did not consider the specific content of the response, as the presence of the feature was more important in predicting results.
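The encoding rule described above can be sketched as follows. This is a minimal illustration rather than the study's actual pipeline: the function names, the field names, and the sample record are hypothetical, and treating `None` and blank strings as "missing" is an assumption.

```python
from typing import Dict, Optional, Set, Union

Raw = Optional[Union[int, float, str]]
MISSING = -1.0  # sentinel for missing values, as described in the text


def encode_response(value: Raw, is_numeric: bool) -> float:
    """Encode one survey answer.

    Numeric features keep their value, or become -1 if missing.
    Non-numeric features become 1 if any answer exists, -1 otherwise.
    """
    # Assumption: None and blank strings count as missing answers.
    is_missing = value is None or (isinstance(value, str) and value.strip() == "")
    if is_missing:
        return MISSING
    if is_numeric:
        return float(value)
    return 1.0  # only the presence of the answer is encoded, not its content


def encode_record(record: Dict[str, Raw], numeric_fields: Set[str]) -> Dict[str, float]:
    return {name: encode_response(value, name in numeric_fields)
            for name, value in record.items()}


# Hypothetical record for one participant
record = {"body_temperature": 38.5, "pulse_rate": None,
          "headache": "yes, 3 days", "cough": ""}
encoded = encode_record(record, numeric_fields={"body_temperature", "pulse_rate"})
print(encoded)
```

One consequence of this scheme is that "-1" must never be a legitimate numeric answer, which holds for vital signs and durations.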
The preprocessed dataset was divided into a training set, validation set, and test set. First, we divided the preprocessed data into a training set and a test set at a ratio of 9:1. Although there is no firm theoretical basis for an optimal training/test split ratio, a recent study suggested that the optimal training/test ratio with p features is approximately √p:1 (43). Since our study has 211 features, the optimal ratio would have been ∼14.53:1, but we followed the precedent of previous studies on AI development for clinical purposes (44-46). We chose the 9:1 split to secure the number of cases used for training and cross-validation. We then separated the training set into four groups for 4-fold cross-validation, using each group in turn for validation and the rest for training (Figure 1). Finally, we tuned selected hyperparameters for each algorithm to optimize model performance. The optimal value of each hyperparameter is shown in Supplemental material 3.
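The splitting scheme above can be sketched with index lists only. This is an illustration under stated assumptions: the text does not say whether the split was stratified by disease category, so a plain random shuffle is used here, and the function names and the seed are hypothetical. The last line computes the √p:1 ratio for p = 211 features.

```python
import math
import random
from typing import List, Tuple


def train_test_split_indices(n: int, test_fraction: float,
                             seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and hold out a fraction as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices: List[int], k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) pairs; each fold is used once for validation."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


n_patients = 1129
train_idx, test_idx = train_test_split_indices(n_patients, test_fraction=0.1, seed=0)
splits = k_fold(train_idx, k=4)

print(len(train_idx), len(test_idx))          # sizes of the 9:1 split
print([len(val) for _, val in splits])        # validation fold sizes
print(round(math.sqrt(211), 2))               # sqrt(p):1 ratio for p = 211
```

With a multi-class outcome such as the seven disease categories, a stratified variant that preserves class proportions in every fold would be a natural refinement.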
Feature importance, which is defined as 'the increase in the model's prediction error after permuting the feature', was calculated for each variable included in the prediction model (47,48). Global diagnostic accuracy and F1-score were measured to evaluate the developed prediction model. Global diagnostic accuracy was defined as the proportion of correct classifications (Equation 1). Precision is defined as the proportion of true positives among samples classified as positive (Equation 2). Recall, or sensitivity, is defined as the proportion of true positives among positive samples (Equation 3). The F1-score, which is the harmonic mean of precision and recall, summarizes the performance of the developed model (Equation 4). After the global test of AI performance, we calculated performance parameters (precision, recall, specificity, and F1-score) by disease category.
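The definition of feature importance quoted above can be sketched as follows: one feature column is shuffled, the model is re-evaluated, and the increase in error rate is recorded. This is a minimal illustration, not the implementation used in the study. The function names, the number of repeats, the seed, and the toy model (which looks only at a hypothetical headache flag) are assumptions.

```python
import random
from typing import Callable, List


def error_rate(predict: Callable[[List[float]], int],
               X: List[List[float]], y: List[int]) -> float:
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict: Callable[[List[float]], int],
                           X: List[List[float]], y: List[int],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Mean increase in error rate after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error_rate(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model: predicts class 1 only from the headache flag (feature 0)
def toy_predict(row: List[float]) -> int:
    return 1 if row[0] == 1 else 0


# Columns: [headache flag (1 / -1), unrelated numeric feature]
X = [[1, 5], [1, 2], [-1, 7], [-1, 1], [1, 3], [-1, 9], [1, 4], [-1, 6]]
y = [1, 1, 0, 0, 1, 0, 1, 0]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

A feature the model never uses, like the second column here, gets an importance of zero because shuffling it cannot change any prediction.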
A multi-comparison table was constructed for each classifier to further compare model performance.
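The performance measures described above can be computed from a table of true versus predicted categories. The sketch below assumes that the multi-comparison table is a confusion matrix (rows are true categories, columns are predicted categories) and uses the standard definitions: accuracy = correct / total, precision = TP / (TP + FP), recall = TP / (TP + FN), specificity = TN / (TN + FP), and F1 = 2 × precision × recall / (precision + recall). The function names and the toy labels are hypothetical and do not reproduce the study's results.

```python
from typing import Dict, List


def confusion_matrix(y_true: List[str], y_pred: List[str],
                     labels: List[str]) -> List[List[int]]:
    """Rows are true categories, columns are predicted categories."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_class_metrics(matrix: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """One-vs-rest precision, recall, specificity, and F1 for each category."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        result[label] = {"precision": precision, "recall": recall,
                         "specificity": specificity, "f1": f1}
    return result


# Hypothetical toy predictions for three of the disease categories
labels = ["sepsis", "cns", "hepatitis"]
y_true = ["sepsis", "sepsis", "sepsis", "cns", "cns",
          "hepatitis", "hepatitis", "hepatitis"]
y_pred = ["sepsis", "hepatitis", "sepsis", "cns", "cns",
          "hepatitis", "sepsis", "hepatitis"]

matrix = confusion_matrix(y_true, y_pred, labels)
metrics = per_class_metrics(matrix, labels)
print(accuracy(matrix))
print(metrics["sepsis"])
```

Treating each category one-vs-rest in this way is what allows per-disease precision and recall to differ even when global accuracy is fixed.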

Ethics approval
The institutional review board of Da Nang Hospital, Da Nang, Vietnam reviewed and approved the study protocol (Supplemental material 4). In addition, we acquired informed consent from all participants of this study. All procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the 1975 Declaration of Helsinki, as revised in 2008.

Participant characteristics
A total of 1,129 participants were included in the final analysis. There were significant differences in patterns of symptoms by disease entity. In mosquito-borne disease, fever/chill, fatigue, headache, anorexia, diarrhea, and myalgia were common. Gastrointestinal tract symptoms such as diarrhea, constipation, and abdominal pain were common in acute gastroenteritis. Systemic symptoms such as fever/chill and fatigue were the most common in sepsis and viral hepatitis, with no other prominent symptoms present (Table 1).
Figure 2 presents the top 20 influential features of the global model with their feature importance values. The x-axis indicates the relative feature importance, and the y-axis shows the names of the features. Headache and its duration were the most influential variables, followed by cough, fever, and pulse rate. Feature importance of variables by disease entity is presented in Supplemental material 5.
Table 2 shows the global diagnostic accuracy and F1-score of the developed AI models. The model developed by the CatBoost algorithm showed the highest global diagnostic accuracy of 87.61%, and the model developed by the XGBoost algorithm showed the lowest global accuracy of 83.18%. We conducted 4-fold cross-validation on the model developed by the CatBoost algorithm, the model with the highest global diagnostic accuracy. Results from 4-fold cross-validation also showed that diagnostic accuracy was the highest for the CatBoost classifier: the mean diagnostic accuracies of the XGBoost, LightGBM, and CatBoost classifiers were 0.832 (standard deviation [SD] 0.019), 0.839 (SD 0.009), and 0.850 (SD 0.019), respectively. The mean F1-scores of the XGBoost, LightGBM, and CatBoost classifiers were 0.833 (SD 0.018), 0.840 (SD 0.009), and 0.850 (SD 0.019), respectively.

Performance comparison of developed artificial intelligence
The multi-comparison table for each classifier showed that the performance of the CatBoost algorithm was higher than that of XGBoost and LightGBM. False prediction rates of the XGBoost, LightGBM, and CatBoost classifiers were 16.81, 16.81, and 11.76, respectively (Table 3). The result from the multi-comparison table is consistent with Table 2, which shows higher global accuracy for the CatBoost classifier than for the other two classifiers.
Parameters on AI performance by disease category are shown in Table 4. Precision and recall were the lowest in the "sepsis" category and the highest in the "CNS infection" category. On the other hand, precision, recall, and sensitivity were generally higher in AI developed by the CatBoost algorithm, with exceptions such as the "mosquito-borne disease" category and the "CNS infection" category.

Discussion
In this pilot study, we achieved a global diagnostic accuracy of around 85% with AI developed from limited data, which included only medical histories, physical examination results, and symptom assessments. The global accuracy increases up to 90% after excluding sepsis, which is a condition that requires a complex diagnostic procedure for accurate assessment (49). The accuracy we achieved in this study is reasonably high, considering that the prediction model did not include any laboratory test results or radiologic findings. For instance, machine learning algorithms for predicting hepatitis C in patients enrolled in the National Treatment Program of HCV patients in Egypt showed an accuracy of 66-84.4%, while our CatBoost model showed an accuracy of 88% (50). The accuracy of predicting pulmonary tuberculosis was 87% in our model developed by the CatBoost algorithm, while previous models based on chest X-ray showed an area under the curve of 0.75-0.99 (51). Considering that the artificial intelligence in this study relied solely on a survey questionnaire, the diagnostic accuracy of the models developed in this study is relatively high compared to previous studies.
The global diagnostic accuracy of the AI developed in this study is relatively low compared to other AIs, which usually achieve a diagnostic accuracy of 90% or higher (12,52). However, these AIs typically rely on additional tests such as radiologic studies, laboratory tests, and pathologic results. Our study developed a survey-based AI without additional tests except for vital sign assessments and physical examination, making it applicable to resource-limited settings (53).
While the developed AI cannot replace the clinical decision-making process, it could be used for screening and initial disease evaluation where medical expertise is insufficient, which is a common condition in LMICs (53,54). However, there are significant challenges in applying AI in LMICs, mostly arising from local governance capacity and AI literacy (31,55). Using tailored AI with high cost-efficacy, in collaboration with experts in AI development and management, could help screen for and manage the diseases of interest effectively. Our study showed that simplified survey-based AI provides certain benefits in detecting and controlling infectious diseases in the central region of Vietnam. We are planning to further improve the diagnostic accuracy of the AI and to evaluate its cost and efficacy by applying it in multiple hospitals in Hue, Vietnam, and Da Nang, Vietnam.
This study is one of the few academic studies on the development and application of AI in LMICs. It is a product of international collaboration among epidemiologists, global health experts, professional programmers, and local physicians in Vietnam, which facilitated questionnaire development, data collection, and AI development. As this is a pilot study, there are several limitations. First, we only used data from a single hospital in Da Nang, Vietnam, so additional validation is required before applying the model to other regions. Second, because of the decrease in the number of patients visiting the hospital, a certain proportion of patients were recruited retrospectively via telephone survey. Although we tried to maintain data integrity by reviewing EMRs, the validity of the retrospective data might have affected the results. Finally, our AI could not accurately diagnose infectious diseases without definite clinical manifestations, such as sepsis. To correctly identify complex diseases and syndromes, we should include in-depth assessments of symptoms and clinical features in the model. To address these shortcomings, we plan to develop the assessment questionnaires further, distribute the AI to multiple collaborating hospitals and healthcare centers in Vietnam, and assess the efficacy of the AI in the collaborating institutions.

Conclusion
This study is one of the few academic studies on AIs in resource-limited settings. Our results imply that even survey-based questionnaires without laboratory or radiologic tests could be beneficial in screening for infectious diseases in LMICs. Additional studies at other collaborating institutions will further develop and validate the current model and provide epidemiologic evidence on the effectiveness of AI application in resource-limited settings.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by the institutional review board of Da Nang Hospital. The patients/participants provided their written informed consent to participate in this study.

Author contributions
Concept and design: KK, M-kL, and SK. Drafting of the manuscript: KK and SK. Statistical analysis: KK, HL, and BK. Obtained funding: M-kL and HS. Administrative, technical, or material support: M-kL, HS, HL, and SK. Supervision: M-kL and SK. All authors had full access to all the data in the study and take responsibility for the integrity of the data, the accuracy of the data analysis, the acquisition, analysis, and interpretation of data, and the critical revision of the manuscript for important intellectual content.

Funding
This study was funded by RIGHT Fund (Investment ID RF-TAA-2021-H01). However, the funding source had no role in designing and conducting the study.