Usefulness of Machine Learning for Identification of Referable Diabetic Retinopathy in a Large-Scale Population-Based Study

Purpose: To develop and validate machine learning-based classifiers that use simple non-ocular metrics to detect referable diabetic retinopathy (RDR) in a large-scale, population-based survey in China. Methods: A total of 1,418 patients with diabetes mellitus, identified among 8,952 rural residents screened in the population-based Dongguan Eye Study, were used for model development and validation. Eight algorithms [extreme gradient boosting (XGBoost), random forest, naïve Bayes, k-nearest neighbor (KNN), AdaBoost, LightGBM, artificial neural network (ANN), and logistic regression] were used to build models to detect RDR in individuals with diabetes. The area under the receiver operating characteristic curve (AUC) and its 95% confidence interval (95% CI) were estimated using five-fold cross-validation as well as an 80:20 split into training and validation sets. Results: The 10 most important features in the machine learning models were duration of diabetes, HbA1c, systolic blood pressure, triglycerides, body mass index, serum creatinine, age, educational level, duration of hypertension, and income level. Based on these top 10 variables, the XGBoost model achieved the best discriminative performance, with an AUC of 0.816 (95% CI: 0.812, 0.820). The AUCs for logistic regression, AdaBoost, naïve Bayes, and random forest were 0.766 (95% CI: 0.756, 0.776), 0.754 (95% CI: 0.744, 0.764), 0.753 (95% CI: 0.743, 0.763), and 0.705 (95% CI: 0.697, 0.713), respectively. Conclusions: A machine learning-based classifier that used 10 easily obtained non-ocular variables was able to effectively detect patients with RDR. The importance scores of the variables may provide insight into preventing RDR. Screening for RDR with machine learning offers a useful complementary tool for clinical practice in resource-poor areas with limited ophthalmic infrastructure.


INTRODUCTION
Diabetes mellitus affects 463 million adults and consumes 1.8% of global gross domestic product, posing a huge burden on healthcare systems, especially in remote, underserved areas (1). Diabetic retinopathy (DR) is a vision-threatening condition that affects 22.27% of adults with diabetes (2). As the diabetes pandemic spreads from wealthy industrialized countries to developing regions, the number of people with DR is projected to increase from 103.12 million in 2020 to 160.50 million in 2045 (2). Visual impairment and blindness due to DR can be substantially reduced if the disease is diagnosed at an early stage and treated appropriately. However, owing to the high cost and low accessibility of eye services, fewer than 70% of people with diabetes receive eye examinations at regular intervals (3,4).
The current strategy for detecting DR is based on clinical examination by an ophthalmologist or on grading of retinal photographs via telemedicine, both of which rely on highly trained staff or expensive equipment. In addition, whether the recommended screening interval can be extended has been extensively debated, because a large number of DR-negative patients undergo repeated annual fundus screening (5). In the Scottish Diabetic Retinopathy Screening programme, the demand on the DR screening service was estimated to fall by 40% if people with no visible retinopathy at two consecutive screens were screened every 2 years rather than annually (6). The National Health Service Foundation Trust reported that screening people with type 2 diabetes every 2 years, rather than annually, would reduce screening costs by 25% (7). Therefore, establishing simple, practical methods for identifying people at high risk of referable DR (RDR) from easily accessible indicators has become an important goal, as it would help to target screening and prevention (8,9).
Modeling RDR is challenging because most medical data involve non-linear relationships, non-normal distributions, and non-independent observations, so traditional regression techniques may lose information (10). Machine learning (ML) techniques offer an alternative, as they can capture non-linear relationships in the data without prior assumptions about their form. Furthermore, ML can rank variables by importance. Previous studies have shown that ML-based methods can accurately identify diabetes in the general population (11,12). However, few studies have applied ML to DR classification to date (13,14). To fill this knowledge gap, this study aimed to develop RDR classifiers based on several ML techniques using simple non-ocular indicators, and to compare them with a traditional logistic regression model to evaluate their usefulness for RDR screening in a large population-based survey.

Data Source and Participants
This study is a secondary analysis of the Dongguan Eye Study (DES), a large-scale population-based survey conducted in Guangdong Province, Southern China (15,16). The study protocol was approved by the ethics committee of Guangdong Provincial People's Hospital, and the study was performed in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants before enrolment.
The detailed methodology of the DES has been reported previously (15,16). In brief, 11,357 eligible residents of Hengli Town, Dongguan City were recruited between September 2011 and February 2012, of whom 8,952 (response rate 78.8%) completed the systemic and ophthalmic examinations. Standardized questionnaires were used to obtain data on demographics, lifestyle, socioeconomic status, quality of life, and medical and family history. Height, weight, waist and hip circumference, and blood pressure were measured using standard protocols. Fasting venous blood was collected to measure fasting blood glucose (FBG), hemoglobin A1c (HbA1c), total cholesterol (TC), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), and blood uric acid (UA). All participants with diabetes or hypertension received a comprehensive ocular examination, including visual acuity, autorefraction, slit-lamp examination, intraocular pressure measurement, and retinal photography.

Definition of the Outcome
The diagnosis of diabetes was based on medical history and endocrinologists' records, the use of insulin or oral hypoglycaemic drugs, or the latest criteria of the Chinese Guidelines for the Management of Diabetes Mellitus in the Elderly (2021 Edition): typical symptoms of diabetes mellitus (polydipsia, polyuria, polyphagia, and unexplained weight loss) plus a random plasma glucose ≥ 11.1 mmol/L; FBG ≥ 7.0 mmol/L; a 2-h plasma glucose level ≥ 11.1 mmol/L during a 75-g oral glucose tolerance test (OGTT); or HbA1c ≥ 6.5%. Re-testing on another day was performed to confirm the diagnosis.
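The diagnostic rule above is a disjunction of criteria, so it can be sketched as a small Python function. This is a minimal illustration only, assuming a hypothetical record type (`GlycaemicRecord`) whose field names are ours rather than the study's; confirmation by re-testing on another day is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlycaemicRecord:
    """Hypothetical per-participant record; field names are illustrative only."""
    known_diabetes: bool = False              # history or endocrinologist records
    on_insulin_or_oral_agents: bool = False   # current glucose-lowering treatment
    typical_symptoms: bool = False            # polydipsia, polyuria, polyphagia, weight loss
    random_glucose: Optional[float] = None    # mmol/L
    fasting_glucose: Optional[float] = None   # mmol/L
    ogtt_2h_glucose: Optional[float] = None   # mmol/L, 75-g OGTT
    hba1c: Optional[float] = None             # %


def meets_diabetes_criteria(r: GlycaemicRecord) -> bool:
    """Return True if any of the listed criteria is met (re-testing not modelled)."""
    if r.known_diabetes or r.on_insulin_or_oral_agents:
        return True
    # Random glucose counts only when accompanied by typical symptoms
    if r.typical_symptoms and r.random_glucose is not None and r.random_glucose >= 11.1:
        return True
    if r.fasting_glucose is not None and r.fasting_glucose >= 7.0:
        return True
    if r.ogtt_2h_glucose is not None and r.ogtt_2h_glucose >= 11.1:
        return True
    if r.hba1c is not None and r.hba1c >= 6.5:
        return True
    return False
```

For example, `meets_diabetes_criteria(GlycaemicRecord(fasting_glucose=7.2))` returns `True`, whereas a raised random glucose without symptoms does not qualify on its own.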
DR status was graded from fundus photographs according to the classification system of the Wisconsin Epidemiologic Study of Diabetic Retinopathy (WESDR) and the Early Treatment Diabetic Retinopathy Study (ETDRS) (17). Diabetic macular oedema (DME) was diagnosed according to the International Diabetic Macular Edema Severity Scale and defined as significant retinal thickening or hard exudates in the posterior pole. Fundus fluorescein angiography (FFA) was performed to confirm the diagnosis in participants with suspected severe non-proliferative diabetic retinopathy (NPDR), proliferative diabetic retinopathy (PDR), macular oedema, retinal vasculopathy, posterior uveitis, or other retinochoroidal diseases. DR was categorized as no DR, mild NPDR, moderate NPDR, severe NPDR, or PDR. RDR, the primary outcome of the present study, was defined as the presence of moderate NPDR, severe NPDR, PDR, or DME. RDR is clinically important because people with RDR are referred to ophthalmologists for review in diabetic retinopathy screening programmes, whereas those without RDR continue to be screened in primary care (18).
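The outcome definition maps onto a simple rule: any grade from moderate NPDR upward, or any DME, counts as referable. The sketch below uses a hypothetical `DRGrade` enumeration and additionally assumes that a person-level label is taken from the worse eye; the text does not state how the two eyes were combined.

```python
from enum import IntEnum
from typing import Tuple


class DRGrade(IntEnum):
    """Hypothetical encoding of the five DR severity levels, ordered by severity."""
    NO_DR = 0
    MILD_NPDR = 1
    MODERATE_NPDR = 2
    SEVERE_NPDR = 3
    PDR = 4


def is_referable(grade: DRGrade, has_dme: bool) -> bool:
    """Eye-level RDR: moderate NPDR or worse, or any diabetic macular oedema."""
    return grade >= DRGrade.MODERATE_NPDR or has_dme


def person_level_rdr(right: Tuple[DRGrade, bool], left: Tuple[DRGrade, bool]) -> bool:
    """Assumed worse-eye rule: a person has RDR if either eye is referable."""
    return is_referable(*right) or is_referable(*left)
```

Under this encoding, mild NPDR without DME is non-referable, while mild NPDR with DME is referable, matching the definition above.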

Demographics and anthropometry
Age, sex, marital status, number of children, occupation, education level, economic status/income, height, weight, body mass index, waist circumference, hip circumference, waist/hip ratio, systolic blood pressure, diastolic blood pressure.

Disease history and comorbidity
Duration of diabetes, duration of hypertension, duration of hyperlipidemia, history of cardiovascular disease (CVD), use of antihypertensive drugs, use of insulin, family history of diabetes, family history of hypertension.

Lifestyle and habits
Smoking status, years of smoking, smoking amount, alcohol use, years of drinking, drinking amount.

Biochemical parameters
Fasting glucose, HbA1c, total cholesterol (TC), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), serum creatinine (SCr), uric acid (UA), blood urea nitrogen (BUN), endogenous creatinine clearance rate (CCr), microalbuminuria (MAU).

FIGURE 1 | Machine learning flowchart of this study. ML, machine learning; XGBoost, extreme gradient boosting; ANN, artificial neural network; AdaBoost, adaptive boosting; GBM, gradient boosting machine.

Table 1 shows the potential variables included in the model, including age, sex, body mass index (BMI), waist-to-hip ratio (WHR), waist circumference, blood pressure, lifestyle, and medical and family history. Lifestyle information included smoking, alcohol consumption, and dietary habits. Participants who smoked at least one cigarette a day for 6 months were defined as smokers, and those who drank alcohol at least once a week for 6 months were defined as drinkers. Duration of diabetes was defined as the time between the first diagnosis of diabetes by an endocrinologist and entry into this study; for newly diagnosed diabetes, the duration was set to 0 years. In addition, laboratory serum parameters were included, as these tests are routinely performed in government health institutions in China.
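The smoking, drinking, and diabetes-duration definitions can be expressed as derived-variable functions. This sketch assumes that "for 6 months" means at least 6 months and that a missing prior diagnosis date marks a case newly diagnosed at the survey; the function names are hypothetical.

```python
from datetime import date
from typing import Optional


def is_smoker(cigarettes_per_day: float, months_smoking: float) -> bool:
    # Smoker: at least one cigarette a day, sustained for at least 6 months
    return cigarettes_per_day >= 1 and months_smoking >= 6


def is_drinker(drinks_per_week: float, months_drinking: float) -> bool:
    # Drinker: alcohol at least once a week, sustained for at least 6 months
    return drinks_per_week >= 1 and months_drinking >= 6


def diabetes_duration_years(first_diagnosis: Optional[date], study_entry: date) -> float:
    # Newly diagnosed at the survey (no earlier diagnosis date): duration is 0 years
    if first_diagnosis is None:
        return 0.0
    # Elapsed time from first endocrinologist diagnosis to study entry, in years
    return max(0.0, (study_entry - first_diagnosis).days / 365.25)
```

For instance, a participant first diagnosed on 1 January 2002 and enrolled on 1 January 2012 receives a duration of roughly 10 years.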

Statistical Analysis
All analyses were performed in R software version 4.0.3. Demographic and clinical characteristics are presented as mean ± standard deviation (SD) for continuous variables and as number and percentage for categorical variables. Differences between RDR and non-RDR patients were evaluated using the independent t-test for normally distributed continuous variables, the Mann-Whitney test for non-normally distributed variables, and the chi-squared test for categorical variables. All tests were two-tailed, and p < 0.05 was considered statistically significant. Figure 1 shows the analytical framework of this study. Eight algorithms were used to construct models for detecting RDR: extreme gradient boosting (XGBoost), random forest, naïve Bayes, k-nearest neighbor (KNN), AdaBoost, light gradient boosting machine (LightGBM), artificial neural network (ANN), and logistic regression. ML techniques can quantify the importance of each variable, i.e., its contribution to the fitted model. To identify the most important features for diagnosing RDR (Table 1), we used XGBoost, random forest, naïve Bayes, and KNN to rank the importance of the variables. The top 10 variables present in the rankings of all four ML algorithms were entered into subsequent model development. After data cleaning, the data were randomly divided into a training set and a test set (at an 80:20 ratio) to assess the reliability of these classifiers. To obtain realistic and generalisable performance estimates as well as conservative confidence intervals, five-fold cross-validation and variance estimation were performed. Each model was fitted on the training dataset, and its performance was assessed on the test dataset. The area under the receiver operating characteristic (ROC) curve (AUC) was calculated to evaluate the performance of each model.
Table 1; Figure 3).
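The 80:20 split and five-fold cross-validation described in the Statistical Analysis can be illustrated with stratified sampling, which keeps the proportion of RDR cases similar across partitions. The text does not say whether sampling was stratified, so stratification, the seed, and the helper names below are assumptions; only the Python standard library is used.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_frac: float = 0.2,
                     seed: int = 2021) -> Tuple[List[int], List[int]]:
    """Split indices into (train, test), sampling each class separately so the
    RDR proportion is preserved in both parts."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def stratified_kfold(labels: Sequence[int], k: int = 5,
                     seed: int = 2021) -> List[List[int]]:
    """Assign indices to k folds, dealing each shuffled class round-robin so
    every fold receives a near-equal share of RDR and non-RDR cases."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return [sorted(f) for f in folds]
```

In each cross-validation round, one fold would serve as the validation set and the remaining four would be used for fitting; applying `stratified_kfold` to the training labels only would keep the 20% test set untouched.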
Table 3 shows the discriminative performance of the algorithms using five-fold cross-validation and an 80:20 split into training and validation sets. Figure 4 shows the performance of the top five models. The XGBoost algorithm was nominally the best, with an AUC of 0.816 (95% CI: 0.812, 0.820). The AUCs for logistic regression, AdaBoost, naïve Bayes, and random forest were 0.766 (95% CI: 0.756, 0.776), 0.754 (95% CI: 0.744, 0.764), 0.753 (95% CI: 0.743, 0.763), and 0.705 (95% CI: 0.697, 0.713), respectively.
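For reference, the AUC equals the probability that a randomly chosen RDR case receives a higher model score than a randomly chosen non-RDR case (the Mann-Whitney statistic, with ties counted as one half), and fold-wise AUCs can be summarized with a normal-approximation interval. This is a simple swapped-in estimator for illustration: the text does not specify the variance estimation method behind the reported intervals, so values from this sketch need not match them.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def auc_rank(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties = 1/2).

    Uses an O(n_pos * n_neg) pairwise count, which is fine for ~1,400 patients.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cv_auc_ci(fold_aucs: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Mean fold AUC with a normal-approximation 95% CI from the between-fold
    standard error (requires at least two folds)."""
    m = mean(fold_aucs)
    se = stdev(fold_aucs) / math.sqrt(len(fold_aucs))
    return m, m - z * se, m + z * se
```

For example, `auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75: three of the four positive-negative pairs are ranked correctly.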

DISCUSSION
This study developed and validated an ML-based model for screening RDR in a Chinese population using common, readily available variables. After the risk factors were ranked by importance, the top 10 were used to build models with eight algorithms. The XGBoost classifier exhibited the best performance, with an AUC of 0.816, which was validated on a held-out subset of the study population. To our knowledge, this is the first diagnostic model for RDR in a Chinese diabetic population based on ML and simple variables, and it has the potential to enable accurate and rapid RDR screening. State-of-the-art ML methods were adopted in this study. Traditional regression analysis relies on hypothesis-driven assumptions, whereas the ML techniques used here do not require a predetermined functional form. This allows data-driven exploration of non-linear patterns that predict risk for a given individual, i.e., precise risk stratification (10,19). In this study, the importance ranking showed that duration of diabetes, HbA1c, systolic blood pressure, TG, BMI, serum creatinine, age, education level, duration of hypertension, and income level were the 10 most important factors for RDR. Furthermore, these ML algorithms require only minimal manual input during model development, which is particularly important because ML models can readily incorporate new data to be updated and optimized, thereby continuously improving their discriminative performance over time (20). Our models may help to target DR screening at high-risk individuals and to reduce the frequency of ocular examinations in low-risk individuals (21).
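The feature-selection step behind this ranking (top variables present in the importance rankings of all four algorithms) admits more than one reading. One plausible reading, sketched below, widens the top-N window of each model's ranked list until at least 10 features are shared by all models, then orders the shared features by mean rank. The function name and the toy rankings are hypothetical and are not the study's importance scores.

```python
from statistics import mean
from typing import Dict, List, Sequence


def consensus_top_features(rankings: Dict[str, Sequence[str]], n_keep: int = 10) -> List[str]:
    """Return n_keep features shared by the top-N of every model's ranking.

    N starts at n_keep and grows until enough features are shared; the result
    is ordered by mean rank across models, with ties broken alphabetically.
    """
    lists = [list(r) for r in rankings.values()]
    max_depth = min(len(r) for r in lists)
    if max_depth < n_keep:
        raise ValueError("every ranking must contain at least n_keep features")
    shared: set = set()
    for depth in range(n_keep, max_depth + 1):
        # Features appearing within the top-`depth` positions of every ranking
        shared = set(lists[0][:depth]).intersection(*(r[:depth] for r in lists[1:]))
        if len(shared) >= n_keep:
            break
    ordered = sorted(shared, key=lambda f: (mean(r.index(f) for r in lists), f))
    return ordered[:n_keep]
```

A toy usage with `n_keep=2`: given rankings such as `{"xgboost": ["duration", "hba1c", "sbp", "tg"], "rf": ["hba1c", "duration", "tg", "sbp"], "nb": ["duration", "sbp", "hba1c", "tg"], "knn": ["hba1c", "tg", "duration", "sbp"]}`, no feature is shared at depth 2, but at depth 3 the function returns `["duration", "hba1c"]`. Averaging ranks rather than intersecting lists would be an equally defensible alternative.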
Few studies have addressed risk stratification of DR based on ML and non-ocular parameters. Azizi-Soleiman et al. reported a model for detecting DR in Iranian patients based on outpatient clinical data (22). Trained on data from 1,782 patients (without cross-validation), their logit model obtained an AUC of 0.760 using backward elimination for feature selection (22). Tsao et al. divided clinical data from 536 patients in Taiwan into training and validation sets (at an 80:20 ratio) and compared four models (support vector machine, decision tree, ANN, and logistic regression) for DR detection; the support vector machine performed best, with an AUC of 0.839 (14). Yao et al. reported that a back-propagation artificial neural network outperformed logistic regression for DR detection, with AUCs of 0.84 and 0.77, respectively (13). These studies were based on hospital data, whereas population-based data better reflect the reality of DR screening programmes (5). This study applied ML techniques to population-based data and demonstrated their usefulness for RDR detection, with AUCs similar to those reported in hospital-based studies.
The XGBoost algorithm, which has attracted attention in recent years owing to its strong performance and fast training, performed best in this study. It has also been evaluated in several other ocular diseases. Oh et al. compared four ML models (support vector machine, C5.0, random forest, and XGBoost) for detecting glaucoma and reported that XGBoost performed best, with an AUC of 0.945, accuracy of 0.947, sensitivity of 0.941, and specificity of 0.950 (23). Xu et al. demonstrated that the XGBoost classifier had the highest accuracy for predicting subretinal fluid absorption at 1, 3, and 6 months in patients with central serous chorioretinopathy (24). Wu et al. reported that intraocular pressure in children with myopia treated with topical atropine can be predicted using ML methods, with XGBoost ranking as the best predictive model (25). The present study suggests that XGBoost is also a useful tool for DR screening.
This study has several strengths. First, all variables were derived from easily accessible non-ocular examinations and questionnaires. The model is therefore especially suitable for primary hospitals and diabetes clinics, as it does not require expensive laboratory tests, ophthalmic equipment, or eye specialists, which makes it particularly useful in areas of low socio-economic status with limited health resources. Second, the model was derived from a large population-based survey in China, which supports its representativeness and generalisability. Third, most previous studies divided smoking and alcohol consumption into only two categories (yes or no) and therefore could not capture the effect of frequency and quantity on disease. The importance ranking analysis showed that the amount and duration of smoking and drinking were also important for RDR. Finally, the ranking of risk factors might provide insight into the prevention of DR. This study also has limitations. Only Chinese adults were included, and ethnic variations in DR onset and progression have been confirmed in population studies (26,27); the findings therefore need to be replicated in other ethnic groups. In addition, this study evaluated the feasibility and performance of ML but not its implementation. Nevertheless, a population-based study is especially suited to assessing the initial real-world feasibility of ML algorithms.

CONCLUSION
In this secondary analysis of a large-scale population-based survey, we extracted demographic variables, laboratory test results, and medical and family history, and then applied different ML algorithms to rank risk factors and identify RDR. The XGBoost algorithm achieved the best performance based on 10 simple variables. Using ML algorithms with epidemiological risk factors, rather than ophthalmic examinations, to identify referable patients may reduce costs and has high application value in resource-poor areas in China.

DATA AVAILABILITY STATEMENT
Data are available from the authors upon reasonable request and with permission of Guangdong Provincial People's Hospital.

ETHICS STATEMENT
Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
CY and QL: conceptualization, investigation, formal analysis, and writing-original draft. HG: validation, resources, material support, administrative, and writing-review and editing. MZ: investigation, resources, material support, and administrative. LZ: investigation, material support, and review and editing. GZ: data analysis and review and editing. QM: project administration, conceptualization, investigation, supervision, formal analysis, and writing-review and editing. YC: project administration, conceptualization, investigation, supervision, formal analysis, and writing-review and editing. All authors contributed to the article and approved the submitted version.