Deep learning assisted diagnosis system: improving the diagnostic accuracy of distal radius fractures

Objectives To explore an intelligent detection technology based on deep learning algorithms to assist the clinical diagnosis of distal radius fractures (DRFs), and to compare it with human performance to verify the feasibility of this method. Methods A total of 3,240 patients (fracture: n = 1,620, normal: n = 1,620) were included in this study, with a total of 3,276 wrist joint anteroposterior (AP) X-ray films (1,639 fractured, 1,637 normal) and 3,260 wrist joint lateral X-ray films (1,623 fractured, 1,637 normal). We divided the patients into a training set, a validation set and a test set in a ratio of 7:1.5:1.5. The deep learning models were developed using the data from the training and validation sets, and their effectiveness was then evaluated using the data from the test set. The diagnostic performance of the deep learning models was evaluated using receiver operating characteristic (ROC) curves, area under the curve (AUC), accuracy, sensitivity, and specificity, and was compared with that of medical professionals. Results The deep learning ensemble model had excellent accuracy (97.03%), sensitivity (95.70%), and specificity (98.37%) in detecting DRFs. For the AP view, the accuracy was 97.75%, the sensitivity 97.13%, and the specificity 98.37%; for the lateral view, the accuracy was 96.32%, the sensitivity 94.26%, and the specificity 98.37%. When the wrist joint was counted as the unit, the accuracy was 97.55%, the sensitivity 98.36%, and the specificity 96.73%. On these measures, the ensemble model outperformed both the orthopedic attending physician group and the radiology attending physician group. Conclusion This deep learning ensemble model has excellent performance in detecting DRFs on plain X-ray films. Using this artificial intelligence model as a second expert to assist clinical diagnosis is expected to improve the accuracy of diagnosing DRFs and enhance clinical work efficiency.


Introduction
Distal radius fracture (DRF) is one of the most common fractures, accounting for about 20% of all fractures and about 17-25% of emergency cases (1,2). DRFs are more common in middle-aged and elderly patients with osteoporosis, typically resulting from low-energy injuries such as falls, whereas in young patients they are mostly caused by high-energy injuries (3). As the population ages, the incidence of DRFs is also increasing year by year (4).
As a fast and inexpensive imaging tool, X-ray examination is the preferred method for evaluating wrist injuries (5). However, in emergency departments or outpatient clinics, non-orthopedic surgeons or young radiologists are often the first doctors to evaluate X-rays and need to make urgent decisions. Unfortunately, misdiagnosis (especially missed fractures) is common in these scenarios due to factors such as heavy workloads, fatigue, and lack of experience (6).
If patients' fractures are not diagnosed in a timely manner, the result may be delayed treatment, malunion and osteoarthritis, which seriously affect functional recovery, reduce the ability to live independently, and lower quality of life (7). Elderly patients in particular, owing to poor physical fitness and decreased tissue function, are more likely to have a poor prognosis if fractures are not treated in time (8). Therefore, accurate and efficient assistance technology for automated fracture detection has become a focus of attention.
Deep learning is a machine learning method that achieves automatic feature extraction and representation by constructing a multilayer neural network, in order to complete tasks such as data classification or regression (9,10). Compared with traditional machine learning methods, deep learning does not require manual selection of features. Instead, it automatically learns features through neural networks, which greatly reduces the difficulty and complexity of feature engineering (11,12). In recent years, with progress in big data, high-performance computing, and algorithm optimization, deep learning has become a leading machine learning technology.
Image segmentation, object detection, and classification based on deep learning technology have been successfully applied in the field of medical images, bringing new opportunities for the establishment of computer-aided medical imaging diagnosis systems. Deep learning techniques have been widely used in image analysis for a variety of diseases, including the detection of skin cancer, diabetic retinopathy, thyroid tissue abnormalities, and lung nodules (13-16). In recent years, deep learning has also been successfully applied to the identification and severity assessment of bone and joint lesions, such as knee joint lesions, osteoarthritis, and spinal degenerative lesions. In addition, some deep learning-based models have been used to assess bone age (17). Recently, more and more studies have shown that artificial intelligence models based on deep learning have great potential in fracture recognition, classification, segmentation, and visual interpretation (18-20). These models can significantly improve the accuracy of diagnosis, treatment, and prognosis evaluation, indicating their enormous potential for application in fractures.
In this study, we constructed a deep learning model to diagnose distal radius fractures using wrist AP and lateral X-ray images.

Patients
This study was approved by our institutional ethics committee (IRB No. 0840), and the requirement for informed consent was waived due to the retrospective nature of the study and negligible risks. All methods of this study were conducted in accordance with the Declaration of Helsinki.
Eligible patients were screened from the database of Union Hospital Affiliated to Tongji Medical College of Huazhong University of Science and Technology. Ultimately, 1,620 patients with DRFs who received treatment at the hospital from January 2014 to January 2023 were included. The inclusion criteria were: (1) age over 18 years; (2) diagnosed with DRF; (3) received X-ray examination. The exclusion criteria were: (1) age under 18 years; (2) old fractures, pathological fractures, and re-fractures after internal fixation; (3) foreign objects, such as plaster, jewelry, and clothing, that affect the final image judgment. Bilateral distal radius fractures were not an exclusion criterion. Furthermore, 1,620 patients without fractures whose diagnosis was sprain or carpal tunnel syndrome were included. Finally, all patients included in the study were divided into a training set, a validation set and a test set according to the ratio of 7:1.5:1.5. The detailed inclusion process is shown in Figure 1.
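The 7:1.5:1.5 division described above can be sketched as a patient-level split, so that the AP and lateral images of one patient stay in the same subset. This is a minimal illustration only: the function name `split_patients`, the fixed random seed, and the use of integer patient IDs are assumptions, and the text does not state whether the split was stratified by diagnosis.

```python
import random

def split_patients(patient_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle patient IDs and split them into train/validation/test lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seed is an illustrative choice
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# 3,240 patients, as reported in the text; the IDs themselves are hypothetical.
train, val, test = split_patients(range(3240))
print(len(train), len(val), len(test))  # 2268 486 486
```

Splitting by patient rather than by image avoids placing two views of the same wrist in different subsets.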

Diagnosis and annotation
The diagnosis of DRFs was mainly based on patients' anteroposterior (AP) and lateral radiographs of the wrist joint, combined with the patients' medical history. In some cases, computed tomography (CT) was also used for a more comprehensive analysis. Two chief physicians, one from the orthopedics department and the other from the radiology department, each with over 15 years of experience, collaborated to make the diagnosis. In the case of disagreement, another chief orthopedic expert with 20 years of clinical experience discussed the case with the two physicians to reach a final conclusion.
All obtained wrist joint radiographic images were saved in high-quality Joint Photographic Experts Group (JPEG) format. To protect the privacy of patients, all personal information (including name, gender, age and identity) on the X-rays was hidden in the final image.
The region of the distal radius was drawn as a region of interest (RoI) using the rectangle tool, and was manually labeled according to the final imaging diagnosis (labels were divided into two categories: fracture and normal). We used the Labelme software package for manual labeling. Figure 2 provides a detailed illustration of our labeling process.
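As an illustration of how such rectangle annotations might be consumed downstream, the sketch below converts a Labelme-style rectangle (stored as two corner points) into an axis-aligned bounding box. The annotation dictionary, the file name, and the helper `rectangle_to_bbox` are hypothetical; only the general structure (a `shapes` list whose entries carry `label`, `points`, and `shape_type`) follows the Labelme JSON layout.

```python
def rectangle_to_bbox(shape):
    """Convert a two-corner rectangle to (label, x_min, y_min, x_max, y_max)."""
    (x1, y1), (x2, y2) = shape["points"]
    # The two corners may be stored in any order, so sort the coordinates.
    return (shape["label"], min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

# Hypothetical annotation mimicking the structure of a Labelme JSON file.
annotation = {
    "imagePath": "wrist_ap_0001.jpg",
    "shapes": [
        {"label": "fracture", "shape_type": "rectangle",
         "points": [[412.0, 530.5], [250.0, 310.0]]},
    ],
}

boxes = [rectangle_to_bbox(s) for s in annotation["shapes"]
         if s["shape_type"] == "rectangle"]
print(boxes)  # [('fracture', 250.0, 310.0, 412.0, 530.5)]
```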

Data set
Finally, we extracted a total of 3,276 AP X-ray images of the distal radius (1,639 fractures and 1,637 normal) and 3,260 lateral X-ray images of the distal radius (1,623 fractures and 1,637 normal) from the Picture Archiving and Communication System (PACS) at Union Hospital, covering a period from January 2014 to January 2023. A total of 16 lateral fracture radiographs were excluded because these images were taken in a non-standard lateral position, possibly due to severe wrist pain that prevented standard positioning. These images were used for training, validation, and testing. There are a total of 2,296 AP view images in the training set (1,149 with fractures, 1,147 normal) and 2,283 lateral view images (1,136 with fractures, 1,147 normal).

Image processing
To improve the performance of object detection, we preprocessed the data. Specifically, we scaled each input image so that its short edge was 800 pixels and adjusted the long edge proportionally. In addition, each image had a 50% chance of being horizontally flipped, which increased the diversity of the images. Finally, we normalized the images, converting the original images into a standard format through a series of transformations to improve the stability of model training. The normalization parameters were mean = [123.675, 116.28, 103.53] and std = [58.395, 57.12, 57.375].
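The three preprocessing steps can be sketched in plain Python as follows. This is a simplified illustration rather than the pipeline actually used: the function names are assumptions, images are represented as nested lists of RGB tuples, and pixel interpolation during resizing is omitted (only the target size is computed).

```python
import random

MEAN = (123.675, 116.28, 103.53)  # per-channel mean from the text
STD = (58.395, 57.12, 57.375)     # per-channel std from the text

def resized_shape(height, width, short_edge=800):
    """Target (height, width) after scaling the short edge to `short_edge`."""
    scale = short_edge / min(height, width)
    return round(height * scale), round(width * scale)

def maybe_flip(image, rng, p=0.5):
    """Horizontally flip an image (list of pixel rows) with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

def normalize(image):
    """Subtract the channel mean and divide by the channel std for each pixel."""
    return [[tuple((c - m) / s for c, m, s in zip(px, MEAN, STD)) for px in row]
            for row in image]

print(resized_shape(1000, 2000))  # (800, 1600)

# A tiny one-row "image" with two RGB pixels, for illustration only.
tiny = [[(123.675, 116.28, 103.53), (255, 255, 255)]]
augmented = normalize(maybe_flip(tiny, random.Random(0)))
```

In a detection setting, the annotated bounding boxes have to be scaled and flipped with the same parameters as the image.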

Development of deep learning ensemble model
We chose three different types of deep learning models (Figure 3): one-stage RetinaNet (21), two-stage Faster RCNN (22), and multistage Cascade RCNN (23). As a one-stage algorithm, RetinaNet performs classification and bounding-box regression directly on the entire image to generate object detection results, which makes it fast and computationally inexpensive. In addition, RetinaNet incorporates a Feature Pyramid Network (FPN), which can effectively handle objects of different scales, further improving detection accuracy and robustness (24). The workflow of Faster RCNN is divided into two stages: region proposal and target classification. In the region proposal stage, a neural network structure called the Region Proposal Network (RPN) generates candidate regions on the input image by sliding windows. In the target classification stage, each RPN-generated candidate box is converted to a fixed-size feature map by RoI pooling, and then entered into a In the second stage, a cascade classifier method is adopted to chain a series of classifiers together, with each classifier being stricter than the previous one, to further filter the candidate boxes and select more accurate positive samples. In the third stage, a regression network is used to fine-tune the filtered candidate boxes to further improve detection accuracy (26).
Based on the three pre-trained models, we developed an ensemble model combining multiple deep learning algorithms to judge whether the distal radius in the input X-ray image is fractured or not. When at least two of the three models agreed on a label (fracture or normal), that label was taken as the joint diagnosis, and the average probability of fracture was calculated (Figure 4). We used the Ubuntu 16.04 operating system to run the PyTorch deep learning framework in an environment equipped with an NVIDIA V100 GPU (CUDA version 10.2, cuDNN version 7.6.5) and 32 GB of video random access memory (VRAM).
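The voting rule described above can be sketched as a two-out-of-three majority vote. Several details here are assumptions made for illustration: the per-model score thresholds are taken from the optimal thresholds reported later in the Results (0.71, 0.65, 0.66), the reported probability is averaged over the models that agree with the majority (the text does not specify which scores enter the average), and the function and key names are hypothetical.

```python
# Per-model score thresholds; reusing the Youden-optimal values is an assumption.
THRESHOLDS = {"retinanet": 0.71, "faster_rcnn": 0.65, "cascade_rcnn": 0.66}

def ensemble_predict(scores, thresholds=THRESHOLDS):
    """Majority vote over three detectors, plus the mean score of the majority."""
    votes = {name: scores[name] >= thresholds[name] for name in thresholds}
    is_fracture = sum(votes.values()) >= 2  # at least two models say "fracture"
    # Assumption: average only the scores of the models that form the majority.
    agreeing = [scores[name] for name, vote in votes.items() if vote == is_fracture]
    mean_prob = sum(agreeing) / len(agreeing)
    return ("fracture" if is_fracture else "normal"), mean_prob

label, prob = ensemble_predict(
    {"retinanet": 0.90, "faster_rcnn": 0.70, "cascade_rcnn": 0.40}
)
print(label, round(prob, 2))
```

With an odd number of models and a binary label, a majority always exists, so no tie-breaking rule is needed.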

Evaluation of deep learning performance
An independent test set was used to test the performance of the trained deep learning models and evaluate their ability to recognize fractures in X-ray images. We evaluated the diagnostic performance of the models on three types of radiographic images: AP + lateral view, AP view, and lateral view (Figure 5A), and assessed the final clinical diagnosis results with the wrist joint as the unit (each wrist joint has one AP image and one lateral image) (Figure 5B). Different score probability thresholds were set for the trained deep learning models to draw the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) was calculated to evaluate the performance of the

Assessment of diagnostic performance by medical personnel
To evaluate the diagnostic performance of medical personnel, we established an orthopedic diagnosis team consisting of three attending orthopedists and a radiology diagnosis team consisting of three attending radiologists. All the included orthopedic attending physicians had at least 3 years of experience in trauma orthopedics and possessed professional X-ray image reading skills. The attending radiologists had at least 3 years of professional experience in radiological diagnosis. None of these physicians participated in data collection or labeling.
Participating physicians were asked to independently analyze and diagnose the data in the test set. For the diagnostic tests, the test set data were shuffled using a randomization procedure. To ensure consistent conditions between the deep learning model and the physicians, the physicians were not informed of the injury mechanism or patient age at any point during testing. In addition, to avoid any intra-group or inter-group influence, none of the participants was informed of the research plan, and the diagnostic tests were conducted separately.
Finally, we calculated the average accuracy, sensitivity and specificity of each group of physicians and compared their performance with that of the deep learning model. If either the anteroposterior or the lateral image is diagnosed as fractured, the wrist joint is diagnosed as DRF; if no fracture is detected in either the anteroposterior or the lateral image, the distal radius region is judged as unfractured.
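The wrist-level rule and the three performance measures can be written out as a short sketch. The function names and the toy labels below are illustrative assumptions and are not taken from the study data.

```python
def wrist_diagnosis(ap_fractured, lateral_fractured):
    """A wrist is called DRF if either view is read as fractured."""
    return ap_fractured or lateral_fractured

def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # fractures found
    tn = sum(1 for t, p in pairs if not t and not p)  # normals cleared
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    fn = sum(1 for t, p in pairs if t and not p)      # missed fractures
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy example with four wrists (True = fractured); values are made up.
truth = [True, True, False, False]
ap_reads = [True, False, False, True]
lateral_reads = [False, False, False, False]
wrist_reads = [wrist_diagnosis(a, l) for a, l in zip(ap_reads, lateral_reads)]
print(diagnostic_metrics(truth, wrist_reads))
```

Because the rule is a logical OR over the two views, it can only raise sensitivity and lower specificity relative to either view alone, which matches the direction of the wrist-level results reported for the ensemble model.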

Statistical analyses
Continuous variables were presented as median [interquartile range (IQR)]. Categorical variables were expressed as counts and percentages. For the comparison of baseline characteristics among different datasets, the ANOVA test was used for continuous variables and the χ² test was used for categorical variables. Accuracy, sensitivity, and specificity were selected as measures of diagnostic performance, and the corresponding 95% confidence intervals were estimated using bootstrapping with 1,000 bootstrap resamples. The ANOVA test was used to compare the diagnostic performance of the ensemble model, orthopedists and radiologists. The bootstrapping was performed using the "boot" package of R 4.1.2 (The R Foundation for Statistical Computing, Vienna, Austria). Other statistical analyses were performed using SAS Statistics software (version 9.4, SAS Institute Inc., Cary, North Carolina, United States). All statistical tests were two-sided, and p < 0.05 was considered statistically significant.
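For readers unfamiliar with the procedure, a bootstrap confidence interval for a proportion such as accuracy can be sketched as below. The study used the R "boot" package; this Python version is only an illustration, and the percentile method, the random seed, and the example counts (475 correct out of 490) are assumptions rather than details from the paper.

```python
import random

def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion; `outcomes` is a list of 0/1."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the cases with replacement and recompute the proportion each time.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical test set: 1 = correctly diagnosed image, 0 = error.
outcomes = [1] * 475 + [0] * 15
point_estimate = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"accuracy {point_estimate:.4f}, 95% CI {low:.4f} to {high:.4f}")
```

The same resampling loop applies to sensitivity and specificity by restricting `outcomes` to the fractured or the normal cases, respectively.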

Demographic data of patients
The age difference between the fractured and non-fractured groups was not statistically significant (p = 0.433), while there was a statistically significant gender difference between the two groups (p < 0.001). In addition, there were no significant differences in age (p = 0.619) or gender (p = 0.817) among the training set, validation set, and test set. Detailed statistical analysis results are provided in Tables 1, 2.

Performance of the deep learning models
After training, the algorithm was able to use the learned features to diagnose images in the test database. If the diagnosis is DRF, a red rectangle is displayed on the suspicious area together with the predicted probability value (as shown in Figure 6).
The deep learning diagnostic models were evaluated on the test set, and the ROC curves of RetinaNet, Faster RCNN, and Cascade RCNN are shown in Figure 7. The AUC of RetinaNet for diagnosing fractures on the test set is 0.9706, with an AUC of 0.9780 for AP images and an AUC of 0.9631 for lateral images. The AUC of Faster RCNN for diagnosing fractures on the test set is 0.9658, with an AUC of 0.9761 for AP images and an AUC of 0.9556 for lateral images. The AUC of Cascade RCNN for diagnosing fractures on the test set is 0.9644, with an AUC of 0.9786 for AP images and an AUC of 0.9500 for lateral images.
For RetinaNet, Youden's index reached its maximum of 91.41% at an optimal score threshold of 0.71. For Faster RCNN, the maximum Youden's index was 91.41% at an optimal score threshold of 0.65. For Cascade RCNN, the maximum Youden's index was 90.79% at an optimal score threshold of 0.66. The detailed accuracy, sensitivity, and specificity of the three deep learning models are shown in Table 3.
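Youden's index is J = sensitivity + specificity - 1, and the optimal threshold is the score cut-off at which J is largest. The sketch below shows how ROC points, the AUC, and the Youden-optimal threshold could be computed from per-image scores. The function names and the ten toy scores are assumptions for illustration and do not reproduce the study's data.

```python
def roc_points(y_true, scores):
    """(threshold, sensitivity, specificity) for every distinct score cut-off."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((thr, tp / pos, 1 - fp / neg))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve (x = 1 - specificity, y = sensitivity)."""
    curve = [(0.0, 0.0)] + [(1 - spec, sens) for _, sens, spec in points] + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(curve, curve[1:]))

def youden_threshold(points):
    """ROC point that maximizes J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)

# Toy data: 1 = fracture, 0 = normal; scores are made-up model outputs.
y_true = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
points = roc_points(y_true, scores)
best_thr, best_sens, best_spec = youden_threshold(points)
print(f"AUC = {auc(points):.2f}, best threshold = {best_thr}")
```

In this toy example the threshold 0.60 gives a sensitivity of 1.0 and a specificity of 0.8, so J is 0.8 there and lower at every other cut-off.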
Compared with RetinaNet, Faster RCNN, and Cascade RCNN, the ensemble model performed better on the test set, with an accuracy of 97.03% (95.71-97.96%), a sensitivity of 95.70% (93.44-97.13%) and a specificity of 98.37% (96.73-99.18%). The ensemble model outperformed the individual models with higher accuracy and lower standard deviation (Figure 8). When the diagnostic performance for DRFs was counted in units of wrist joints, the accuracy, sensitivity, and specificity reached 97.55% (95.71-98.57%), 98.36% (95.90-99.59%), and 96.73% (93.47-98.37%), respectively, all of which were superior to those of any single model (Table 3). The detailed results can be seen in the confusion matrix of Figure 9. We therefore used this deep learning ensemble model for DRF detection.

Comparison of deep learning models and clinical physicians
The results of the ensemble model were compared with those of the clinician groups. The results are shown in Table 5. The ensemble model outperformed the orthopedist and radiologist groups in terms of diagnostic accuracy, sensitivity, and specificity. When the wrist joint was used as the unit of analysis, the ensemble model still outperformed the orthopedist group and the radiologist group in terms of diagnostic accuracy, sensitivity, and specificity.

Discussion
The wrist joint is one of the most important joints in the body, with a high frequency of movement and a relatively high requirement for functional recovery if injured (2). The misdiagnosis or delayed treatment of DRFs can cause traumatic arthritis of the wrist joint, which can seriously affect the function of the hand. For elderly people in particular, recovery after a fracture is relatively slow. If not diagnosed or treated in a timely manner, a DRF may lead to adverse consequences such as weakness, deformity, shortening, stiffness, pain, and limited mobility of the wrist joint, thereby affecting the quality of daily life (3,8). It can also harm mental health, increasing the psychological burden on elderly patients through anxiety, depression, and loss of independence (4). Therefore, timely and accurate diagnosis after a fracture is crucial to treatment and rehabilitation. In this study, we constructed an ensemble model consisting of three different deep learning algorithms for the detection of DRFs. Our research confirmed that the trained ensemble model demonstrates excellent performance in distinguishing fractured from unfractured distal radii. The overall diagnostic accuracy of the model reached 97.03%, with a sensitivity of 95.70% and a specificity of 98.37%. These results were better than the performance of orthopedic attending physicians and radiology attending physicians. In the separate analysis of AP or lateral radiographs, the model was also significantly better than the attending physicians in orthopedics and radiology. When using the wrist joint as a unit and simultaneously inputting the AP and lateral X-ray images for comprehensive diagnosis, the accuracy reached 97.55%, with a sensitivity and specificity of 98.36% and 96.73%, respectively, which is better than that of physicians in orthopedics and radiology.
Wrist X-ray examinations in the AP and lateral views are the most commonly used imaging examinations for diagnosing DRFs. However, the misdiagnosis of fractures on radiographs is a common problem for non-specialist physicians or radiology residents, especially in emergency environments, and can easily lead to extra harm or delayed treatment for patients (27). According to relevant studies, misdiagnosis of fractures accounts for 24% of harmful diagnostic errors in the emergency department, and misdiagnosis of hand and wrist fractures accounts for 29% of all misdiagnosed fractures (28). In addition, for patients admitted during night shifts, inconsistent imaging diagnoses of fractures are more common, which may be related to non-expert reading and fatigue (29).
Deep learning is a branch of artificial intelligence that trains models by inputting data such as images, text, or sound, and enables models to learn to perform more complex classification tasks (30). Compared to traditional machine learning methods, deep learning has higher performance. In the field of medical image analysis, trained deep learning algorithms can simulate clinical doctors' judgments and accurately detect fractures (31). Deep learning algorithms for fracture detection offer significant advantages in clinical settings. Firstly, AI can be an effective tool for triage in emergency situations. AI can perform preliminary screening and flag positive results, allowing doctors to prioritize the imaging data with fracture signs and reduce the adverse effects of delayed diagnosis. Secondly,
FIGURE 8 On the test set, our ensemble model (red line) shows higher accuracy and stability than RetinaNet (blue line), Faster RCNN (orange line), and Cascade RCNN (green line). This study uses the bootstrap method. The X-axis and Y-axis represent the resampling iteration and the accuracy of each resample, respectively.
Previous studies investigated the feasibility of using deep learning to detect fractures on X-ray films and showed good results, which is consistent with our study (Table 6). Zech et al. (17) trained the Faster RCNN model for the detection of carpal fractures in children, and reached an accuracy of 88%, a sensitivity of 88%,
FIGURE 9 Confusion matrix.
(18-20, 36). However, we must be aware that algorithms still inevitably produce diagnostic errors and potential medical risks. Therefore, it is currently best to use AI as a second expert to assist clinicians in making a diagnosis, rather than replacing doctors for the final diagnosis.
Our study also has some limitations. (1) This is a retrospective study, and the imaging examinations were not accompanied by complete clinical medical records. During the testing process, the participating doctors made their diagnoses from imaging data only, just like the deep learning models. However, in a real clinical environment, non-radiologists can examine patients physically and obtain detailed medical history information, while radiologists can also access patient medical records to identify areas of concern; combined with X-ray films, this information improves the sensitivity and specificity of fracture diagnosis. Therefore, the results of the physician group in this study only represent the level achieved when diagnosing from imaging data alone, and cannot represent the diagnostic level in a real clinical environment. (2) The dataset contains data from a single medical center only. Although the dataset is large and the experimental results demonstrate excellent performance of the model, obtaining more images from different medical institutions would increase the diversity of data sources, which may further improve the reliability of the results. (3) The ensemble model we trained can be used for detecting DRFs, but cannot further classify the type of fracture. Accurately determining the type of fracture is also important for treatment, as different types require different treatment plans. In the future, we will further develop related models to achieve intelligent classification of DRFs and provide assistance in determining more accurate treatment plans. Although these factors may affect the performance of our detection model, our research results are still worth serious consideration. This fast, accurate, and intelligent fracture detection algorithm can be used by junior doctors in emergency rooms and outpatient clinics to assist clinical diagnosis. This helps reduce not only clinical workloads but also the risk of misdiagnosis.

Conclusion
We have developed an ensemble model based on deep learning algorithms for detecting DRFs and demonstrated its excellent diagnostic performance. The results of this study demonstrate the feasibility of fracture detection technology based on deep learning, and will contribute to further research on fracture detection in more types and locations in the future. At the same time, we will build larger datasets in the future and train advanced algorithms to achieve automatic detection of DRFs and intelligent determination of fracture types. In summary, this fast and accurate diagnostic tool is expected to become a second expert for doctors in clinical practice, improving the accuracy of DRF diagnosis and reducing their burden in clinical work.
The aims of this study were: (1) To train and evaluate the performance of DRF diagnosis using the one-stage model RetinaNet, the two-stage model Faster RCNN, and the multi-stage model Cascade RCNN. (2) To build an expert-assisted system based on a deep learning ensemble model through algorithm fusion, to further improve the diagnostic accuracy of DRF and reduce misdiagnosis and missed diagnosis. (3) To compare the diagnostic performance of the deep learning ensemble model with that of clinical doctors. The results of this study further confirm the feasibility of deep learning-based assisted reading technology in clinical diagnosis. This technology is expected to provide a new, accurate and efficient aid for the clinical diagnosis of DRF.

FIGURE 4 Deep learning ensemble model structure diagram.

FIGURE 5 Two detection modes of the deep learning detection model. (A) When a single AP or lateral X-ray image is input, the model judges whether there is a fracture in the distal radius region according to that single radiograph; (B) When the AP and lateral X-ray images of the ipsilateral wrist joint of a patient are input at the same time, a comprehensive diagnosis of the distal radius region is made based on the diagnostic results of the two images. If either the anteroposterior or the lateral image is diagnosed as fractured, the wrist joint is diagnosed as DRF; if no fracture is detected in either the anteroposterior or the lateral image, the distal radius region is judged as unfractured.

FIGURE 7 ROC curves output by the three deep learning models during testing: (A) ROC curve output by RetinaNet; (B) ROC curve output by Faster RCNN; (C) ROC curve output by Cascade RCNN. (Four digits after the decimal point are kept for this experiment to ensure data precision.)

FIGURE 6 Part of the output X-ray diagnosis results. The figures show some of the output results for the test dataset. The algorithm used red rectangles to mark suspicious fractures and provided the corresponding fracture prediction probability values.

FIGURE 9
FIGURE 9 with fractures, 1,147 normal) and 2,283 lateral view images (1,136 with fractures, 1,147 normal).The validation set contains a total of 491 AP view images (246 with fractures, 245 normal) and 488 lateral view images (243 with Flowchart of the entire research process.

TABLE 1
Clinical information of included patients (Diagnosis-based classification).

TABLE 2
Clinical information of included patients (Dataset-based classification).

TABLE 3
Diagnostic performance of each deep learning model.

TABLE 4
Diagnostic performance of medical personnel.

TABLE 5
Comparison between the ensemble model and physicians.

TABLE 6
Summary of the performance of DL in fracture diagnosis.