Deep learning-based automated high-accuracy location and identification of fresh vertebral compression fractures from spinal radiographs: a multicenter cohort study

Background Digital radiography (DR) is a common and widely available examination. However, spinal DR cannot detect bone marrow edema; therefore, identifying vertebral compression fractures (VCFs), especially fresh VCFs, remains challenging for clinicians. Methods We trained, validated, and externally tested a deep residual network (DRN) model that automates the detection and identification of fresh VCFs from spinal DR images. A total of 1,747 participants from five institutions were enrolled in this study and divided into a training cohort, a validation cohort, and two external test cohorts (YHDH and BMUH cohorts). We evaluated the performance of the DRN model based on the area under the receiver operating characteristic curve (AUC), feature attention maps, sensitivity, specificity, and accuracy. We compared it with five other deep learning models and explored whether it remained highly accurate in the external test cohorts. In addition, the influence of old VCFs on the performance of the DRN model was assessed. Results The AUC was 0.99, 0.89, and 0.88 in the validation, YHDH, and BMUH cohorts, respectively, for the DRN model for detecting and discriminating fresh VCFs. The accuracies were 81.45% and 72.90%, sensitivities were 84.75% and 91.43%, and specificities were 80.25% and 63.89% in the YHDH and BMUH cohorts, respectively. The DRN model generated correct activation on the fresh VCFs and accurate peak responses on the area of the target vertebral body parts and demonstrated better feature representation learning and classification performance. The AUC was 0.90 (95% confidence interval [CI] 0.84–0.95) and 0.84 (95% CI 0.72–0.93) in the non-old VCFs and old VCFs groups, respectively, in the YHDH cohort (p = 0.067). The AUC was 0.89 (95% CI 0.84–0.94) and 0.85 (95% CI 0.72–0.95) in the non-old VCFs and old VCFs groups, respectively, in the BMUH cohort (p = 0.051).
Conclusion In the present study, we developed a DRN model for automated diagnosis and identification of fresh VCFs from spinal DR images. The DRN model provides interpretable attention maps that support its prediction results, which is a key concern for most clinicians when using a model to assist decision-making.


Introduction
Vertebral compression fractures (VCFs) are a common clinical problem, with an annual incidence of approximately 1.5 million in the United States (Goldstein et al., 2015; Madassery, 2020). Osteoporosis, trauma, and neoplastic infiltration are the most common causes of VCFs (Mauch et al., 2018). VCFs are associated with back pain, reduced mobility, and spinal kyphosis (Goldstein et al., 2015). In addition, VCFs may lead to a lower health-related quality of life and higher long-term mortality. Therefore, accurate detection and rapid diagnosis of this condition are essential to ensure that patients with the suspected disease receive timely treatment (Clark et al., 2016; Petritsch et al., 2017; Kim KC. et al., 2021).
Percutaneous vertebroplasty has been widely applied to treat osteoporotic VCFs; however, cement augmentation is only effective for fresh VCFs. Therefore, clinicians must diagnose patients with fresh VCFs as early as possible (differentiating fresh VCFs from normal vertebrae and old VCFs), and promptly evaluate the surgical strategy and etiology of fresh VCFs. Magnetic resonance imaging (MRI), particularly T2-weighted imaging with fat suppression (FS T2WI), is regarded as the most reliable imaging method for determining the presence of a fresh fracture owing to its excellent ability and sensitivity to detect vertebral hemorrhages and bone marrow edema in fresh VCFs (Kaup et al., 2016; Frellesen et al., 2018). However, the long appointment cycle, claustrophobia, and contraindications, such as metal implants, are common limitations of MRI (Chen et al., 2022). Furthermore, the high economic burden and discomfort due to loud noise may prevent its widespread application.
In clinical practice, digital radiography (DR) is the frontline imaging examination for the diagnosis of fractures, including limb and spinal fractures, because it is the fastest and most accessible imaging modality (Kim KC. et al., 2021; Kim DH. et al., 2021). However, spinal DR primarily provides information about vertebral morphology and cannot detect acute bone marrow edema caused by fresh VCFs; therefore, identifying fresh VCFs from spinal DR remains challenging for clinicians. Radiomics and deep learning (DL) models have achieved excellent performance in diagnostic applications (He et al., 2020; Mu et al., 2020). In addition, many DL models based on DR have been developed and used to achieve partial functions of MRI and echocardiography (Jeong et al., 2021).
To the best of our knowledge, only a limited number of studies have focused on DL models for identifying fresh VCFs from DR. In this study, we developed a deep residual network (DRN) model that automates the identification of fresh VCFs from spinal DR images and explored whether it remains highly accurate in external test cohorts.

Study design and participants
This multi-institutional retrospective study included eligible participants from five institutions. From November 2016 to July 2023, 1,747 participants who underwent spinal MRI and DR within 2 weeks of back pain were enrolled in this study. We excluded 3,858 participants, including those with more than 1 week between the DR and MRI examinations (n = 1,745) and those without complete raw data and qualified DR/MRI images (n = 2,113). Notably, to improve the robustness of our model, participants with old VCFs, those with suspected malignant VCFs, and those who had undergone surgical treatments, such as vertebroplasty, fixation, and fusion surgery, were not excluded. The study flowchart and detailed inclusion and exclusion criteria are shown in Figure 1 and the Supplementary Appendix (p. 3).
This study was approved by the Ethics Committees of the five hospitals and was conducted in accordance with the ethical principles of the Declaration of Helsinki. We used only preexisting medical data; therefore, the requirement for informed consent was waived for this retrospective study. The work was compliant with the STROCSS criteria (Mathew et al., 2021).

Radiological examination
The participants underwent routine anteroposterior and lateral lumbar spine radiography using six different DR instruments. All MR scans were acquired using a 1.5-T or 3.0-T MR system with a multichannel phased-array spine coil, and FS T2WI was performed using a turbo spin-echo sequence. The image acquisition parameters and details of the DR instruments of each institution are presented in the Supplementary Appendix (p. 3). The DR and MR images were stored and retrieved using a picture archiving and communication system.

Ground truth
The whole spinal DR images, without segmentation or delineation of a region of interest (ROI), were labelled based on spinal MRI. FS T2WI is regarded as the standard reference for identifying fresh VCFs owing to its excellent ability to visualize and characterize high-signal areas and bone marrow edema (Genant and Jergas, 2003; Zhang et al., 2023). The sagittal MRI and lateral DR images were analyzed by at least one radiologist and one spinal surgeon. Any disagreements were resolved through consensus discussions.

Data partition
A total of 1,312 participants from three institutions were divided into a training cohort (n = 1,050; male: 416, 39.62%) and a validation cohort (n = 262; male: 118, 45.04%) based on the patient identification number (first 80% in the training cohort, the remaining 20% in the validation cohort) to train and validate the DRN model. A total of 221 participants (male: 100, 45.25%) and 214 participants (male: 87, 40.65%) from the other two institutions were assigned to the YHDH and BMUH cohorts based on their respective institutions and were categorized as external test cohorts to evaluate the external generalizability of the DRN model.
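The ID-ordered 80/20 split described above can be sketched as follows. This is a minimal illustration, assuming that "first 80%" means the first 80% of participants after sorting by patient identification number; the function name `split_by_patient_id` and the synthetic IDs are hypothetical and do not come from the study.

```python
def split_by_patient_id(patient_ids, train_fraction=0.8):
    """Split IDs into (training, validation) lists by sorted order.

    Assumption: "first 80%" refers to ascending patient-ID order.
    """
    ordered = sorted(patient_ids)
    n_train = round(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


# Synthetic IDs standing in for the 1,312 participants from three institutions.
ids = [f"P{i:05d}" for i in range(1312)]
train_ids, val_ids = split_by_patient_id(ids)
print(len(train_ids), len(val_ids))
```

With 1,312 IDs, rounding 80% gives the reported cohort sizes of 1,050 and 262. A deterministic ID-based split is reproducible, but it is not a random split, so any trend in patient characteristics along the ID order would carry over into the two cohorts.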

Model development
The DRN model was developed based on a previously reported 101-layer residual network (He et al., 2016) and trained on classified spinal DR images of patients in the training cohort. The architecture of the DRN included an input layer, a first layer (a 7 × 7 convolutional layer with 64 output channels), and second to fifth layers (four residual modules), each of which comprised several residual blocks. These were followed by a global average pooling layer (applied to the output of the last residual module) and a fully connected layer (taking the output of the global average pooling layer as input for the classification task).
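To make the layer structure concrete, the sketch below counts the weighted layers of a 101-layer residual network and traces the feature-map size for a 224 × 224 input. It assumes the standard ResNet-101 configuration of He et al. (2016), that is, bottleneck blocks with three convolutions each, block counts of 3, 4, 23, and 3 in the four residual modules, and the usual channel widths. The paper does not list these values, so they should be read as assumptions rather than as the authors' exact settings.

```python
# Assumed standard ResNet-101 layout (He et al., 2016); not stated in the paper.
BLOCKS_PER_MODULE = [3, 4, 23, 3]         # residual blocks in modules 2-5
CONVS_PER_BLOCK = 3                       # bottleneck block: 1x1, 3x3, 1x1 convolutions
MODULE_CHANNELS = [256, 512, 1024, 2048]  # output channels of modules 2-5


def count_weighted_layers():
    """Count the 7x7 stem convolution, all block convolutions, and the FC layer."""
    stem = 1
    residual = CONVS_PER_BLOCK * sum(BLOCKS_PER_MODULE)
    fully_connected = 1
    return stem + residual + fully_connected


def trace_feature_map_sizes(input_size=224):
    """Return (stage name, spatial size, channels) for each stage of the network."""
    stages = []
    size = input_size // 2            # 7x7 convolution with stride 2
    stages.append(("conv1", size, 64))
    size //= 2                        # 3x3 max pooling with stride 2
    for i, channels in enumerate(MODULE_CHANNELS):
        if i > 0:
            size //= 2                # modules 3-5 halve the spatial size
        stages.append((f"module{i + 2}", size, channels))
    stages.append(("global_avg_pool", 1, MODULE_CHANNELS[-1]))
    return stages


print(count_weighted_layers())
for name, size, channels in trace_feature_map_sizes():
    print(f"{name}: {size}x{size}x{channels}")
```

Under these assumptions the count is 1 + 3 × 33 + 1 = 101, and the last residual module outputs a 7 × 7 map with 2,048 channels, which global average pooling reduces to a single 2,048-element vector for the fully connected classifier.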
The input images from the training cohort were resized to 224 × 224 pixels. We augmented and expanded the set of original images using horizontal flipping to reduce overfitting. The normalization process allows training on the image data to converge smoothly, thereby improving the model's stability and training effectiveness. All loss values were summed and used to measure the model's performance. The DRN model was initialized with ImageNet pre-trained parameters. The Adam optimizer was used in this model, and the parameters to be optimized and the learning rate were passed to it. Adam is an adaptive learning rate optimization algorithm that automatically adjusts the learning rate based on the gradients of the parameters, which results in more efficient updates of the model's weights. The learning rate was set to 0.0001, the batch size was 6, and the number of training epochs was 200. All the development processes were performed using the PyTorch framework. Detailed information on the development processes, parameters, software, and packages is presented in the Supplementary Appendix (p. 4).
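The adaptive behaviour of Adam can be illustrated with a single-parameter update in plain Python. This is a sketch of the standard Adam update rule, not the authors' training code. The learning rate of 0.0001 is taken from the text, whereas the decay rates (0.9 and 0.999) and epsilon (1e-8) are assumed default values that the paper does not report.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# The first step has a size close to lr whether the gradient is small or large.
step_sizes = []
for grad in (0.01, 100.0):
    new_param, _, _ = adam_step(param=1.0, grad=grad, m=0.0, v=0.0, t=1)
    step_sizes.append(1.0 - new_param)
    print(grad, step_sizes[-1])
```

Because each step is normalized by the running gradient magnitude, parameters whose gradients differ by several orders of magnitude still move at comparable rates, which is the sense in which the learning rate is adjusted automatically.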

Model test and evaluation
We assessed the diagnostic performance of the DRN model in both the validation and external test cohorts using the best model training weights and the same thresholds. In the present study, the DRN model was compared with previous competitive models, including a 50-layer residual network (ResNet-50) (He et al., 2016), Shufflenet-v2 (Zhang et al., 2018), EfficientnetV2-S (Tan and Le, 2021), EfficientnetV2-M (Tan and Le, 2021), and EfficientnetV2-L (Tan and Le, 2021), to verify the effectiveness of the DRN model. The diagnostic efficacy of the models for identifying fresh VCFs was analyzed. The true positive (TP), true negative (TN), false positive (FP), false negative (FN), accuracy, balanced accuracy, sensitivity, and specificity of the different models were calculated. A TP event was defined as the correct identification of VCFs from X-ray images. Receiver operating characteristic (ROC) curves were used to assess the sensitivity and specificity of the models, and the area under the ROC curve (AUC) was used to describe the discriminative power of the models. In addition, the macro- and micro-averaged ROC curves, precision-recall (PR) curves, and PR AUC were included in the evaluation metrics.
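The threshold-based metrics listed above follow directly from the four confusion-matrix counts, as the short sketch below shows. The counts used in the example are back-calculated so that they agree with the percentages reported for the YHDH cohort (221 participants); they are an inference for illustration, not values read from the paper's confusion matrix, and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and balanced accuracy."""
    sensitivity = tp / (tp + fn)            # fraction of fresh VCFs detected
    specificity = tn / (tn + fp)            # fraction of non-fresh cases cleared
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Counts inferred from the reported YHDH percentages (an assumption).
metrics = classification_metrics(tp=50, tn=130, fp=32, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy by the imbalance between fresh and non-fresh cases in a cohort.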

Subgroup analysis
Both fresh and old VCFs can lead to changes in vertebral morphology. Therefore, in clinical practice, even if changes in vertebral morphology are observed on a participant's spinal DR, clinicians still find it challenging to accurately identify fresh VCFs solely through visual inspection. To assess the impact of the simultaneous presence of old VCFs on the diagnostic results of the DRN model, we grouped participants of the external test cohorts based on whether they had concomitant old VCFs, and a subgroup analysis was conducted to further evaluate the diagnostic efficiency of the model for fresh VCFs.

Visualization
Visualization experiments based on gradient-weighted class activation mapping (Grad-CAM) have been previously conducted to locate and identify the activation area over the input images to explain the feature learning and classification results of the DL models (Panwar et al., 2020; Liu C. et al., 2021). In addition, Grad-CAM technology uses gradient information flowing into the last convolution layer to assess the weight of each neuron in the final decision of the fresh VCFs (Mu et al., 2021).
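The core Grad-CAM computation can be sketched on nested lists without any deep learning framework. Each channel of the last convolutional layer receives a weight equal to the spatial average of the class-score gradients for that channel, the feature maps are summed with these weights, and negative values are clipped to zero. This follows the published Grad-CAM formulation rather than the authors' implementation; the function name and the toy values are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from nested lists.

    feature_maps[k][i][j]: activation of channel k at position (i, j)
    gradients[k][i][j]: gradient of the class score w.r.t. that activation
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel weight = spatial average of the gradients for that channel.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    heatmap = []
    for i in range(height):
        row = []
        for j in range(width):
            value = sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps))
            row.append(max(0.0, value))   # ReLU keeps positive evidence only
        heatmap.append(row)
    return heatmap


# Toy example with two channels of size 2x2 and made-up values.
features = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(features, grads)
print(heatmap)
```

In the toy example, the first channel supports the class and the second opposes it, so only the location activated by the first channel remains in the heatmap. In practice the low-resolution map is upsampled to the input size and overlaid on the radiograph.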

Statistical analysis
All statistical analyses were performed using SPSS Statistics software version 22.0 (SPSS Inc., Chicago, IL, United States of America) and Python version 3.7. We used bootstrap sampling to calculate the confidence intervals directly (n_bootstrap = 1,000), taking the desired percentiles of the bootstrap estimates as the bounds of each confidence interval. Normally distributed continuous variables were presented as means ± standard deviations and categorical variables as percentages (%). The Student's t-test or Mann-Whitney U test was used to compare continuous clinical variables, where appropriate. The chi-squared test or Fisher's exact test was used for categorical variables. Statistical significance was set at a p-value <0.05.
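The percentile bootstrap described above can be sketched for the AUC as follows. The sketch assumes that participants are resampled with replacement, that the AUC is computed as the probability that a positive case scores higher than a negative case, and that resamples containing only one class are redrawn; none of these details is stated in the paper. The value n_bootstrap = 1,000 comes from the text, while the seed, function names, and toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_bootstrap:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # AUC is undefined without both classes; redraw
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_bootstrap)]
    upper = estimates[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lower, upper


# Toy data: 1 = fresh VCF, 0 = non-fresh; scores are made-up model outputs.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
point_estimate = auc(labels, scores)
ci_lower, ci_upper = bootstrap_auc_ci(labels, scores)
print(point_estimate, ci_lower, ci_upper)
```

The pairwise AUC is quadratic in the cohort size, which is acceptable for cohorts of a few hundred participants; a rank-based formula would be preferable for much larger datasets.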

Population characteristics
A total of 1,747 participants (572 fresh VCFs and 1,175 non-fresh VCFs; 721 male and 1,026 female), with a mean age of 62.08 ± 13.92 years, were enrolled in this study (Supplementary Appendix p. 6). The training cohort from the three institutions included 1,050 participants (355 fresh VCFs and 695 non-fresh VCFs; 416 male and 634 female), with a mean age of 62.78 ± 13.50 years. In addition, 228 participants in the training cohort were diagnosed with lumbar fresh VCFs. The validation cohort from the same three institutions included 262 participants (88 fresh VCFs and 174 non-fresh VCFs; 118 male and 144 female), with a mean age of 61.97 ± 13.05 years, and 32 participants in the validation cohort were diagnosed with lumbar fresh VCFs.

Visualization experiments
As shown in Figure 2, we visualized the feature attention maps of different DL models using the Grad-CAM technique, which intuitively illustrated the effectiveness of our DRN model. The DRN model generated correct activation on the fresh VCFs and accurate peak responses on the area of the target vertebral body parts and demonstrated better feature representation learning and classification performance. The attention heatmap is designed to refine attention on the most distinctive region; however, ResNet-50, Shufflenet-v2, and EfficientnetV2-S generated wider, even erroneous, activation areas than the target vertebral body.
In addition, Grad-CAM produced different feature attention maps for the different prediction outcomes of the DRN model. The focused peak responses generated for images with FP predictions were similar to those of the TP images; however, the activated area corresponded to the wrong vertebral body or to an area outside the vertebral body. The DRN model generated a low-intensity activation area or no peak responses for TN and FN images (Figure 3).

Diagnostic performance of the DRN model
The AUC was 0.99, 0.89, and 0.88 in the validation, YHDH, and BMUH cohorts, respectively, for the DRN model for detecting and discriminating fresh VCFs (Figure 4). Moreover, the validation cohort had an accuracy of 96.62%, sensitivity of 93.18%, and specificity of 98.35%. The accuracies were 81.45% and 72.90%, sensitivities were 84.75% and 91.43%, and specificities were 80.25% and 63.89% in the YHDH and BMUH cohorts, respectively (Table 2). The detailed performance of the DRN in different cohorts is shown in the Supplementary Appendix (p. 8, 10). In addition, the results of the DRN with different parameters are shown in the Supplementary Appendix (p. 11, 12).
In the present study, the DRN model was compared with previous DL models regarding the ability to discriminate fresh VCFs to evaluate its effectiveness. The AUCs of ResNet-50, Shufflenet-v2, EfficientnetV2-S, EfficientnetV2-M, and EfficientnetV2-L were 0.89, 0.89, 0.15, 0.29, and 0.22, respectively, in the YHDH cohort and 0.84, 0.84, 0.52, 0.46, and 0.49, respectively, in the BMUH cohort (Figure 5). The accuracy, sensitivity, and specificity of the validation and external test cohorts for each model are listed in Table 2. The ROC curves in the validation cohort and the overall external test cohort for the models are shown in the Supplementary Appendix (p. 13, 14).

FIGURE 2 Attention heatmaps of different models for representative participants. T1-weighted imaging (T1WI), T2WI, FS T2WI, DR image, and attention heatmaps of different models for participants who were predicted successfully in the (A) fresh VCFs and (B) non-fresh VCFs groups. The fresh VCFs are indicated by the red boxes.

The influence of old VCFs on the performance of the DRN model
The influence of old VCFs on the performance of the DRN model in identifying fresh VCFs was assessed in the subgroups. Detailed demographics of each subgroup of the YHDH and BMUH cohorts are shown in Table 3. The AUC was 0.90 (95% confidence interval [CI] 0.84-0.95) and 0.84 (95% CI 0.72-0.93) in the non-old VCFs and old VCFs groups, respectively, in the YHDH cohort (p = 0.067). The AUC was 0.89 (95% CI 0.84-0.94) and 0.85 (95% CI 0.72-0.95) in the non-old VCFs and old VCFs groups, respectively, in the BMUH cohort (p = 0.051) (Table 4). The confusion matrix for each subgroup is shown in Figure 6, and attention heatmaps of the DRN model in the old VCFs group of the external test cohorts are shown in Figure 7.

Discussion
MRI is the imaging modality of choice for identifying fresh VCFs; however, its clinical utility is limited by many factors, including the high cost and time involved. DL and machine learning (ML) models have been explored with the development of artificial intelligence and radiomics for the diagnosis and differential diagnosis of fresh VCFs (Chen et al., 2022). Several studies have shown the promising ability of DL and ML models to meet or surpass the performance of human experts on many survival prediction and medical imaging analysis tasks, involving multiple imaging modalities and application areas (Liu X. et al., 2021). However, DL and ML models have not been utilized in large-scale clinical practice. The gap between the success of these models in research and that in clinical settings must be addressed in the development of every DL model. At least two major factors hinder the translation of DL models into clinical practice: 1) the workload of manual segmentation, which can hardly be accomplished in a clinical workflow, and 2) the low generalizability of the models to external institutions. In this study, we addressed both barriers to clinical translation. We developed and evaluated the DRN model for the diagnosis of fresh VCFs from spinal DR images and found that the AUC for the overall model performance was at least 0.88 in the external test cohorts. The attention heatmaps showed that the DRN model generated the correct activation of the target vertebral body parts and avoided manual segmentation.

Full automation and visualization
To the best of our knowledge, this study is the first to develop and test a fully automated DL diagnosis model for fresh VCFs based on spinal DR images and achieve visualization. The segmentation and delineation of ROI in DL or ML models require considerable time and effort. Therefore, achieving fully automated positioning and detection of fresh VCFs and reducing the complexity of segmentation is necessary. In previous studies on DL models, the diagnosis and detection of fresh VCFs from spinal DR images required segmentation of each vertebral body from the whole thoracic and lumbar vertebrae (Chen et al., 2022; Xu et al., 2023).
In the present study, we achieved fully automated positioning and detection of fresh VCFs with no ROI requirement using our DRN model. The local perception property and local feature extraction capability of four convolution layers endow the DRN model with an ... In addition, we attempted to identify fresh VCFs in the C-arm DR images based on our models; however, we did not achieve accurate detection. Therefore, multicenter analyses with a large number of C-arm images of the spine are required to explore more stable and accurate DL models to determine the surgical segments and prevent surgical accidents.
In addition, many DL models are dedicated to segmenting bones and vertebrae from radiological images (Huang et al., 2020; Wennmann et al., 2022a). Klinder et al. (2009) reported an automated model-based vertebral detection, identification, and segmentation method; this model can achieve an identification success of more than 70% for a single vertebra. Wennmann et al. (2022b) explicitly provided segmentations of vertebrae based on U-Nets, which were trained on data from 106 patients from eight centers. These findings may strengthen our future studies on DL models, particularly those focusing on automatic segmentation and the identification of spinal diseases. For example, by utilizing the hierarchical structure of the DRN model as the encoder and integrating the U-shaped architecture of segmentation models like U-Net, we may achieve a more sophisticated approach that enhances the model's versatility and performance. Such an integrated approach could also open new possibilities for image segmentation problems in various application scenarios.

Robustness and external generalizability
Previous studies have reported that bone signal intensities and radiomic features can markedly deviate between different scanners and protocols of MRI, computed tomography, and radiography (Wennmann et al., 2022c; Wennmann et al., 2022d). In addition, models that are developed and tested on a single cohort may be prone to overfitting, resulting in the final model performing well only for images in that cohort and showing low accuracy in other datasets (Ueda et al., 2023).
In previous studies, only a single dataset was used for the development of the DL model, and the images used for development and testing were obtained from the same dataset (Chen et al., 2022; Xu et al., 2023). The training and test cohorts of our study were collected from different institutions and using different scanners to prevent overfitting and enhance robustness and external generalizability. In this study, the participants in the training cohort were recruited from three hospitals, which improved the robustness of the DRN model. In addition, two external test cohorts were used to assess the external generalizability of the DRN model, and a high AUC (0.89 and 0.88 in the YHDH and BMUH cohorts, respectively) was achieved. Although variations in DR instruments, image acquisition parameters, age, gender composition, and other factors may contribute to the model's slightly superior results in the YHDH cohort, these results still strongly support the stability and excellent performance of our DRN model on external test cohorts.
It is difficult to identify fresh VCFs, particularly to differentiate them from old VCFs, based on DR images in clinical practice. Both fresh and old VCFs can result in similar morphological changes in the vertebral body, such as loss of height, wedging, or deformity. These changes can be visually similar, making it difficult to differentiate between the two types of VCFs based solely on the limited information of DR images. Therefore, multiple normal vertebral columns and old VCFs were included in this study to improve the robustness of the model and evaluate whether the DRN model's detection of fresh VCFs would be affected. The results of the subgroup analysis showed that the AUC of the DRN model decreased in the old VCFs group compared to the non-old VCFs group (in both external test cohorts); however, the p-value between the two subgroups was greater than 0.05 (YHDH, p = 0.067; BMUH, p = 0.051). This indicated that there was no statistically significant difference in the diagnostic performance between the two subgroups, i.e., the diagnostic performance of the DRN model was not significantly affected by the presence of old VCFs. In addition, as shown in Figure 7, there were no peak responses generated in the area of old VCFs, which indicates that the DRN model has learned to

Limitations
This study has some limitations. In general, anteroposterior and lateral radiographs are jointly captured for spinal DR interpretation. However, only lateral views were collected and used to identify fresh VCFs in the present study, and the lack of anteroposterior radiographs may have resulted in the loss of some imaging features in the DR images. In addition, the influence of malignant VCFs on the performance of the DRN model and the ability of the DRN model to distinguish between benign and malignant VCFs were not further evaluated. However, noninvasive differentiation of benign versus malignant VCFs is a potentially important clinical application, particularly for patients with VCFs without typical morphological features.

Conclusion
In the present study, we developed a DRN model for automated diagnosis and identification of fresh VCFs from spinal DR images. The DRN model provides interpretable attention maps that support its prediction results, which is a key concern for most clinicians when using a model to assist decision-making.

FIGURE 3 Attention heatmaps of the DRN model in the external test cohorts. The image on the left is the FS T2WI; the image in the middle is the corresponding DR image; the image on the right is the corresponding attention heatmap of the DRN model. The fresh VCFs are indicated by the red boxes.

FIGURE 4 Performance of the DRN model in the validation and external test cohorts. ROC curves and confusion matrix of the DRN model in (A) the validation cohort, (B) the YHDH cohort, and (C) the BMUH cohort.

FIGURE 6 Performance of the DRN model in the subgroups of the external test cohorts. ROC curves and confusion matrix of the DRN model in (A) the non-old VCFs group and (B) the old VCFs group of the YHDH cohort. ROC curves and confusion matrix of the DRN model in (C) the non-old VCFs group and (D) the old VCFs group of the BMUH cohort.

FIGURE 7 Attention heatmaps of the DRN model in the old VCFs group of the external test cohorts. The image on the left is the DR image of a participant without fresh VCFs who was predicted successfully; the image on the right is the corresponding attention heatmap of the DRN model. The old VCFs are indicated by the yellow boxes.

TABLE 1
Baseline characteristics of cohorts.

TABLE 2
Performance of the models in the validation and external test cohorts.

TABLE 3
Baseline characteristics of the subgroups in the external test cohorts.

TABLE 4
Performance of the DRN model for subgroups in the external test cohorts.