- 1Ramsay Santé, Clinique Geoffroy Saint-Hilaire, Paris, France
- 2Research Department, Cortexx Medical Intelligence, Paris, France
- 3Spine Clinic, Polyclinique Jean Villar, ELSAN Group, Bordeaux, France
Background: Lumbar spine disorders are among the most prevalent and disabling conditions worldwide. Patient selection for surgery remains highly complex, and the benefits of surgical interventions remain uncertain, potentially depending on patients’ baseline health characteristics. Patient-reported outcome measures (PROMs) are a standard method for assessing treatment success in lumbar surgery. The aim of this study was to prospectively validate the accuracy of a deep learning algorithm in predicting the clinical outcomes of patients undergoing lumbar surgery [minimal clinically important difference (MCID)/no-MCID].
Material and methods: This multicentric, longitudinal, prospective study was conducted over a 16-month period (September 2021 to December 2022). Patients with a surgical indication for lumbar decompression were included preoperatively and enrolled in the Surgery Medical Outcomes (SuMO©) mobile application to fill in the preoperative and postoperative data. Patients were classified into two categories according to their postoperative outcomes. The MCID was defined using the Oswestry Disability Index (ODI), combined with the intake of opioids and the presence of motor loss in patients. These results were then compared to the prediction of the algorithm based on preoperative data to determine the accuracy of the algorithm.
Results: A total of 119 patients were enrolled preoperatively, and postoperative follow-up data were obtained for 103 patients. The mean preoperative ODI was 0.43 (SD 0.17). The postoperative mean ODI was 0.28 (SD 0.18) at 1 month and 0.14 (SD 0.16) at 3 months. At 8 months, the mean postoperative ODI in the MCID group was 0.12, while it was 0.26 in the no-MCID group. The algorithm predicted the outcome with an area under the receiver operating characteristic curve of 0.816 (overall accuracy 70%).
Conclusion: This study confirms the validity and accuracy of the algorithm in prospectively predicting postoperative outcomes, as well as the sensitivity of the MCID definition, especially when coupled with remote, patient-centered follow-up. Artificial intelligence-based algorithms may help physicians in their future daily practice by addressing personalized patient care.
Introduction
Lumbar spine pathologies are among the most disabling disorders worldwide, and numerous factors contribute to this pathological condition (1, 2). Precision medicine aims to consider each patient's personal characteristics to select the most appropriate treatment (3).
Lumbar spinal stenosis (LSS) is caused by a gradual narrowing of the spinal canal associated with degenerative changes or disc herniation, and it has become one of the most common indications for spinal surgery (4).
The use of surgery to treat lumbar spinal stenosis has increased over recent decades, as has the precision of surgical procedures (5). These advances aim to improve surgical workflow, patient safety and efficiency, personalized patient care, and, in some cases, minimize the potential risk of future instability and deformity (6).
Patient-reported outcome measures (PROMs) are currently considered the gold standard for evaluating long-term outcomes following spine surgery (7, 8), and they also play a central role in assessing cost-effectiveness across treatment pathways (8). Among these tools, the Oswestry Disability Index (ODI), a validated 10-item questionnaire that quantifies functional disability on a scale from 0 (no disability) to 100 (maximal disability), is one of the most widely used PROMs, offering a comprehensive assessment of the functional status of a patient in daily life. Because of its relevance and sensitivity, the ODI is often incorporated into the definition of the minimal clinically important difference (MCID), a threshold that reflects meaningful clinical improvement after surgery.
The MCID has become a key concept in both clinical evaluation and predictive modeling, particularly in the context of lumbar spine surgery. However, the optimal method for defining MCID remains a topic of debate. Thresholds vary across studies and across different PROM instruments, and no consensus has been reached regarding the most appropriate calculation method (9). Recent research has shown that the definition of MCID—whether based on ODI, COMI, or pain-related scales—can significantly impact the results of predictive models (10).
Collecting standardized PROMs, such as the ODI, with high frequency and accuracy is therefore essential for supporting the development of reliable predictive models using machine learning (ML). Such structured data streams enable advanced algorithms, including deep neural networks (DNNs), to establish associations between preoperative patient characteristics and long-term outcomes and, ultimately, to predict the achievement of MCID (11).
The clinical evolution after spine surgery is a major issue in determining the relevance of care of a patient. The ability to preoperatively predict which patients will achieve an MCID (i.e., a significant improvement after lumbar spine surgery) could improve the relevance of surgical indications and follow-up.
Artificial intelligence (AI) and predictive ML in spine surgery rely on a wide variety of models (12), including random forest (13), gradient boosting machines (GBMs) (14), or artificial neural networks (ANNs) (15, 16). Those models have been used to assess the risks of complications (17), predict patient outcomes (18–20) such as MCID (21), estimate the likelihood of readmission (22), or support surgical decision-making (23), including in certain cases helping evaluate the treatment (24, 25). A recent study compared different predictive models to predict MCID based on the Quality Outcomes Database (QOD Study) (21).
However, the benefits and applications of artificial intelligence tools in everyday clinical practice are not yet fully available to surgeons. AI could be a useful tool for exploring new clinical determinants evaluated by traditional studies (26, 27), as well as through the use of specialized generative AI, such as chatbot assistants. Only a few tools have been validated by the FDA for use in daily practice (28). Most AI/ML tools are used in radiology (29).
To improve patient care and unlock daily innovative tools, studies suggest that a greater emphasis should be placed on the distinctive characteristics of AI/ML when defining new AI/ML-based medical devices (30).
To predict patient long-term quality of life (QoL) in day-to-day practice, we developed a platform that integrates a previously validated deep neural network algorithm (15) with a mobile application. This application prompts patients to fill in medical, paramedical, and socio-professional information to establish a preoperative clinical picture and to complete PROMs at regular intervals to track the evolution of outcomes after surgery.
A previous study on a retrospective cohort showed an accuracy of 0.8 in predicting surgery outcome at 1 month (15).
To evaluate the validity and accuracy of our predictive model, we conducted a prospective study comparing 6-month actual outcomes with the predictions generated by our algorithm.
Methods
Patients were enrolled between September 2021 and July 2022 at two spine centers, in Paris (Clinique Geoffroy Saint-Hilaire) and Bordeaux (Polyclinique Jean Villar), after a preoperative surgical consultation.
After obtaining informed consent, only patients who had not objected to the use of their anonymized health data were included in the analysis.
Inclusion criteria were as follows:
- Adults over 18 years of age.
- Eligible for lumbar decompression surgery, whether instrumented or not.
- Covered by social insurance.
- Provided informed consent.
Exclusion criteria were as follows:
- Patients under 18 years of age.
- Pregnant or breastfeeding women.
- Individuals under safeguard measures or guardianship.
- Arthrodesis involving more than two levels.
- Interventions linked to a traumatic or infectious context.
Ethics statement
This clinical study was conducted in compliance with ethical standards and was approved by the relevant Institutional Review Board(s) and Ethics Committee(s). Specifically, the study protocol received approval from the French Ethics Committee (Comité de Protection des Personnes). The study was also registered with the French Agency for Drugs and Medical Devices under the identifier ID-RCB 2021-A00055-36 and was listed on ClinicalTrials.gov under the registration number NCT05166018.
Written informed consent was obtained from all participants prior to their inclusion in the study. Participants received detailed information regarding the study objectives, procedures, potential risks, and benefits, in accordance with the principles outlined in the Declaration of Helsinki. All data were anonymized to ensure confidentiality and privacy.
All developments carried out were compliant with the General Data Protection Regulation.
Data collection
To evaluate model prediction on standardized data, we developed a dedicated data collection platform, Surgery Medical Outcomes (SuMO), and its associated patient mobile application (see Figures 1, 2).
Figure 1. SuMO application homepage: this screen shows the main activity during the patient's daily use of the SuMO device. Patients can self-evaluate their follow-up by completing personal profile-driven questionnaires, such as PROMs, and see their health trends.
Figure 2. SuMO application activity: this screen shows the main action of the patient during their daily self-care. Patients can access information and advice tailored to their personal data.
SuMO is a CE-marked software as a medical device (SaMD) that uses machine-learning-based predictive models to support clinical decision-making throughout the spine surgery care pathway. The platform aggregates pre-, peri-, and postoperative clinical, radiological, and patient-reported data to (i) estimate individual surgical risk profiles; (ii) suggest optimal surgical strategies; (iii) monitor postoperative recovery through longitudinal PROMs, medication use, and imaging; and (iv) deliver tailored educational content and alerts to patients during follow-up to support safe recovery and improve quality of life.
In the present study, SuMO was used exclusively as a digital platform to collect PROMs and clinical follow-up data; its predictive recommendations were not used to guide or modify clinical care.
To reduce negative effects and raise patient awareness of their actions during the episode of care, data were collected via questions associated with the episode of care (see Figure 1).
Preoperative patient characteristics were collected through this application: demographic characteristics, pathological characteristics, imaging features, conditions (history and comorbidities), drug treatments, socio-professional characteristics, and physiological characteristics (stress, physical activity, etc.) (see Table 2). Postoperative data were collected through the same application at days +3, 15, 30, 45, 60, 75, 90, 120, 150, 180, 210, 240, 270, 300, 330, and 360: PROMs [ODI and numeric rating scale (NRS) pain] (see Figures 3–5), as well as information on resumption of professional activity, sleep, physical activity, and feeding.
Figure 3. NRS leg pain follow-up: points are mean values of NRS leg pain for intervals of days [preop, (3; 15 days), (15; 30 days) …] for all patients in blue, for the MCID group in green, and for the no-MCID group in orange. Polynomial regressions of NRS leg pain are also plotted for the three groups.
Figure 4. NRS low back pain follow-up: points are mean values of NRS low back pain for intervals of days [preop, (3; 15 days), (15; 30 days) …] for all patients in blue, for the MCID group in green, and for the no-MCID group in orange. Polynomial regressions of NRS low back pain are also plotted for the three groups.
Figure 5. Clinical categorical outcome follow-up—points are mean values of categorical outcome declaration for intervals of days [preop, (3; 15 days), (15; 30 days) …] for all patients in blue, for the MCID group in green, and for the no-MCID group in orange. Polynomial regressions of categorical outcomes are also plotted for the three groups.
Perioperative data were collected from medical reports by a clinical research associate (hospitalization, surgical procedure, and intraoperative parameters; see Table 3). In this cohort, 42% of procedures were performed for soft disc herniation with radicular conflict, while the remaining cases involved degenerative posterior facet arthropathy and disc degeneration. The choice of surgical technique (discectomy vs. decompression with or without instrumented fixation) was left to the discretion of the operating surgeon.
Usability was evaluated by the System Usability Scale (SUS) questionnaire (31, 32).
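The text does not describe how the SUS questionnaire was scored. As a minimal sketch, assuming the standard SUS scoring scheme (ten items rated from 1 to 5, odd items scored as response minus 1, even items as 5 minus response, the sum multiplied by 2.5), a score between 0 and 100 can be computed as follows. The function name `sus_score` and the example responses are illustrative, not taken from the study.

```python
def sus_score(responses):
    """Standard SUS scoring for ten responses on a 1-5 scale, in item order."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5  # rescale the 0-40 sum to a 0-100 score


# Hypothetical questionnaires, for illustration only.
neutral = sus_score([3] * 10)   # every item answered with the midpoint
best = sus_score([5, 1] * 5)    # most favorable answer on every item
```

A cohort-level rating would then be the mean of the individual scores.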
Minimal clinically important difference
During our first study, we developed and tested our deep learning algorithm (15) using a minimal clinically important difference (MCID) based on retrospectively collected clinical outcomes, such as walking distance, motor and sensory recovery, anxiety–depression syndrome, and neuropathic pain.
To have a sufficiently relevant MCID in a prospective study and ensure reliable PROM collection, we chose to improve our definition of long-term MCID.
Sensory signs improve slowly; the time needed for this criterion to stabilize is incompatible with the goal of this study, which is to rapidly determine the long-term outcomes of the patient.
Anxiety–depression syndrome and neuropathic pain were collected only during the perioperative period and not during follow-up; therefore, these variables were not retained as criteria.
In this study, surgery success or failure was defined using a composite MCID based on three criteria (see Table 1).
Table 1. MCID used to provide exhaustive information on the general evolution of the patient prospectively.
An MCID is a threshold used to measure the effect of clinical treatments. Various threshold values have been proposed as MCID for different PROM instruments, despite a lack of real consensus on the optimal MCID calculation method (9).
If the patient presented with at least one criterion, the MCID was considered reached.
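The composite rule can be sketched as a small function. The contents of Table 1 are not reproduced in this text, so the three criteria below are borrowed from the Statistical analysis section (ODI < 0.2, absence of motor deficit, no regular use of opioids or anti-inflammatory medications). The class `FollowUp`, its field names, and the `require_all` switch are hypothetical: the switch is there only because the text describes the rule both as "at least one criterion" and as a composite criterion, and either reading can be expressed with it.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """Hypothetical container for one patient's postoperative status."""
    odi: float                 # ODI expressed as a fraction between 0 and 1
    motor_deficit: bool        # persistent motor loss
    regular_analgesics: bool   # regular opioid or anti-inflammatory intake


def criteria_met(f, odi_threshold=0.2):
    """Evaluate the three criteria named in the text, one boolean each."""
    return [f.odi < odi_threshold, not f.motor_deficit, not f.regular_analgesics]


def reached_mcid(f, require_all=False):
    """Apply the rule: any criterion (default) or all criteria (assumption)."""
    met = criteria_met(f)
    return all(met) if require_all else any(met)


# Illustrative patient: low disability, no motor deficit, still on analgesics.
patient = FollowUp(odi=0.12, motor_deficit=False, regular_analgesics=True)
lenient = reached_mcid(patient)                    # "at least one criterion"
strict = reached_mcid(patient, require_all=True)   # all three criteria
```

The two readings disagree for this patient, which illustrates how sensitive group assignment is to the exact wording of the rule.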
AI predictive model
The AI predictive model used in this study was based on a deep learning architecture employing an ANN. The model was initially developed and validated on a retrospective cohort and was later trained using a large dataset of synthetic patient records to enhance training diversity and robustness. These synthetic records were generated from real electronic health records (EHRs), with simulated variations across 68 binary-encoded pre- and perioperative clinical, radiological, and psychosocial variables. Each variable was weighted according to its estimated influence on long-term surgical outcomes. The final synthetic dataset included 10,000 training cases and 2,000 test cases, with a balanced distribution of outcomes (15).
The network architecture comprises multiple hidden layers using rectified linear unit (ReLU) activation functions and a sigmoid output node to classify patients into two prognostic categories: favorable (MCID) and unfavorable (no-MCID). The training was performed using binary cross-entropy as the loss function and the Adam optimizer. Further details on model architecture and training protocol are available in our previous study.
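For readers unfamiliar with this type of network, the forward pass and loss described above can be sketched without any deep learning library. This is an illustration under stated assumptions, not the study's model: the 68 inputs come from the text, but the hidden layer widths (32 and 16), the random weights, and the reading of the sigmoid output as the probability of MCID are hypothetical, and the Adam-based training loop is not shown.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row of W per unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """ReLU on every hidden layer, sigmoid on the single output node."""
    *hidden, output = layers
    for weights, biases in hidden:
        x = dense(x, weights, biases, relu)
    weights, biases = output
    return dense(x, weights, biases, sigmoid)[0]


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


rng = random.Random(0)
sizes = [68, 32, 16, 1]  # 68 inputs from the text; hidden widths are hypothetical
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
patient = [rng.randint(0, 1) for _ in range(68)]  # made-up binary-encoded record
p_mcid = forward(patient, layers)                 # assumed: probability of MCID
loss = binary_cross_entropy(1, p_mcid)
```

With untrained weights the output carries no clinical meaning; the sketch only shows how a binary-encoded record flows through ReLU layers to a single probability.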
For this prospective evaluation, the model was applied without modification to real patient data collected through the SuMO platform. We used a complete-case analysis strategy, restricting the testing dataset to patients with a follow-up period exceeding 90 days, thus allowing for a reliable comparison of outcomes. The model used both preoperative and perioperative variables to predict long-term MCID achievement, supporting its potential for integration into real-time clinical workflows.
Statistical analysis
Data processing and statistical analysis were conducted using Python (v3.9), with Pandas (v1.5.3) and NumPy (v1.24.4) for data manipulation and SciPy (v1.10.1) for statistical testing.
Patients were divided into two groups based on outcomes: MCID and no-MCID, according to a composite criterion (ODI < 0.2, absence of motor deficit, and no regular use of opioids or anti-inflammatory medications). To evaluate differences in clinical characteristics between these two groups, variables were categorized as either continuous or dichotomous.
For continuous variables, Pearson's correlation and linear regression were used to assess intervariable associations, while group comparisons were performed using chi-squared tests of independence. For associations between categorical variables, chi-squared tests were applied. When analyzing relationships between categorical and continuous variables, point–biserial correlations were calculated.
To explore the most relevant features associated with postoperative outcomes, each variable was individually compared between the MCID and no-MCID groups using the tests outlined above. Statistical significance was set at α = 0.05 (two-tailed).
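Two of the tests named above can be written out in a few lines. The study used SciPy, but the text does not name the specific functions or options, so the dependency-free sketch below makes its own choices: a Pearson chi-squared test on a 2×2 table without continuity correction (SciPy's `chi2_contingency` applies Yates' correction to 2×2 tables by default, so its p-values would differ slightly), and the point–biserial correlation computed as a Pearson correlation with a 0/1 group indicator. The counts and values are invented for illustration.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test of independence for the table [[a, b], [c, d]].

    No continuity correction. Returns (statistic, p_value) for 1 degree of
    freedom, using the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, math.erfc(math.sqrt(stat / 2.0))


def point_biserial(group, values):
    """Pearson correlation between a 0/1 indicator and a continuous variable."""
    n = len(values)
    mean_g = sum(group) / n
    mean_v = sum(values) / n
    cov = sum((g - mean_g) * (v - mean_v) for g, v in zip(group, values))
    sd_g = math.sqrt(sum((g - mean_g) ** 2 for g in group))
    sd_v = math.sqrt(sum((v - mean_v) ** 2 for v in values))
    return cov / (sd_g * sd_v)


# Hypothetical 2x2 table: rows = MCID / no-MCID, columns = factor present / absent.
stat, p_value = chi2_2x2(20, 9, 10, 22)
significant = p_value < 0.05  # alpha from the text

# Hypothetical continuous predictor against group membership (1 = MCID).
r_pb = point_biserial([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
```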
For postoperative longitudinal variables such as ODI, NRS for leg pain (NRS-LP), NRS for back pain (NRS-BP), and other PROMs collected at regular intervals, linear interpolation was applied to impute missing intermediate values when at least two surrounding data points were available. No extrapolation was performed beyond the final available patient input; patients who stopped entering follow-up data were excluded from further timepoint analyses.
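The imputation rule described above (fill gaps between two recorded values, never extend past the last recorded value) can be illustrated with a short function. This is a dependency-free sketch with a hypothetical function name and made-up ODI values; with pandas, which the study used, a comparable result might be obtained with `Series.interpolate(method="index", limit_area="inside")`, although the text does not state which call was used.

```python
def interpolate_inside(days, scores):
    """Fill None gaps by linear interpolation between recorded neighbours.

    Gaps before the first or after the last recorded value stay None,
    so nothing is extrapolated.
    """
    filled = list(scores)
    known = [i for i, s in enumerate(scores) if s is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (days[i] - days[left]) / (days[right] - days[left])
            filled[i] = scores[left] + frac * (scores[right] - scores[left])
    return filled


# Hypothetical patient who skipped day 30 and stopped answering after day 60.
days = [0, 30, 60, 90, 120]
odi = [0.40, None, 0.20, None, None]
filled = interpolate_inside(days, odi)
```

Day 30 is imputed from its two neighbours, while days 90 and 120 remain missing, which matches the exclusion of patients from later timepoints once they stop entering data.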
To assess model performance, predictions from the ANN were compared with actual outcomes using standard classification metrics: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and global accuracy. A receiver operating characteristic (ROC) curve was generated, and the area under the curve (AUC) was calculated to evaluate overall model discrimination. A confusion matrix was constructed to assess prediction performance across both classes.
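As a minimal sketch of these two steps, the code below builds a confusion matrix from thresholded predictions and computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half), which is equivalent to the area under the ROC curve. The labels, scores, and the 0.5 decision threshold are invented for illustration; the text does not report the threshold or which class was treated as positive.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and model scores for six patients.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because the AUC depends only on the ranking of scores, it is unaffected by the choice of threshold, whereas the confusion matrix is not.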
TRIPOD-AI
The present article follows the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines for reporting the development and validation of the prediction model (33).
Results
Population
A total of 174 patients were included in this study. Three patients who were operated on in another center were excluded. An additional 24 patients were lost to follow-up, and 15 patients experienced technical problems (email address problem, no internet access, or not having a smartphone). Thirteen patients ultimately did not undergo surgery and changed their treatment strategy (see Figure 6).
A total of 119 patients were followed to the end of the study with complete EHRs. Table 2 presents the baseline preoperative characteristics of patients, and Table 3 presents the perioperative characteristics of patients recorded during back surgery. For each variable, the mean, median, standard deviation, and number of respondents are reported.
Table 3. Baseline perioperative characteristics of patients: these data were extracted from medical reports by clinical research associates.
Of the 119 patients included, 61 had 90 days of follow-up and provided data. Among these, 29 (47%) patients responded well to surgery (MCID, as defined in Table 1) and 32 (53%) patients did not (no-MCID).
Follow-up
ODI
ODI scores of patients are presented in Figure 7.
Figure 7. Oswestry Disability Index follow-up—points are mean values of ODI for intervals of days [preop, (3; 15 days), (15; 30 days) …] for all patients in blue, for the MCID group in green, and for the no-MCID group in orange. Polynomial regression of ODIs is also plotted for the three groups.
In the overall population, ODI showed a continuous decrease from 43 to 17 at 90 days (−60%) and then stabilized at 8 months after surgery, showing the global positive impact of the surgical procedure on the long-term QoL.
ODI scores decreased continuously from baseline to 90 days after surgery, falling from 48 to 26 in the no-MCID group (−46%) and from 38 to approximately 7 in the MCID group (−82%). The two groups then behaved differently: in the no-MCID group, ODI stabilized at approximately 27 by 240 days (−44% overall), whereas in the MCID group, ODI continued to decrease, reaching approximately 5 (−87% overall).
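The percentages quoted for ODI are relative changes from the preoperative mean, that is, (follow-up − baseline) / baseline. The short check below recomputes them from the rounded means given in the text; the dictionary labels are descriptive only, and small discrepancies would be expected if the published percentages were computed from unrounded means.

```python
def percent_change(baseline, follow_up):
    """Relative change from baseline, in percent (negative = improvement)."""
    return 100.0 * (follow_up - baseline) / baseline


# (baseline ODI, follow-up ODI) pairs as quoted in the text.
reported = {
    "all patients, 90 days": (43, 17),
    "no-MCID, 90 days": (48, 26),
    "MCID, 90 days": (38, 7),
    "no-MCID, 240 days": (48, 27),
    "MCID, late follow-up": (38, 5),
}

changes = {label: round(percent_change(b, f)) for label, (b, f) in reported.items()}
```

The recomputed values are −60, −46, −82, −44, and −87, in agreement with the figures quoted above.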
Pain
NRS pain scores of patients are presented in Figures 3, 4.
Leg pain decreased in both groups until 2 months, reaching 2.3 in the MCID group (−64% from baseline) and 3.3 in the no-MCID group (−43%). Beyond that point, pain scores stabilized and slowly decreased until 8 months, reaching approximately 1.5 in the MCID group (−76% overall) and 3.2 in the no-MCID group (−45%) (see Figure 10 in the Appendix).
Low back pain decreased in both groups until the 3-month mark, reaching 1.9 in the MCID group (−63% from baseline) and 3.3 in the no-MCID group (−38%). After 3 months, average low back pain began to rise again, reaching 2.2 at 8 months in the MCID group (−57% overall) and 4.3 in the no-MCID group (−19% overall). The overall trends were similar in both groups; however, the rebound was weaker in the MCID group.
Clinical outcomes
Categorical outcomes after surgery are presented in Figure 5.
Nonsteroidal anti-inflammatory drugs (NSAIDs) and opioid intake decreased quickly to under 10% at 3 months and remained globally stable up to 8 months. Motor loss and leg tingling were relatively stable and oscillated around 40%, with a tendency to decrease toward 30% by 8 months.
Feeding returned to near 100% after 1 month, while sleep quality improved rapidly and stabilized at approximately 80% after 2–8 months. Return to physical activity was achieved by 60% of the population after 3 months.
Prediction of the model
The model showed an accuracy of 70%, with an area under the ROC curve (AUC) of 0.816. Sensitivity was 81%, specificity was 59%, PPV was 68%, and NPV was 74% (see Figure 8 and Table 4).
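The five metrics are linked through the confusion matrix, so they can be checked for internal consistency. The counts below are not reported in this text; they are one set of integers, found by arithmetic, that reproduces the rounded percentages for the 61 evaluated patients (29 MCID, 32 no-MCID) under the assumption that no-MCID was treated as the positive class. Both the counts and that assumption should be verified against Table 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four confusion-matrix cells."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts (assumption: positive class = no-MCID, 32 patients;
# negative class = MCID, 29 patients). Not taken from the article.
counts = {"tp": 26, "fn": 6, "tn": 17, "fp": 12}

metrics = classification_metrics(**counts)
rounded = {name: round(100 * value) for name, value in metrics.items()}
```

Under these assumptions the rounded values are 81, 59, 68, 74, and 70, matching the reported sensitivity, specificity, PPV, NPV, and accuracy. Taking MCID as the positive class with the same group sizes does not reproduce all five figures at once.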
Risk factors and characteristics for MCID or no-MCID outcomes
In Table 5, the criteria are displayed for the two groups of patients (MCID/no-MCID). Variables are sorted in ascending order of p-values to highlight the most prominent criteria for the long-term prognosis (MCID/no-MCID) of the patients. These criteria were practicing a specific sport, stenosis from imaging data, motor loss during the episode, arthritis from imaging data, stress and family conflict, metabolic equivalent of task (MET) of 10, MET between 7 and 10, MET between 4 and 7, MET <4, appetite disorder(s), and history of previous surgery (p < 0.05).
Table 5. Characteristics of MCID and no-MCID participants on continuous (t) and categorical predictors (χ2).
Satisfaction of platform users
The SUS questionnaire analysis shows a general satisfaction rating of 68.5 among users.
In total, patients spent 60 h on the questionnaires, an average of 30 min per patient, to enter the equivalent of 178 health data points per patient during the 6-month follow-up.
Discussion
The population analyzed in this prospective study is consistent with that observed in recent retrospective analyses of lumbar spine surgery outcomes, both in baseline characteristics and postoperative trajectories (34–36). The 6-month clinical outcomes we report align with trends described in the literature for similar interventions and follow-up durations (37–39).
In our cohort, a marked divergence was observed between the MCID and no-MCID groups, reflecting a real dichotomy in postoperative recovery, particularly in the return to patient autonomy—a finding consistent with previously published large-scale cohorts (40).
Among the outcome measures, leg pain (NRS-LP) demonstrated a stronger correlation with improvements in quality of life than low back pain (NRS-BP) (see Figure 9 in the Appendix). This result aligns with prior assumptions regarding the prognostic value of radicular symptom resolution (41, 42). Although the pain delta (the absolute change in NRS score) can also serve as an indicator, its subjectivity and variability across individuals limit its reliability as a universal MCID criterion. By contrast, the ODI, which integrates functional impairment into day-to-day activities, provides a more stable and less biased measure of true postoperative recovery.
The correlation between analgesic intake and pain levels further reinforces the validity of our MCID definition. In the MCID group, pain reduction was accompanied by decreased or discontinued opioid and NSAID use, whereas persistent pain and continued drug intake in the no-MCID group reflected a more degraded quality of life. These distinctions support the use of a composite MCID definition—combining functional, symptomatic, and treatment-related parameters—to better stratify long-term surgical success.
Importantly, this definition also allowed clear segregation of postoperative categorical outcomes. Variables such as resumed physical activity, sleep recovery, persistent neurological symptoms, and analgesic consumption exhibited different behaviors in the MCID vs. no-MCID groups. These findings highlight that relying on a single dimension (e.g., pain score alone) may be insufficient and that multicriteria outcome models are essential for capturing the complex nature of recovery after spine surgery.
We also identified several individual preoperative characteristics significantly associated with favorable outcomes. Higher physical activity (as measured by the MET score), engagement in specific sports, and the absence of stress-related symptoms (e.g., family conflicts, appetite or digestive disorders) were predictive of MCID achievement. Conversely, patients with prior spine surgery or chronic comorbidities were more likely to fall into the no-MCID group. Notably, work-related variables did not show a significant predictive value in our cohort, despite their prominence in some published models.
The predictive performance of our ANN model is consistent with prior efforts to classify patients into MCID and no-MCID groups (10). However, the definition of MCID remains an ongoing point of discussion. The selection of thresholds, outcome scales (e.g., ODI vs. COMI vs. NRS), and timepoints can significantly influence predictive capacity (43, 44). Our model was validated at 6 months, a commonly accepted intermediate endpoint used in spine surgery, but longer-term follow-up (12–24 months) will be necessary to confirm the durability of both outcomes and predictions.
One of the strengths of our approach lies in the high-frequency, standardized collection of PROMs through a dedicated mobile platform. This enabled consistent data acquisition across patients and timepoints—an advantage not commonly found in previous studies relying on traditional registry-based designs. The high level of patient engagement, demonstrated by good response rates and usability scores, further supports the feasibility of embedding digital PROMs into real-world clinical pathways.
Nevertheless, several limitations must be acknowledged. First, the model was trained partly on synthetic data derived from retrospective EHRs. While this approach allowed for increased data volume and improved balance, synthetic data may not fully capture the variability of real clinical scenarios. Retraining the model on larger, fully prospective datasets would improve its generalizability and robustness. Second, although our model showed strong performance at 6 months (AUC 0.82), this time frame may be too short to capture delayed outcomes such as relapse, reoperation, or long-term functional plateau. Ongoing follow-up will allow us to assess its predictive value beyond early postoperative recovery.
Third, the sample size of our cohort remains limited for generalization. Although our results are comparable to those reported in smaller validated studies (AUC ∼0.83) (41), large-scale applications—such as those tested on broader registries—often show reduced accuracy (AUC ∼0.63) due to heterogeneous and incomplete data (10). Our approach, relying on harmonized and curated PROMs, may help mitigate these limitations, but external validation remains essential.
Alternative modeling approaches, such as logistic regression, could be explored to compare performance against ANN models, particularly given our cohort size and feature volume. However, our objective was not only to validate performance but also to build a scalable and evolvable model. Future iterations will integrate additional layers of data including imaging, genomic, and biological biomarkers (45), thus enhancing model personalization and long-term predictive strength.
The broader goal of our platform is to support a more individualized surgical decision-making process. By combining structured data collection with predictive modeling, clinicians can better anticipate outcomes, refine indications, and tailor postoperative monitoring. This anticipatory strategy may help reduce therapeutic failures—defined here as no-MCID outcomes—which are closely associated with complications, extended recovery, and patient dissatisfaction.
Such efforts align with initiatives like the International Spine Study Group (ISSG), which have shown that predictive models improve risk stratification and reduce uncertainty in surgical planning (46). AI-driven tools can complement surgeon judgment by identifying latent patterns in complex datasets, thereby enhancing both precision and confidence in care decisions (47).
Conclusion
We found that our model demonstrated good performance in classifying long-term MCID outcomes among patients undergoing low back surgery. We also showed that an MCID definition incorporating multiple clinical outcomes is crucial for classifying patients into groups that best fit their evolution, not only in terms of pain or autonomy but also regarding other outcomes such as resumption of sleep, physical activity, or intake of medication. Thus, high-frequency data collection via a smartphone application represents an efficient approach for determining the long-term quality of life of patients.
The artificial neural network model used in this clinical study lends itself to holistic and multimodal prediction of long-term patient outcomes. Nonetheless, the ANN does not provide an explanation of the criteria leading to the classification of patients.
The model we developed can help patients understand their prognosis and can help tailor pre- and postoperative programs aimed at improving long-term quality of life. It is an essential building block in the anticipated construction of a coherent health pathway, in terms of timing and methods, raising the standard of care with the help of artificial intelligence. A new prospective study is necessary to test the medical service to patients and clarify the benefits of such a solution in the health pathway of patients undergoing spine surgery.
In conclusion, this study validates the real-world predictive capacity of an ANN model integrated into a mobile PROMs platform to anticipate the achievement of MCID after lumbar spine surgery. While further prospective validation and model refinement are necessary, this work illustrates a practical approach to integrating AI into daily surgical practice, promoting more personalized, data-driven, and adaptive spine care.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.
Ethics statement
The studies involving humans were approved by the Comité de Protection des Personnes, a national committee. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.
Author contributions
AA: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. BP: Data curation, Formal analysis, Funding acquisition, Software, Writing – review & editing. J-JV: Funding acquisition, Methodology, Project administration, Writing – review & editing. LB: Data curation, Investigation, Writing – review & editing. IO: Data curation, Investigation, Writing – review & editing.
Funding
The author(s) declare that financial support was received for this work and/or its publication. Grants were received from Malakoff Humanis Innovation Santé.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that Generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence, and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnote
Abbreviations: PROMs, patient-reported outcome measures; ODI, Oswestry Disability Index; MCID, minimal clinically important difference; NRS, numeric rating scale; NRS-LP, NRS for leg pain; NRS-BP, NRS for back pain; QoL, quality of life; SUS, System Usability Scale; AI/ML, artificial intelligence/machine learning; ANN, artificial neural network; DNN, deep neural network; ROC, receiver operating characteristic; AUC, area under the curve; PPV, positive predictive value; NPV, negative predictive value; LSS, lumbar spinal stenosis; LOS, length of stay; ASA, American Society of Anesthesiologists; NYHA, New York Heart Association; EHR, electronic health record; BMI, body mass index; MET, metabolic equivalent of task; COPD, chronic obstructive pulmonary disease; TRIPOD, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis; SuMO©, Surgery Medical Outcomes (a data collection platform); ReLU, rectified linear unit.
Appendix
In this Appendix, Figure 9 shows the correlation between pain intensity (NRS for leg pain and low back pain) and disability (ODI). Figure 10 shows the longitudinal evolution of the ODI score up to 400 days after surgery.
Figure 9. Correlation between pain intensity and disability. Scatter plot showing the relationship between ODI and NRS pain scores (leg pain and low back pain). Linear regression trendlines with associated statistics (R², p-value, and intercept) are displayed.
Figure 10. Longitudinal evolution of ODI after surgery. Scatter plot of ODI values over time (up to 400 days postoperatively). A fitted curve illustrates the overall trend, and dashed lines indicate mean ODI values across predefined postoperative time windows (preoperative, <15 days, <1 month, <3 months, and >5 months).
Keywords: spine surgeries, artificial intelligence, machine learning, patient-reported outcome measurement (PROM), minimal clinically important differences (MCID)
Citation: André A, Peyrou B, Vignaux J-J, Boissière L and Obeid I (2026) Validity and accuracy of a machine learning predictive model in the exploitation of patient-related outcomes in spine surgery. Front. Surg. 12:1710512. doi: 10.3389/fsurg.2025.1710512
Received: 22 September 2025; Revised: 18 November 2025; Accepted: 29 November 2025; Published: 9 January 2026.
Edited by:
Luca Ricciardi, Sapienza University of Rome, Italy
Reviewed by:
Marilou Cavaliere, Azienda Ospedaliera Fatebenefratelli e Oftalmico, Italy
Nikolay Gabrovsky, UMHATEM N. I. Pirogov, Bulgaria
Copyright: © 2026 André, Peyrou, Vignaux, Boissière and Obeid. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jean-Jacques Vignaux
†These authors have contributed equally to this work