- 1KTH MoveAbility, Department of Engineering Mechanics, KTH Royal Institute of Technology, Stockholm, Sweden
 - 2Human Performance Laboratory, Faculty of Kinesiology, University of Calgary, Calgary, AB, Canada
 - 3Faculty of Sports Science, Ningbo University, Ningbo, China
 - 4Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
 - 5Department of Engineering Science and Biomedical Engineering, University of Auckland, Auckland, New Zealand
 - 6Department of Women’s and Children’s Health, Karolinska Institute, Stockholm, Sweden
 
Machine learning (ML) has emerged as a powerful tool to analyze gait data, yet the “black-box” nature of many ML models hinders their clinical application. Explainable artificial intelligence (XAI) promises to enhance the interpretability and transparency of ML models, making them more suitable for clinical decision-making. This systematic review, registered on PROSPERO (CRD42024622752), assessed the application of XAI in gait analysis by examining its methods, performance, and potential for clinical utility. A comprehensive search across four electronic databases yielded 3676 unique records, of which 31 studies met inclusion criteria. These studies were categorized into model-agnostic (n = 16), model-specific (n = 12), and hybrid (n = 3) interpretability approaches. Most applied local interpretation methods such as SHAP and LIME, while others used Grad-CAM, attention mechanisms, and Layer-wise Relevance Propagation. Clinical populations studied included Parkinson’s disease, stroke, sarcopenia, cerebral palsy, and musculoskeletal disorders. Reported outcomes highlighted biomechanically relevant features such as stride length and joint angles as key discriminators of pathological gait. Overall, the findings demonstrate that XAI can bridge the gap between predictive performance and interpretability, but significant challenges remain in standardization, validation, and balancing accuracy with transparency. Future research should refine XAI frameworks and assess their real-world clinical applicability across diverse gait disorders.
1 Introduction
Gait is a complex motor activity governed by neuromuscular coordination and biomechanics, and it serves as a key indicator of an individual’s overall health status. In clinical practice, gait analysis is widely used to detect pathological changes such as freezing of gait in Parkinson’s disease, post-stroke hemiparetic gait, or compensatory patterns following musculoskeletal injury (Filtjens et al., 2021; Apostolidis et al., 2023). For example, approximately 60%–80% of stroke survivors experience gait impairments (Cirstea, 2020), and more than 80% of individuals with Parkinson’s disease develop gait disturbances during the disease course (Faerman et al., 2025). Thus, gait analysis has become a critical tool across clinical, sports, and rehabilitation contexts, enabling the assessment of movement patterns and functional impairments (Winter, 2009). As populations age and mobility-related issues become more prevalent, the need for accurate and comprehensive gait evaluation methods continues to grow.
Marker-based motion capture systems have been the gold standard in gait analysis for decades, offering comprehensive tracking of whole-body kinematics with high temporal and spatial resolution. In addition to these systems, gait analysis has employed various technologies such as electromyography, pressure mats, wearable sensors, and bi-planar fluoroscopy. While bi-planar fluoroscopy provides high-precision tracking of skeletal motion, its application is limited to small capture volumes and specific anatomical regions. Moreover, the high cost, limited accessibility, and radiation exposure associated with fluoroscopy restrict its suitability for routine clinical use (Kessler et al., 2019).
Despite their accuracy, optical marker-based systems are typically limited to controlled laboratory environments, reducing their feasibility for broader, real-world assessments (Chen et al., 2016). Recently, markerless motion capture technology has emerged as a promising alternative, using computer vision algorithms to track body movements without physical markers (Wade et al., 2022; Uhlrich et al., 2023). While markerless systems avoid issues related to marker placement, they often struggle with accurate pelvis tracking and their performance depends heavily on the training data, limiting applicability to populations with atypical gait, such as prosthesis users. Wearable sensors, such as accelerometers and gyroscopes, enable data collection in real-world settings but face challenges including signal drift, calibration errors, and soft tissue artifacts, which can compromise accuracy (Chen et al., 2016; Xiang et al., 2022b; Xiang et al., 2023). While these technologies have advanced the field considerably, each has limitations in accuracy, accessibility, and cost, which constrain their widespread use and the ability to comprehensively understand complex gait patterns.
Machine learning (ML) has become a powerful and transformative tool in biomechanics to address some of the limitations of traditional gait analysis methods (Halilaj et al., 2018; Xiang et al., 2022b; Xiang et al., 2025). ML algorithms can process large, high-dimensional datasets to extract meaningful features, enabling accurate classification of walking patterns, detection of gait abnormalities, and prediction of joint mechanics and clinical outcomes (Ferber et al., 2016; Halilaj et al., 2018; Phinyomark et al., 2018; Harris et al., 2022; Xiang et al., 2022a; Xiang et al., 2023; Xiang et al., 2024; Gao et al., 2023; Mekni et al., 2025a; Mekni et al., 2025b). By automating the analysis process, ML reduces reliance on manual interpretation, offering a more efficient and scalable approach. However, these advancements come with challenges, primarily concerning the ‘black-box’ nature of many ML models, which prioritize predictive accuracy at the expense of transparency, in contrast to inherently interpretable ‘clear-box’ models such as linear regression or simple decision trees. This lack of interpretability raises concerns about their reliability, particularly in clinical contexts where transparency is critical for informed decision-making (Adadi and Berrada, 2018; Salahuddin et al., 2022; Yang et al., 2022; Albahri et al., 2023; Frasca et al., 2024).
Explainable Artificial Intelligence (XAI) has the potential to bridge the interpretability gap by providing insights into how and why ML models make specific predictions. In healthcare, XAI has become a prominent tool, particularly for enhancing the trustworthiness and explainability of AI-driven outcomes (Adadi and Berrada, 2018; Tjoa and Guan, 2020; Vilone and Longo, 2021; Loh et al., 2022; Minh et al., 2022; Salahuddin et al., 2022; Yang et al., 2022; Albahri et al., 2023), though its application in gait analysis remains relatively new. However, concerns have been raised that post hoc explanations may sometimes be misleading or provide only superficial insights, risking bias or false reassurance if not carefully validated (Ghassemi et al., 2021). Addressing these concerns, XAI techniques have the potential to identify which features most influence model outputs, thereby supporting clinicians and researchers in interpreting gait patterns and making more informed decisions (Harris et al., 2022). By improving model transparency, XAI not only fosters trust but also facilitates the discovery of novel biomechanical insights.
Techniques such as Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) and Shapley Additive Explanations (SHAP) (Lundberg and Lee, 2017) can be applied to a wide range of machine learning models to identify influential input features. This flexibility is particularly valuable in gait analysis, where such methods can reveal biomechanical patterns underlying model predictions and enhance the interpretability of complex algorithms (Dindorf et al., 2020; Kim et al., 2022; Teoh et al., 2024). Apart from feature importance, attention maps applied to time-series data also show promise in providing insights by highlighting significant time points or features during the processing of sequential information within neural networks (Xiang et al., 2024). Other XAI methods, such as Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017), have proven useful in highlighting the regions of gait video data that contribute most to model predictions, offering visual explanations that can be especially informative for clinicians (Slijepcevic et al., 2023; Martínez-Pascual et al., 2024; Teoh et al., 2024). Counterfactual explanations can also be applied to demonstrate how small changes in gait characteristics, such as joint angles or stride length, could affect model outputs, allowing for a deeper understanding of decision boundaries in gait anomaly detection (Dou et al., 2023). Although these methods have shown promise, their application in gait analysis remains limited, indicating a need for further exploration into which techniques best suit the specific demands of gait-related data and tasks.
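To make these attribution mechanics concrete, the sketch below applies SHAP to a tree-based classifier trained on tabular gait parameters. It is a minimal illustration only: the feature names, synthetic data, and model are hypothetical stand-ins, not taken from any study reviewed here.

```python
# Minimal sketch: SHAP attributions for a gait classifier.
# Feature names and data are illustrative stand-ins only.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["stride_length_m", "cadence_steps_min", "stance_time_s",
            "peak_vGRF_BW", "hip_flexion_rom_deg"]
X = rng.normal(size=(200, len(features)))
# Synthetic labels: "pathological" gait driven by stride length and vGRF.
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes exact Shapley values for tree ensembles.
sv = shap.TreeExplainer(model).shap_values(X)
# Older shap versions return one array per class; newer return one 3D array.
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]

# Local explanation for one sample, and a global ranking by mean |SHAP|.
print(dict(zip(features, np.round(sv_pos[0], 3))))
for name, imp in sorted(zip(features, np.abs(sv_pos).mean(axis=0)),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```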
Despite recent advancements, significant gaps remain in the application of XAI to gait analysis. Current ML applications in gait analysis emphasize predictive accuracy over interpretability, creating a gap that limits their clinical utility (Harris et al., 2022; Frasca et al., 2024). This lack of transparency can hinder the adoption of ML in clinical gait analysis, reducing its potential impact. Thus, there is a need for a comprehensive review of XAI approaches in gait analysis to assess their current capabilities, limitations, and areas for future improvement. By examining the intersection of XAI and gait analysis, this review aims to highlight the opportunities for advancing this field and uncover the potential of XAI to enhance decision-making in gait-related research and clinical practice.
2 Methods
The protocol for this systematic review was designed following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009) to ensure methodological rigor and transparency. Additionally, the review protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42024622752).
2.1 Search strategy
A systematic search was conducted across four electronic databases—Scopus, Web of Science, IEEE Xplore, and PubMed—covering the period from January 2000 to October 2024. The search employed keywords combined with Boolean operators, as outlined in Table 1. To ensure comprehensive coverage, the bibliographies of relevant academic articles were also reviewed for additional studies. Titles, abstracts, and full texts of the retrieved records were carefully screened to assess their relevance.
  Table 1. Boolean search strings employed for the corresponding bibliographic databases and search engines.
2.2 Eligibility criteria
Eligibility criteria were defined based on the Participants, Intervention, Comparisons, and Outcomes (PICO) framework to ensure systematic and consistent data extraction. Extraction focused on population characteristics (e.g., sample size, gender, age, health condition), gait analysis method, ML or deep learning model used, explainability methods (e.g., SHAP, LIME), ML task (e.g., classification, regression), outcomes measured, and performance metrics.
The selection process was conducted independently by two reviewers (L.X. and Z.G.). Disagreements regarding study inclusion were resolved through discussion, and if consensus could not be reached, a third reviewer (J.F.) made the final decision. Studies were excluded if they: (1) did not incorporate any explainability methods (e.g., SHAP, LIME) applied to black-box models, or relied solely on inherently interpretable models (e.g., linear regression) without additional XAI components; (2) focused exclusively on animal gait or used animal models without relevance to human gait analysis; (3) lacked methodological details or experimental data; (4) did not use quantitative measures to evaluate gait characteristics or relied solely on qualitative approaches without empirical analysis; or (5) did not report performance metrics for the predictive models or failed to achieve a baseline level of validity, since explanations derived from poorly performing models would not provide meaningful or reliable insights into gait biomechanics. Search results from each database were imported into EndNote X9 (Thomson Reuters, California, USA) to manage references and streamline the screening process.
2.3 Risk of bias assessment
The risk of bias was assessed using a modified Downs and Black checklist (Downs and Black, 1998) adapted for sports science, healthcare, and rehabilitation studies and supplemented with XAI-specific items (Table 2). Two reviewers (L.X. and Z.G.) independently evaluated study quality, achieving >85% initial agreement. Given the small number of included studies, a formal inter-rater statistic (e.g., Cohen’s κ) was not calculated. The evaluation consisted of 12 distinct criteria, each rated on a scale of 0 (no), 1 (maybe), or 2 (yes), with a cumulative score ranging from 0 to 24. To ensure objectivity and standardization in quality assessment, the scores were converted to a percentage scale ranging from 0% to 100%.
3 Results
3.1 Search results
After removing 852 duplicate records, 3676 unique records were retained for further screening (Figure 1). During the initial screening phase, 3449 records were excluded for being irrelevant or failing to meet the inclusion criteria. The remaining 227 records were assessed in full, of which 196 were excluded for specific reasons: lack of XAI approaches (n = 56), absence of human participants (n = 26), reliance on black-box models without interpretability (n = 29), insufficient methodological details or experimental data (n = 18), lack of quantitative analysis (n = 31), use of inherently interpretable models rather than post hoc explainability methods (n = 3), dimensionality-reduction–based latent space analysis without interpretability (n = 1), and low-performance predictive models (n = 32). Ultimately, 31 studies were included in the qualitative synthesis.
  Figure 1. The Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) flow diagram of the study selection process.
3.2 Quality assessment
As summarized in supplementary data (Supplementary Table S1), the quality assessment scores of the 31 included studies ranged from 83.3% to 91.7%, with an average score of 88.7%. All studies followed standardized experimental design protocols, with an average methodology score of 1.82, and provided clear explanations of XAI techniques, achieving an average score of 1.58.
3.3 Study characteristics
The included studies were categorized into three primary methodological types of ML interpretability within gait analysis (Figure 2): model-agnostic (n = 16), model-specific (n = 12), and hybrid (n = 3).
  Figure 2. Types of interpretable machine learning models used in the included studies. Note: XAI: Explainable Artificial Intelligence, LRP: Layer-wise Relevance Propagation, Grad-CAM: Gradient-weighted Class Activation Mapping, SFLDA: Sparse Functional Linear Discriminant Analysis, LIME: Local Interpretable Model-agnostic Explanations, SHAP: Shapley Additive Explanations, GMM-LIME: Gaussian Mixture Model - Local Interpretable Model-agnostic Explanations, PFI: Permutation Feature Importance, ARL: Attention Reinforcement Learning.
3.3.1 Model-agnostic methods
A total of 16 reviewed studies applied ML and XAI within gait analysis across diverse clinical and healthy populations (Dindorf et al., 2020; Davis et al., 2021; Kim et al., 2021; Kim et al., 2022; Rupprechter et al., 2021; Teufl et al., 2021; Kokkotis et al., 2022; Moon et al., 2022; Fan et al., 2023; Zheng et al., 2025; Aghababa and Andrysek, 2024; Hussain and Jany, 2024; Mulwa et al., 2024; Özateş et al., 2024; Trabassi et al., 2024; Wu et al., 2024) (Table 3). Among these, 14 studies employed local interpretation methods to explain instance-specific predictions, while two studies (Teufl et al., 2021; Trabassi et al., 2024) utilized global interpretation techniques to assess overall model behavior. Only one study reported the fidelity of local model explanations (Mulwa et al., 2024).
Local methods, such as SHAP and LIME, dominated the field, with SHAP applied in 11 studies to generate Shapley values for detailed insights into feature contributions (Davis et al., 2021; Kim et al., 2021; Kim et al., 2022; Rupprechter et al., 2021; Kokkotis et al., 2022; Fan et al., 2023; Zheng et al., 2025; Aghababa and Andrysek, 2024; Hussain and Jany, 2024; Trabassi et al., 2024; Wu et al., 2024). SHAP identified vertical ground reaction forces (GRF) and stride duration as key features for classifying Parkinson’s disease (PD) (Rupprechter et al., 2021) and sarcopenia (Kim et al., 2021). Similarly, LIME provided local feature importance scores for tabular and signal data in four studies, such as identifying functional ankle angles in foot pathologies (Özateş et al., 2024).
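As a companion to the SHAP sketch above, the following minimal LIME example explains a single prediction from a tabular gait classifier. Again, the data, feature names, and class labels are hypothetical stand-ins, not drawn from the reviewed studies.

```python
# Minimal sketch: a LIME local explanation for one gait sample.
# Data, feature names, and class labels are hypothetical.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
names = ["ankle_rom_deg", "stride_time_s", "step_width_m", "peak_vGRF_BW"]
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.8 * X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=names, class_names=["healthy", "pathological"],
    mode="classification")

# LIME perturbs the instance and fits a sparse linear surrogate locally;
# the surrogate's weights serve as per-feature importance scores.
exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=4)
print(exp.as_list())
```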
In contrast, global interpretation methods focused on dataset-wide feature importance. Teufl et al. (2021) applied permutation feature importance (PFI) to rank joint angles and range of motion (ROM) metrics for gait abnormality classification in hip arthroplasty patients, identifying pelvic tilt as the most influential feature. Trabassi et al. (2024) employed SHAP in a global context, aggregating Shapley values across cerebellar ataxia and healthy cohorts to reveal gait symmetry and cadence as key discriminators. These global approaches prioritized population-level insights, with PFI emphasizing feature robustness and SHAP aggregations highlighting biomechanical patterns.
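The global logic of PFI can be summarized in a few lines: shuffle one feature at a time on held-out data and record the resulting drop in model score. The sketch below uses scikit-learn's implementation; the feature names mirror the kind of ROM metrics discussed above, but the data and model are synthetic stand-ins.

```python
# Minimal sketch: global permutation feature importance (PFI).
# Data, model, and feature names are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
names = ["pelvic_tilt_rom", "hip_flexion_rom", "knee_flexion_rom", "cadence"]
X = rng.normal(size=(400, 4))
y = (1.5 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffling a feature the model relies on degrades held-out accuracy;
# the mean drop over repeats is that feature's global importance.
res = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, mean, std in zip(names, res.importances_mean, res.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```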
The studies employed diverse data types and algorithms. Motion capture (Mocap) systems were the primary data acquisition method in the majority of studies (Dindorf et al., 2020; Davis et al., 2021; Kim et al., 2021; Teufl et al., 2021; Kokkotis et al., 2022; Fan et al., 2023; Aghababa and Andrysek, 2024; Mulwa et al., 2024; Özateş et al., 2024; Wu et al., 2024), providing time-series joint kinematics, GRF, and functional movement metrics. Wearable sensors (IMUs, EMG) enabled portable signal collection in five studies (Kim et al., 2022; Moon et al., 2022; Zheng et al., 2025; Hussain and Jany, 2024; Trabassi et al., 2024). Tabular gait parameters such as stride duration and cadence were frequently combined with signal data to enhance model performance. In terms of algorithms, ensemble methods (Random Forest, XGBoost) and neural networks [convolutional neural networks (CNNs), recurrent neural networks (RNNs)] were preferred for their robustness in handling heterogeneous data. Zheng et al. (2025) used CNNs to classify IMU-derived gait patterns in aging individuals, reporting an accuracy of 81.4%, supported by SHAP-based local explanations to elucidate feature influence.
3.3.2 Model-specific methods
Model-specific interpretability techniques were applied in 12 studies focused on gait analysis (Horst et al., 2019; Horst et al., 2020; Aeles et al., 2021; Creagh et al., 2021; Filtjens et al., 2021; Slijepcevic et al., 2021; Slijepcevic et al., 2023; Apostolidis et al., 2023; Yoon et al., 2023; Alharthi, 2024; Kim et al., 2024; Xiang et al., 2024), and were classified into two main categories: (1) neural network-based methods, including Layer-wise Relevance Propagation (LRP), Grad-CAM, and attention mechanisms; and (2) interpretable statistical learning approaches, including sparse functional linear discriminant analysis (SFLDA) (Table 4). These approaches were applied across diverse populations, including PD, cerebral palsy (CP), stroke survivors, and healthy participants, to elucidate model decisions and link them to biomechanically relevant features.
3.3.2.1 Neural network-based interpretability techniques
LRP was the most frequently employed method (Horst et al., 2019; Horst et al., 2020; Aeles et al., 2021; Creagh et al., 2021; Filtjens et al., 2021; Slijepcevic et al., 2021; Alharthi, 2024), applied to CNNs and deep neural networks (DNNs) to explain classifications of gait abnormalities. Horst et al. (2019), Horst et al. (2020) and Alharthi (2024) used LRP to compute relevance scores for GRFs and joint angles, identifying asymmetrical loading patterns in PD patients. Similarly, Slijepcevic et al. (2021) utilized LRP to highlight GRF features distinguishing hip, knee, and ankle pathologies from healthy gait. These studies demonstrated LRP’s ability to localize biomechanically critical phases of the gait cycle, such as mid-stance and push-off, which are often altered in neuromuscular disorders.
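To illustrate the core of LRP, the sketch below implements the epsilon rule for a toy two-layer ReLU network: relevance starts at the predicted output score and is redistributed layer by layer in proportion to each unit's contribution. The weights are random stand-ins, not a trained gait model.

```python
# Minimal sketch of the epsilon-LRP rule for a toy two-layer ReLU network.
# Weights are random stand-ins, not a trained gait model.
import numpy as np

rng = np.random.default_rng(3)
# 10 input gait features -> 8 hidden units -> 2 output classes.
W = [rng.normal(size=(10, 8)), rng.normal(size=(8, 2))]
b = [np.zeros(8), np.zeros(2)]

x = rng.normal(size=10)                  # one input sample
activations = [x]
for Wl, bl in zip(W, b):                 # forward pass with ReLU
    activations.append(np.maximum(0, activations[-1] @ Wl + bl))

# Initialize relevance at the predicted class's output score.
R = np.zeros(2)
R[np.argmax(activations[-1])] = np.max(activations[-1])

eps = 1e-6
for l in range(len(W) - 1, -1, -1):      # backward relevance pass
    a = activations[l]
    z = a @ W[l] + b[l]                  # pre-activations of layer l
    s = R / (z + eps * np.where(z >= 0, 1.0, -1.0))  # stabilized division
    R = a * (W[l] @ s)                   # redistribute relevance downward

print("per-feature input relevance:", np.round(R, 3))
```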
Attention mechanisms and Grad-CAM were adopted to provide temporal and spatial interpretability for recurrent and convolutional architectures (Apostolidis et al., 2023; Slijepcevic et al., 2023; Kim et al., 2024; Xiang et al., 2024). Xiang et al. (2024) applied an attention-based LSTM-MLP model to regress joint torques and contact forces, with attention weights pinpointing key temporal segments in healthy gait. Grad-CAM, used in three studies (Apostolidis et al., 2023; Slijepcevic et al., 2023; Kim et al., 2024), generated gradient-weighted activation maps to identify discriminative features in IMU and kinematic data. Apostolidis et al. (2023) linked high-activation regions in chronic stroke survivors to compensatory pelvic tilt strategies during the stance phase.
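A minimal Grad-CAM sketch for a 1D convolutional network over gait time series is shown below, assuming PyTorch; the architecture and input are toy stand-ins. The class-score gradient weights each feature-map channel, and the weighted sum localizes the time points driving the prediction.

```python
# Minimal sketch: Grad-CAM for a 1D CNN over gait time series (PyTorch).
# Network and signal are toy stand-ins, not from any reviewed study.
import torch
import torch.nn as nn

class GaitCNN(nn.Module):
    def __init__(self, n_channels=6, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU())
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(32, n_classes))

    def forward(self, x):
        self.fmap = self.features(x)      # keep last conv feature map
        self.fmap.retain_grad()           # needed to read its gradient
        return self.head(self.fmap)

model = GaitCNN().eval()
x = torch.randn(1, 6, 101)                # one gait cycle, 101 time points
logits = model(x)
logits[0, logits.argmax()].backward()     # gradient of the target class score

# Grad-CAM: weight each channel by its mean gradient, then combine.
weights = model.fmap.grad.mean(dim=2, keepdim=True)        # (1, 32, 1)
cam = torch.relu((weights * model.fmap).sum(dim=1)).squeeze().detach()
cam = cam / (cam.max() + 1e-8)            # normalize to [0, 1]
print(cam.shape)  # relevance over the 101 time points of the gait cycle
```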
3.3.2.2 Interpretable statistical learning
SFLDA was employed by Yoon et al. (2023) as a transparent statistical method to classify gait patterns in CP. By enforcing sparsity constraints, SFLDA identified a minimal subset of discriminative joint angle features (e.g., hip flexion and knee abduction) that differentiated CP patients from controls. This approach provided clinically interpretable coefficients, enabling direct comparison with biomechanical literature on CP gait deviations.
3.3.2.3 Comparison of different approaches
Neural network-based techniques enabled the modeling of nonlinear, spatiotemporal relationships in high-dimensional inputs such as GRFs and IMU data. LRP and Grad-CAM revealed nonlinear interactions in CNNs, such as phase-dependent coupling between ankle dorsiflexion and GRF peaks in PD (Alharthi, 2024). In contrast, SFLDA provided linear, population-level biomarkers for CP (Yoon et al., 2023).
Across studies, gait data were primarily collected using motion capture systems and wearable sensors, with ground reaction forces and joint angles among the most commonly analyzed features. Interpretability metrics, such as relevance scores and activation maps, were often validated against clinical assessments, including freezing of gait (FOG) severity in PD (Filtjens et al., 2021) and Gross Motor Function Classification System (GMFCS) levels in CP (Slijepcevic et al., 2023). A key limitation was the lack of standardized frameworks for translating relevance scores into actionable clinical insights, particularly in studies with limited sample sizes (Xiang et al., 2024).
3.3.3 Hybrid interpretability approaches
Three studies employed hybrid interpretability techniques to manage the complexity of gait analysis, combining attention mechanisms, graph-based modeling, and causal inference (Hou et al., 2022; Gu et al., 2024; Guo et al., 2024) (Table 5).
3.3.3.1 Attention-driven interpretability
Hou et al. (2022) and Gu et al. (2024) combined attention mechanisms with architectural innovations to enhance transparency. Hou et al. (2022) employed a Squeeze-and-Excitation Mechanism (channel-wise attention) and Part-based Scoring (spatial attention) in a CNN to classify gait patterns from mocap-derived images, identifying key anatomical regions (e.g., knee flexion) via attention weights. Gu et al. (2024) applied attention reinforcement learning (ARL) within a graph neural network (GNN) to analyze time-series joint coordinate data from individuals with chronic ankle instability (CAI), with attention weights highlighting asymmetrical ankle kinematics during stance.
3.3.3.2 Causality-enhanced graph-based interpretability
Guo et al. (2024) introduced a causality-driven framework for classifying FOG in Parkinson’s disease. Built on an enhanced Graph Convolutional Network (GCN), the model integrated temporal-spatial graph convolutions (TSGCN) and multiple instance learning (MIL) to detect FOG episodes in video-based gait recordings. A causal explanation framework quantified feature contributions, revealing that hip adduction and stride variability were causally linked to FOG onset. Performance metrics (accuracy = 92%, CES = 0.85) validated both predictive and explanatory power.
3.3.3.3 Data and clinical relevance
The hybrid methods utilized diverse data types, including images (Hou et al., 2022), time-series joint coordinates (Gu et al., 2024), and video (Guo et al., 2024), to address population-specific challenges (e.g., CAI, PD-FOG). Attention mechanisms improved granularity for localized features, while causal inference provided biomechanically plausible explanations for complex gait pathologies.
4 Discussion
This systematic review explores the application of XAI methods in gait analysis, providing insights into their prevalence, effectiveness, and limitations. Our comprehensive search across four databases identified 31 relevant studies, which we categorized based on their interpretability approaches. The findings highlight the widespread use of model-agnostic methods, particularly global and local interpretation techniques, and emphasize the role of XAI in analyzing gait patterns, especially in clinical populations. Notably, most studies focused on persons with neurological and musculoskeletal conditions, showcasing the potential of XAI to improve clinical decision-making and rehabilitation strategies.
4.1 Prevalence and variety of XAI approaches
This review underscores the versatility of model-agnostic interpretability techniques, which were employed in 16 studies (Dindorf et al., 2020; Davis et al., 2021; Kim et al., 2021; Kim et al., 2022; Rupprechter et al., 2021; Teufl et al., 2021; Kokkotis et al., 2022; Moon et al., 2022; Fan et al., 2023; Zheng et al., 2025; Aghababa and Andrysek, 2024; Hussain and Jany, 2024; Mulwa et al., 2024; Özateş et al., 2024; Trabassi et al., 2024; Wu et al., 2024). Techniques such as SHAP and LIME were particularly favored due to their flexibility in application across various ML models. Aggregated globally, SHAP values (Lundberg and Lee, 2017) identified population-level biomarkers, such as reduced cadence in aging adults (Aziz et al., 2024; Bassan et al., 2024). In contrast, local interpretation techniques, such as Grad-CAM, enabled case-specific visualizations of influential gait features, offering clinicians a more intuitive understanding of model predictions (Rupprechter et al., 2021). Grad-CAM has been used to visualize phase-specific muscle activation patterns in sarcopenia, aligning with clinical gait assessments (Kim et al., 2022). Feature importance methods were extensively used to rank biomechanical variables, such as stride length and joint angles, bridging the gap between model predictions and biomechanical understanding (Dindorf et al., 2020; Teufl et al., 2021; Mulwa et al., 2024; Özateş et al., 2024). Among model-specific approaches, LRP and attention mechanisms were widely used with CNN and LSTM models because of their ability to highlight relevant input features and improve interpretability in temporal data (Binder et al., 2016; Niu et al., 2021). These approaches facilitated the identification of crucial gait features contributing to specific conditions, aiding both researchers and clinicians in interpreting ML-generated results effectively. However, only one included study evaluated the fidelity of local explanations, i.e., the extent to which the explanation accurately represented the model’s decision logic.
4.2 Pathological gait
A significant proportion of the reviewed studies concentrated on individuals with neurological and musculoskeletal conditions, including PD, stroke, sarcopenia, cerebral palsy, and musculoskeletal injuries (Filtjens et al., 2021; Kim et al., 2021; Kim et al., 2022; Slijepcevic et al., 2023; Yoon et al., 2023; Alharthi, 2024; Guo et al., 2024; Hussain and Jany, 2024; Mulwa et al., 2024). The complexity and heterogeneity of gait impairments in these populations necessitate advanced analytical methods, making XAI a valuable tool in clinical settings. For example, XAI methods have been instrumental in identifying gait features associated with FOG in PD (Filtjens et al., 2021; Guo et al., 2024) or compensatory mechanisms in stroke survivors (Apostolidis et al., 2023; Hussain and Jany, 2024). These insights have significant implications for targeted rehabilitation strategies and clinical decision-making, as interpretable ML models can offer transparent predictions regarding therapeutic outcomes and disease progression (Saraswat et al., 2022; Aziz et al., 2024). The emphasis on non-healthy populations also highlights the need for XAI techniques that can accommodate the variability inherent in pathological gait patterns (Slijepcevic et al., 2021).
4.3 Sensor modalities and data types
The reviewed studies utilized diverse sensor modalities, including Mocap, IMUs, EMG, and force plates, each with unique implications for interpretability and model accuracy. Mocap systems were widely used due to their precision in kinematic data collection, making them ideal for studies requiring detailed biomechanical indices. However, their reliance on controlled laboratory environments limits real-world applicability (Xiang et al., 2022b). IMUs captured dynamic movement patterns (e.g., stride variability in PD) (Wu et al., 2024), but their susceptibility to motion artifacts limited relevance score consistency (Creagh et al., 2021). Hybrid approaches, such as fusing IMU data with Mocap-derived kinematics (Zheng et al., 2025), may mitigate these issues. EMG sensors provided valuable muscle activation data, enhancing the interpretability of ML models. However, they require meticulous placement and calibration, introducing variability that can affect reproducibility (Karamanidis et al., 2004).
4.4 Machine learning models and the challenge of interpretability
Tree-based ensemble methods were among the most frequently used ML techniques due to their robustness in handling high-dimensional gait data and partial interpretability (Hussain and Jany, 2024). Support vector machines (SVMs) also appeared frequently, benefiting from their simplicity and effectiveness in classification tasks (Dindorf et al., 2020). Despite their advantages, these models struggle to capture intricate gait dynamics compared to more complex deep learning models. While deep learning models provide superior predictive power, their “black-box” nature hinders their adoption in clinical applications where transparency is paramount. As Rudin (2019) emphasized, reliance on post hoc explanations for black-box models in high-stakes decision-making may perpetuate poor practices and even cause harm, whereas inherently interpretable models offer a more transparent and trustworthy alternative. This perspective underscores the importance of balancing predictive performance with interpretability in gait-related applications, where clinical trust is critical. Nevertheless, integrating XAI methods into deep learning models remains essential to ensure that clinicians and researchers can interpret and rely on the predictions generated by advanced ML approaches. Deep learning-specific XAI techniques, such as LRP, Grad-CAM, and attention mechanisms, are particularly valuable in this context (Horst et al., 2019; Slijepcevic et al., 2023; Xiang et al., 2024).
Interpretability-related metrics are inherently heterogeneous across XAI methods and thus difficult to quantify in a standardized way. For example, feature attribution approaches (e.g., SHAP, LIME) are often assessed with fidelity or sensitivity measures, while saliency-based methods (e.g., Grad-CAM) are more commonly evaluated through visual plausibility or human judgment. This lack of cross-method comparability highlights the need for a structured framework that combines (i) method-specific fidelity metrics, (ii) user-centered evaluation of explanation usefulness, and (iii) transparent reporting of model accuracy thresholds.
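One concrete way to quantify fidelity for attribution methods is a deletion test: occlude the features an explanation ranks highest and check that the model's confidence drops more than under a random occlusion. A minimal sketch, with a synthetic model and a hypothetical attribution ranking, is shown below.

```python
# Minimal sketch: a deletion-based fidelity check for feature attributions.
# Occluding top-ranked features should degrade the model's confidence more
# than occluding random ones. Data, model, and ranking are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

def confidence_drop(x, ranking, k, baseline):
    """Replace the k top-ranked features with a baseline value and
    return the drop in predicted probability of the original class."""
    p0 = model.predict_proba(x.reshape(1, -1))[0]
    cls = p0.argmax()
    x_occ = x.copy()
    x_occ[ranking[:k]] = baseline[ranking[:k]]
    p1 = model.predict_proba(x_occ.reshape(1, -1))[0]
    return p0[cls] - p1[cls]

x = X[0]
baseline = X.mean(axis=0)                 # mean-imputation baseline
attribution_rank = np.arange(8)           # stand-in for, e.g., a SHAP ranking
random_rank = rng.permutation(8)

print("drop (attribution order):", confidence_drop(x, attribution_rank, 2, baseline))
print("drop (random order):     ", confidence_drop(x, random_rank, 2, baseline))
```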
4.5 Interplay between prediction and interpretability
The interpretability of ML models in gait analysis depends significantly on the quality and relevance of input data. Motion capture data, with its high fidelity, is preferred for biomechanical studies requiring precise kinematic analyses, whereas IMU data is often sufficient for broader classification tasks where portability is prioritized (Xiang et al., 2022b; Xiang et al., 2024). The effectiveness of XAI techniques also varies based on the prediction task. Classification tasks, such as distinguishing between healthy and pathological gait, frequently rely on feature importance methods like SHAP and LIME, which highlight key predictive features (Dindorf et al., 2020; Rupprechter et al., 2021). In contrast, regression tasks, such as estimating joint torques or stride length, require techniques that capture continuous relationships between input and output variables. While classification tasks dominate the literature due to their straightforward data labeling and model evaluation, regression tasks present a unique challenge for explainability. Emerging techniques, such as attention mechanisms (Xiang et al., 2024) and LRP (Horst et al., 2019), show promise in improving interpretability for these models by identifying influential input factors and revealing complex biomechanical relationships.
4.6 Limitations and future directions
Despite the advantages of XAI, challenges remain in achieving meaningful interpretability and enhancing trust and transparency, particularly in clinical settings. Ghassemi et al. (2021) argue that the potential of XAI may be overstated, warning against an overreliance on explainability as a strict requirement for clinical deployment. They liken it to human decision-making, where we often trust our judgments without fully understanding the underlying neural mechanisms. Similarly, while ML models—especially deep learning—can be highly complex and opaque, this does not necessarily prevent their effective use in healthcare. Different XAI methods pose challenges for standardization. For instance, SHAP values and relevance scores vary in their interpretation, making it difficult to establish consistent benchmarks. A key limitation identified is that most studies reported the type of XAI method applied without providing quantitative measures such as fidelity, consistency, or stability scores. Future research should complement qualitative explanations with standardized fidelity metrics to better evaluate how well XAI methods reflect underlying model behavior and to facilitate comparisons across studies. Standardizing XAI metrics such as fidelity and faithfulness is essential to ensure robustness, efficiency, and clinical applicability of XAI in gait biomechanics.
5 Conclusion
In summary, XAI holds promise for improving transparency and fostering clinical trust in gait-related machine learning. However, its effective translation requires refinement and validation. Based on our findings, we propose a practical roadmap: (i) developers should report not only model accuracy but also method-specific fidelity or consistency metrics to demonstrate explanation reliability; (ii) researchers should adopt standardized reporting practices that specify the XAI approach, dataset type, and evaluation criteria; and (iii) clinicians should critically appraise whether explanations are interpretable and actionable in their decision-making context, and participate in user-centered evaluations of XAI tools. Advancing along these lines will accelerate the clinical utility of XAI in gait analysis and rehabilitation.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author contributions
LX: Conceptualization, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Project administration. ZG: Conceptualization, Formal analysis, Investigation, Data curation, Visualization, Writing – original draft. PY: Validation, Writing – original draft. JF: Validation, Writing – review and editing. YG: Validation, Writing – review and editing. RW: Resources, Writing – review and editing. EG-F: Resources, Writing – review and editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Digital Futures Postdoc Fellowship (KTH-RPROJ-0146472) and Promobilia Foundation (Ref. 23300).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe.2025.1671344/full#supplementary-material
References
Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi:10.1109/access.2018.2870052
Aeles, J., Horst, F., Lapuschkin, S., Lacourpaille, L., and Hug, F. (2021). Revealing the unique features of each individual’s muscle activation signatures. J. R. Soc. Interface 18, 20200770. doi:10.1098/rsif.2020.0770
Aghababa, M. P., and Andrysek, J. (2024). Exploration and demonstration of explainable machine learning models in prosthetic rehabilitation-based gait analysis. PLoS One 19, e0300447. doi:10.1371/journal.pone.0300447
Albahri, A. S., Duhaim, A. M., Fadhel, M. A., Alnoor, A., Baqer, N. S., Alzubaidi, L., et al. (2023). A systematic review of trustworthy and explainable artificial intelligence in healthcare: assessment of quality, bias risk, and data fusion. Inf. Fusion 96, 156–191. doi:10.1016/j.inffus.2023.03.008
Alharthi, A. S. (2024). Interpretable machine learning comprehensive human gait deterioration analysis. Front. Neuroinform 18, 1451529. doi:10.3389/fninf.2024.1451529
Apostolidis, K., Kokkotis, C., Karakasis, E., Karampina, E., Moustakidis, S., Menychtas, D., et al. (2023). Innovative visualization approach for biomechanical time series in stroke diagnosis using explainable machine learning methods: a proof-of-concept study. Information 14, 559. doi:10.3390/info14100559
Aziz, N. A., Manzoor, A., Mazhar Qureshi, M. D., Qureshi, M. A., and Rashwan, W. (2024). Unveiling explainable AI in healthcare: current trends, challenges, and future directions. [Preprint].
Bassan, S., Amir, G., and Katz, G. (2024). Local vs. global interpretability: a computational complexity perspective. ArXiv Prepr. doi:10.48550/arXiv.2406.02981
Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R., and Samek, W. (2016). “Layer-wise relevance propagation for neural networks with local renormalization layers,” in Artificial Neural Networks and Machine Learning–ICANN 2016: 25Th international conference on artificial neural networks, Barcelona, Spain, September 6-9, 2016, proceedings, part II 25 (Springer), 63–71.
Chen, S., Lach, J., Lo, B., and Yang, G.-Z. (2016). Toward pervasive gait analysis with wearable sensors: a systematic review. IEEE J. Biomed. Health Inf. 20, 1521–1537. doi:10.1109/jbhi.2016.2608720
Cirstea, C. M. (2020). Gait rehabilitation after stroke: should we re-evaluate our practice? Stroke 51, 2892–2894. doi:10.1161/strokeaha.120.032041
Creagh, A. P., Lipsmeier, F., Lindemann, M., and De Vos, M. (2021). Interpretable deep learning for the remote characterisation of ambulation in multiple sclerosis using smartphones. Sci. Rep. 11, 14301. doi:10.1038/s41598-021-92776-x
Davis, J. R. C., Knight, S. P., Donoghue, O. A., Hernández, B., Rizzo, R., Kenny, R. A., et al. (2021). Comparison of gait speed reserve, usual gait speed, and maximum gait speed of adults aged 50+ in Ireland using explainable machine learning. Front. Netw. Physiology 1, 754477. doi:10.3389/fnetp.2021.754477
Dindorf, C., Teufl, W., Taetz, B., Bleser, G., and Fröhlich, M. (2020). Interpretability of input representations for gait classification in patients after total hip arthroplasty. Sensors 20, 4385. doi:10.3390/s20164385
Dou, H., Zhang, P., Su, W., Yu, Y., Lin, Y., and Li, X. (2023). “Gaitgci: generative counterfactual intervention for gait recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5578–5588.
Downs, S. H., and Black, N. (1998). The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J. Epidemiol. Community Health 52, 377–384. doi:10.1136/jech.52.6.377
Faerman, M. V., Cole, C., Van Ooteghem, K., Cornish, B. F., Howe, E. E., Siu, V., et al. (2025). Motor, affective, cognitive, and perceptual symptom changes over time in individuals with Parkinson’s disease who develop freezing of gait. J. Neurol. 272, 321–19. doi:10.1007/s00415-025-13034-y
Fan, S., Ye, J., Xu, Q., Peng, R., Hu, B., Pei, Z., et al. (2023). Digital health technology combining wearable gait sensors and machine learning improve the accuracy in prediction of frailty. Front. Public Health 11, 1169083. doi:10.3389/fpubh.2023.1169083
Ferber, R., Osis, S. T., Hicks, J. L., and Delp, S. L. (2016). Gait biomechanics in the era of data science. J. Biomech. 49, 3759–3761. doi:10.1016/j.jbiomech.2016.10.033
Filtjens, B., Ginis, P., Nieuwboer, A., Afzal, M. R., Spildooren, J., Vanrumste, B., et al. (2021). Modelling and identification of characteristic kinematic features preceding freezing of gait with convolutional neural networks and layer-wise relevance propagation. BMC Med. Inf. Decis. Mak. 21, 341–11. doi:10.1186/s12911-021-01699-0
Frasca, M., La Torre, D., Pravettoni, G., and Cutica, I. (2024). Explainable and interpretable artificial intelligence in medicine: a systematic bibliometric review. Discov. Artif. Intell. 4, 15. doi:10.1007/s44163-024-00114-7
Gao, Z., Xiang, L., Fekete, G., Baker, J. S., Mao, Z., and Gu, Y. (2023). A data-driven approach for fatigue detection during running using pedobarographic measurements. Appl. Bionics Biomech. 2023, 1–11. doi:10.1155/2023/7022513
Ghassemi, M., Oakden-Rayner, L., and Beam, A. L. (2021). The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750. doi:10.1016/s2589-7500(21)00208-9
Gu, H., Yen, S.-C., Folmar, E., and Chou, C.-A. (2024). GaitNet+ ARL: a deep learning algorithm for interpretable gait analysis of chronic ankle instability. IEEE J. Biomed. Health Inf. 28, 3918–3927. doi:10.1109/jbhi.2024.3383588
Guo, R., Xie, Z., Zhang, C., and Qian, X. (2024). Causality-enhanced multiple instance learning with graph convolutional networks for parkinsonian freezing-of-gait assessment. IEEE Trans. Image Process. 33, 3991–4001. doi:10.1109/tip.2024.3416052
Halilaj, E., Rajagopal, A., Fiterau, M., Hicks, J. L., Hastie, T. J., and Delp, S. L. (2018). Machine learning in human movement biomechanics: best practices, common pitfalls, and new opportunities. J. Biomech. 81, 1–11. doi:10.1016/j.jbiomech.2018.09.009
Harris, E. J., Khoo, I.-H., and Demircan, E. (2022). A survey of human gait-based artificial intelligence applications. Front. Robot. AI 8, 749274. doi:10.3389/frobt.2021.749274
Horst, F., Lapuschkin, S., Samek, W., Müller, K.-R., and Schöllhorn, W. I. (2019). Explaining the unique nature of individual gait patterns with deep learning. Sci. Rep. 9, 2391. doi:10.1038/s41598-019-38748-8
Horst, F., Slijepcevic, D., Zeppelzauer, M., Raberger, A.-M., Lapuschkin, S., Samek, W., et al. (2020). Explaining automated gender classification of human gait. Gait Posture 81, 159–160. doi:10.1016/j.gaitpost.2020.07.114
Hou, S., Liu, X., Cao, C., and Huang, Y. (2022). Gait quality aware network: toward the interpretability of silhouette-based gait recognition. IEEE Trans. Neural Netw. Learn Syst. 34, 8978–8988. doi:10.1109/tnnls.2022.3154723
Hussain, I., and Jany, R. (2024). Interpreting stroke-impaired electromyography patterns through explainable artificial intelligence. Sensors 24, 1392. doi:10.3390/s24051392
Karamanidis, K., Arampatzis, A., and Brüggemann, G.-P. (2004). Reproducibility of electromyography and ground reaction force during various running techniques. Gait Posture 19, 115–123. doi:10.1016/s0966-6362(03)00040-7
Kessler, S. E., Rainbow, M. J., Lichtwark, G. A., Cresswell, A. G., D’Andrea, S. E., Konow, N., et al. (2019). A direct comparison of biplanar videoradiography and optical motion capture for foot and ankle kinematics. Front. Bioeng. Biotechnol. 7, 199. doi:10.3389/fbioe.2019.00199
Kim, J.-K., Bae, M.-N., Lee, K. B., and Hong, S. G. (2021). Identification of patients with sarcopenia using gait parameters based on inertial sensors. Sensors 21, 1786. doi:10.3390/s21051786
Kim, J.-K., Bae, M.-N., Lee, K., Kim, J.-C., and Hong, S. G. (2022). Explainable artificial intelligence and wearable sensor-based gait analysis to identify patients with osteopenia and sarcopenia in daily life. Biosens. (Basel) 12, 167. doi:10.3390/bios12030167
Kim, Y.-G., Kim, S., Park, J. H., Yang, S., Jang, M., Yun, Y. J., et al. (2024). Explainable deep-learning-based gait analysis of hip–knee cyclogram for the prediction of adolescent idiopathic scoliosis progression. Sensors 24, 4504. doi:10.3390/s24144504
Kokkotis, C., Moustakidis, S., Tsatalas, T., Ntakolia, C., Chalatsis, G., Konstadakos, S., et al. (2022). Leveraging explainable machine learning to identify gait biomechanical parameters associated with anterior cruciate ligament injury. Sci. Rep. 12, 6647. doi:10.1038/s41598-022-10666-2
Loh, H. W., Ooi, C. P., Seoni, S., Barua, P. D., Molinari, F., and Acharya, U. R. (2022). Application of explainable artificial intelligence for healthcare: a systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 226, 107161. doi:10.1016/j.cmpb.2022.107161
Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30 (NIPS 2017).
Martínez-Pascual, D., Catalán, J. M., Blanco-Ivorra, A., Sanchís, M., Arán-Ais, F., and García-Aracil, N. (2024). Gait activity classification with convolutional neural network using lower limb angle measurement from inertial sensors. IEEE Sens. J. 24, 21479–21489. doi:10.1109/JSEN.2024.3400296
Mekni, A., Narayan, J., and Gritli, H. (2025a). Multi-class classification of gait cycle phases using machine learning: a comprehensive study using two training methods. Netw. Model. Analysis Health Inf. Bioinforma. 14, 30. doi:10.1007/s13721-025-00522-4
Mekni, A., Narayan, J., and Gritli, H. (2025b). Quinary classification of human gait phases using machine learning: investigating the potential of different training methods and scaling techniques. Big Data Cognitive Comput. 9, 89. doi:10.3390/bdcc9040089
Minh, D., Wang, H. X., Li, Y. F., and Nguyen, T. N. (2022). Explainable artificial intelligence: a comprehensive review. Artif. Intell. Rev. 55, 3503–3568. doi:10.1007/s10462-021-10088-y
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and The PRISMA Group (2009). Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 6, e1000097. doi:10.1371/journal.pmed.1000097
Moon, J., Shin, Y.-M., Park, J.-D., Minaya, N. H., Shin, W.-Y., and Choi, S.-I. (2022). Explainable gait recognition with prototyping encoder–decoder. PLoS One 17, e0264783. doi:10.1371/journal.pone.0264783
Mulwa, M. M., Mwangi, R. W., and Mindila, A. (2024). GMM-LIME explainable machine learning model for interpreting sensor-based human gait. Eng. Rep. 6, e12864. doi:10.1002/eng2.12864
Niu, Z., Zhong, G., and Yu, H. (2021). A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62. doi:10.1016/j.neucom.2021.03.091
Özateş, M. E., Yaman, A., Salami, F., Campos, S., Wolf, S. I., and Schneider, U. (2024). Identification and interpretation of gait analysis features and foot conditions by explainable AI. Sci. Rep. 14, 5998. doi:10.1038/s41598-024-56656-4
Phinyomark, A., Petri, G., Ibáñez-Marcelo, E., Osis, S. T., and Ferber, R. (2018). Analysis of big data in gait biomechanics: current trends and future directions. J. Med. Biol. Eng. 38, 244–260. doi:10.1007/s40846-017-0297-2
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you? Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144.
Rudin, C. (2019). Stop explaining Black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215. doi:10.1038/s42256-019-0048-x
Rupprechter, S., Morinan, G., Peng, Y., Foltynie, T., Sibley, K., Weil, R. S., et al. (2021). A clinically interpretable computer-vision based method for quantifying gait in parkinson’s disease. Sensors 21, 5437. doi:10.3390/s21165437
Salahuddin, Z., Woodruff, H. C., Chatterjee, A., and Lambin, P. (2022). Transparency of deep neural networks for medical image analysis: a review of interpretability methods. Comput. Biol. Med. 140, 105111. doi:10.1016/j.compbiomed.2021.105111
Saraswat, D., Bhattacharya, P., Verma, A., Prasad, V. K., Tanwar, S., Sharma, G., et al. (2022). Explainable AI for healthcare 5.0: opportunities and challenges. IEEE Access 10, 84486–84517. doi:10.1109/access.2022.3197671
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). “Grad-cam: visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 618–626.
Slijepcevic, D., Horst, F., Lapuschkin, S., Horsak, B., Raberger, A.-M., Kranzl, A., et al. (2021). Explaining machine learning models for clinical gait analysis. ACM Trans. Comput. Healthc. 3, 1–27. doi:10.1145/3474121
Slijepcevic, D., Zeppelzauer, M., Unglaube, F., Kranzl, A., Breiteneder, C., and Horsak, B. (2023). Explainable machine learning in human gait analysis: a study on children with cerebral palsy. IEEE Access 11, 65906–65923. doi:10.1109/access.2023.3289986
Teoh, Y. X., Othmani, A., Goh, S. L., Usman, J., and Lai, K. W. (2024). Deciphering knee osteoarthritis diagnostic features with explainable artificial intelligence: a systematic review. IEEE Access 12, 109080–109108. doi:10.1109/access.2024.3439096
Teufl, W., Taetz, B., Miezal, M., Dindorf, C., Fröhlich, M., Trinler, U., et al. (2021). Automated detection and explainability of pathological gait patterns using a one-class support vector machine trained on inertial measurement unit based gait data. Clin. Biomech. 89, 105452. doi:10.1016/j.clinbiomech.2021.105452
Tjoa, E., and Guan, C. (2020). A survey on explainable artificial intelligence (Xai): toward medical xai. IEEE Trans. Neural Netw. Learn Syst. 32, 4793–4813. doi:10.1109/tnnls.2020.3027314
Trabassi, D., Castiglia, S. F., Bini, F., Marinozzi, F., Ajoudani, A., Lorenzini, M., et al. (2024). Optimizing rare disease gait classification through data balancing and generative AI: insights from hereditary cerebellar ataxia. Sensors 24, 3613. doi:10.3390/s24113613
Uhlrich, S. D., Falisse, A., Kidziński, Ł., Muccini, J., Ko, M., Chaudhari, A. S., et al. (2023). OpenCap: human movement dynamics from smartphone videos. PLoS Comput. Biol. 19, e1011462. doi:10.1371/journal.pcbi.1011462
Vilone, G., and Longo, L. (2021). Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 76, 89–106. doi:10.1016/j.inffus.2021.05.009
Wade, L., Needham, L., McGuigan, P., and Bilzon, J. (2022). Applications and limitations of current markerless motion capture methods for clinical gait biomechanics. PeerJ 10, e12995. doi:10.7717/peerj.12995
Winter, D. A. (2009). Biomechanics and motor control of human movement. Hoboken, NJ: John Wiley & Sons.
Wu, X., Ma, L., Wei, P., Shan, Y., Chan, P., Wang, K., et al. (2024). Wearable sensor devices can automatically identify the ON-OFF status of patients with Parkinson’s disease through an interpretable machine learning model. Front. Neurol. 15, 1387477. doi:10.3389/fneur.2024.1387477
Xiang, L., Gu, Y., Mei, Q., Wang, A., Shim, V., and Fernandez, J. (2022a). Automatic classification of barefoot and shod populations based on the foot metrics and plantar pressure patterns. Front. Bioeng. Biotechnol. 10, 843204. doi:10.3389/fbioe.2022.843204
Xiang, L., Wang, A., Gu, Y., Zhao, L., Shim, V., and Fernandez, J. (2022b). Recent machine learning progress in lower limb running biomechanics with wearable technology: a systematic review. Front. Neurorobot 16, 913052. doi:10.3389/fnbot.2022.913052
Xiang, L., Gu, Y., Wang, A., Shim, V., Gao, Z., and Fernandez, J. (2023). Foot pronation prediction with inertial sensors during running: a preliminary application of data-driven approaches. J. Hum. Kinet. 88, 29–40. doi:10.5114/jhk/163059
Xiang, L., Gu, Y., Gao, Z., Yu, P., Shim, V., Wang, A., et al. (2024). Integrating an LSTM framework for predicting ankle joint biomechanics during gait using inertial sensors. Comput. Biol. Med. 170, 108016. doi:10.1016/j.compbiomed.2024.108016
Xiang, L., Gu, Y., Deng, K., Gao, Z., Shim, V., Wang, A., et al. (2025). Integrating personalized shape prediction, biomechanical modeling, and wearables for bone stress prediction in runners. Npj Digit. Med. 8, 276. doi:10.1038/s41746-025-01677-0
Yang, G., Ye, Q., and Xia, J. (2022). Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: a mini-review, two showcases and beyond. Inf. Fusion 77, 29–52. doi:10.1016/j.inffus.2021.07.016
Yoon, C., Jeon, Y., Choi, H., Kwon, S.-S., and Ahn, J. (2023). Interpretable classification for multivariate gait analysis of cerebral palsy. Biomed. Eng. Online 22, 109. doi:10.1186/s12938-023-01168-x
Keywords: gait analysis, machine learning, explainable artificial intelligence (XAI), biomechanics, black-box models
Citation: Xiang L, Gao Z, Yu P, Fernandez J, Gu Y, Wang R and Gutierrez-Farewik EM (2025) Explainable artificial intelligence for gait analysis: advances, pitfalls, and challenges - a systematic review. Front. Bioeng. Biotechnol. 13:1671344. doi: 10.3389/fbioe.2025.1671344
Received: 22 July 2025; Accepted: 17 October 2025;
Published: 30 October 2025.
Edited by:
Veronica Cimolin, Polytechnic University of Milan, Italy
Reviewed by:
Hassène Gritli, Carthage University, Tunisia
Andrew Georgiadis, Gillette Children’s Specialty Healthcare, United States
Copyright © 2025 Xiang, Gao, Yu, Fernandez, Gu, Wang and Gutierrez-Farewik. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Liangliang Xiang, liaxi@kth.se
†These authors have contributed equally to this work