SYSTEMATIC REVIEW article

Front. Cardiovasc. Med., 02 January 2026

Sec. Heart Failure and Transplantation

Volume 12 - 2025 | https://doi.org/10.3389/fcvm.2025.1659298

Artificial intelligence in electrocardiogram-based prediction of heart failure: a systematic review and meta-analysis

  • 1. Department of Cardiology, Pangang Group General Hospital, Panzhihua, China

  • 2. The 19th Batch of Chinese Medical Team in Sao Tome and Principe, Sichuan Provincial Health Commission, West China Hospital, Sichuan University, Chengdu, China

  • 3. Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China

  • 4. Department of Stomatology, Yaan People’s Hospital, Yaan, China

  • 5. Department of Gynaecology and Obstetrics, Yaan People’s Hospital, Yaan, China

  • 6. Department of Outpatient Chengbei, The Affiliated Stomatological Hospital, Southwest Medical University, Luzhou, China

  • 7. Department of Ultrasound Medicine, Panzhihua Women & Enfants Healthcare Hospital, Panzhihua, China

  • 8. Department of Traditional Chinese Medicine, Panzhihua Central Hospital, Panzhihua, China

  • 9. Department of General Surgery & Laboratory of Gastric Cancer, State Key Laboratory of Biotherapy/Collaborative Innovation Center of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, China

  • 10. Gastric Cancer Center, West China Hospital, Sichuan University, Chengdu, China

Abstract

Background:

Heart failure (HF) continues to pose a significant global health challenge, characterized by an increasing prevalence. Early identification of individuals at the highest risk of developing HF, combined with timely intervention, can prevent or delay disease progression. The application of artificial intelligence (AI) to electrocardiograms (ECGs) presents a novel strategy for early prediction; however, the effectiveness and generalizability of this approach require systematic evaluation.

Objective:

To systematically evaluate the performance of AI models based on ECGs in predicting HF.

Methods:

This study was registered on PROSPERO (CRD420251012231). Following the PRISMA guidelines, we conducted a systematic literature search across multiple databases, including PubMed, IEEE Xplore, Medline, and Embase, for studies published between 2005 and 2025. The inclusion criteria focused on ECG-based AI models that reported performance metrics such as the area under the receiver operating characteristic curve (AUROC)/C-statistic. A meta-analysis employing a random-effects model was performed to evaluate the efficacy of AI in predicting HF through the pooled AUROC/C-statistic. Additionally, we conducted heterogeneity analyses using I² and performed subgroup comparisons across ethnicities, while assessing the risk of bias with the PROBAST+AI tool.

Results:

A total of five studies involving 11 cohorts and 1,728,134 participants were included in the analysis. The pooled AUROC/C-statistic was 0.76 (95% CI: 0.74–0.78; p < 0.001), indicating moderate-to-good discrimination. Subgroup analyses demonstrated consistent performance across ethnic groups, with AUROC values ranging from 0.77 to 0.79, comparable to traditional risk models (AUROC 0.742, 95% CI: 0.692–0.787; P = 0.575 for the difference). Notably, significant heterogeneity was observed among the studies (I² = 89%, p < 0.01), which may be attributed to systematic differences in population characteristics, study design, and data quality.

Conclusions:

Theoretically, artificial intelligence-enabled electrocardiogram (AI-ECG) models demonstrate promising applicability for predicting HF; however, their effectiveness remains uncertain due to a high risk of bias and a lack of clinical validation studies.

Systematic Review Registration:

https://www.crd.york.ac.uk/PROSPERO/view/CRD420251012231, PROSPERO CRD420251012231.

1 Introduction

HF affects over 64 million people globally, with its incidence and prevalence continuing to rise (1). Patients with HF experience a decline in quality of life, an increased risk of mortality, and a significant economic burden (2, 3). Notably, identifying individuals at high risk for future HF allows early initiation of low-cost medical treatments, which clinical practice guidelines recognize as capable of altering the disease trajectory, reducing the risk of new-onset clinical HF, and improving life expectancy (4, 5). Although strategies based on serum testing (6, 7) and clinical scoring (8, 9) to predict new-onset HF are feasible, their reliance on invasive blood sampling and the complexity of acquiring multivariable parameters substantially increase implementation challenges and associated costs. In stark contrast, AI-ECG analysis, which detects latent cardiovascular disease features from 12-lead ECGs, provides a non-invasive and cost-effective solution.

Pioneering work in deep ECG phenotyping has demonstrated that AI can effectively predict atrial fibrillation (10, 11). In recent years, research has also shown that AI-ECG exhibits good performance in predicting HF (12–14); however, to our knowledge, despite continuous technological advancements, a comprehensive systematic evaluation of its performance for HF prediction is still lacking. Therefore, this study aims to assess its discriminatory ability and heterogeneity in prediction through a meta-analysis, providing clinicians with a more comprehensive understanding of the application of AI-ECG in HF prediction, thereby guiding future clinical practices and research directions.

2 Methods

2.1 Search strategy and inclusion criteria

This review adheres to the PRISMA statement and the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) (15, 16) (Supplementary Table 1). Furthermore, this study has been registered with PROSPERO (CRD420251012231). A systematic search of the literature published from March 2005 to March 2025 was conducted in the PubMed, IEEE Xplore, Medline, and Embase databases. The search terms used were (“Artificial Intelligence” OR “Machine Learning” OR “Deep Learning”) AND (“Electrocardiogram” OR “ECG”) AND (“Heart Failure” OR “congestive heart failure” OR “cardiac insufficiency” OR “Cardiac Failure”) AND (“Prediction” OR “Early Detection”). The inclusion criteria were: (1) artificial intelligence models developed solely from 12-lead ECG data for the prediction of heart failure; (2) reporting of discriminative performance metrics (such as AUROC or c-statistic) and 95% confidence intervals (CIs); (3) original research published in English and subjected to peer review. The exclusion criteria encompassed commentaries, conference papers, reviews, non-English literature, models integrating non-ECG variables (such as laboratory indicators), and studies that did not provide complete performance metrics.

2.2 Data extraction, assessment of quality and risk of bias

Two independent reviewers extracted data using a standardized form that captured study characteristics, including author, year, cohort, and sample size, as well as the AI model architecture and performance metrics. Discrepancies were resolved through consensus or third-party adjudication. The risk of bias and quality were assessed using the PROBAST+AI tool, which evaluates domains such as participants and data sources, predictors, outcomes, and analysis methodology (17). Studies were classified as having a low, unclear, or high risk of bias based on predefined criteria. Cochran's Q test and the I² statistic were employed to quantify heterogeneity. A sensitivity analysis was conducted by sequentially removing individual studies under a random-effects model.

2.3 Data synthesis and statistical analysis

For each study, we extracted the model's c-statistic/AUROC. When a study reported multiple cohorts and presented data for each cohort separately, we evaluated the model performance of each cohort individually. Funnel plots were created to examine publication bias. We analyzed discriminative ability through summary estimates of the AUROC/c-statistic and the corresponding 95% CI. When a 95% confidence interval was not reported, we calculated it using the method described by Debray et al. (18). We calculated the 95% prediction interval (PI) to describe the degree of inter-study heterogeneity and to indicate the likely range of model performance in new validations. Based on previous literature, summary AUROC/c-statistics were predefined as inadequate (<0.60), sufficient (0.60–0.70), acceptable (0.70–0.80), and excellent (>0.80) (19). We performed the meta-analysis using the metafor package in R (version 4.5.0; R Foundation for Statistical Computing). Our primary analysis evaluated the overall discriminative ability of all models for which cohorts reported AUROC/c-statistic data. In the secondary analysis, we compared the AI-ECG models with traditional FHS-HF/PCP-HF evaluation models. One researcher rated the certainty of the evidence for the primary results, and another researcher reviewed the rating.
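For illustration, the following minimal sketch shows how such a pooling can be implemented with the metafor package; the cohort names, AUROC values, and confidence limits are placeholders rather than the extracted study data, and the exact call reflects a typical random-effects workflow rather than our verbatim analysis script.

```r
# Minimal, illustrative sketch of the random-effects pooling described above.
# Values below are placeholders, not the extracted study data.
library(metafor)

dat <- data.frame(
  cohort = c("Cohort A", "Cohort B", "Cohort C"),
  auc    = c(0.78, 0.74, 0.81),  # reported AUROC / c-statistic
  lcl    = c(0.76, 0.70, 0.71),  # lower bound of reported 95% CI
  ucl    = c(0.80, 0.78, 0.91)   # upper bound of reported 95% CI
)

# Pool on the logit scale (cf. Debray et al. (18)); the standard error is
# recovered from the width of the reported 95% CI.
dat$yi  <- qlogis(dat$auc)
dat$sei <- (qlogis(dat$ucl) - qlogis(dat$lcl)) / (2 * qnorm(0.975))

# Random-effects model (REML); the fitted object also carries Cochran's Q
# (res$QE) and I^2 (res$I2) used in the heterogeneity analysis.
res <- rma(yi = yi, sei = sei, data = dat, method = "REML")

# Back-transformed summary AUROC with its 95% CI and 95% prediction interval.
predict(res, transf = plogis)
```

Pooling on the logit scale keeps the summary estimate and prediction interval within the 0–1 range before back-transformation.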

3 Results

3.1 Study selection

A systematic literature search conducted across PubMed, IEEE Xplore, Medline, and Embase identified 154 initial records (Supplementary Table 2). After the removal of duplicates, 112 unique studies underwent title and abstract screening. Of these, 20 articles were excluded due to irrelevance (n = 15) or unavailability of full texts (n = 5). The remaining 92 full-text articles were assessed for eligibility. A total of 87 studies were excluded for the following reasons: 52 were classified as reviews, commentaries, or conference papers; 16 studies did not involve HF prediction models; 12 prediction models were not based on artificial intelligence or 12-lead ECGs; and two studies included non-ECG variables in addition to ethnicity, sex, and age. Ultimately, five studies met all inclusion criteria and were included in the systematic review and meta-analysis (Figure 1). The PICOTS framework was applied to clarify the intended objectives and setting of the evaluated prediction models (Supplementary Table 3).

Figure 1

Flowchart depicting the identification and screening process of studies. Identification phase includes records from IEEE Xplore (77), PubMed (17), Medline (26), and Embase (34), totaling 154. Duplicate records removed: 42. Remaining records screened: 112. Records excluded: 15. Records sought for retrieval: 97. Not retrieved: 5. Assessed for eligibility: 92. Excluded afterwards: 87, due to reasons such as being review/commentary/conference papers (52), not heart failure models (16), not based on AI ECG (12), and including additional variables (2). Studies included in review: 5.

Flowchart of the process for including studies in the systematic review and meta-analysis.

3.2 Study characteristics

The analysis included five studies comprising 11 independent cohorts, with a total sample size of 1,728,134 participants. The cohorts were predominantly from the United States (7 of 11), with additional representation from Asia (1 of 11, TSGH), Europe (1 of 11, UKB), and South America (1 of 11, ELSA-Brasil in Brazil). Sample sizes varied considerably across the cohorts, ranging from 6,736 participants in the MESA cohort to 539,934 in the TSGH cohort. The mean age of participants ranged from 51 to 62.2 years, with female representation varying between 49.6% and 56.6%. Mean follow-up durations ranged from 3.1 to 6.8 years (Table 1).

Table 1

Study Cohort (country) HF risk prediction method Ethnicity HF cases/total patients (%) Age (years, mean ± SD) Female sex (%) Outcome Outcome coding Enrolment period (mean f/u in years) Exclusion criteria
Kaur et al. (2024) (20) SUMC (USA) Deep learning Overall (undifferentiated) 59,816/326,663 (18.3%) 59.3 (−) 162,190 (49.6%) HF SNOMED 2008–2018 (6.8) Lack of follow-up or prior HF
Asian 8,622/46,393 (18.6%) N/S N/S HF SNOMED 2008–2018 (6.8) Lack of follow-up or prior HF
Hispanic 7,042/40,301 (17.5%) N/S N/S HF SNOMED 2008–2018 (6.8) Lack of follow-up or prior HF
Non-Hispanic White 35,086/184,410 (19.0%) N/S N/S HF SNOMED 2008–2018 (6.8) Lack of follow-up or prior HF
Black or African American 3,066/13,063 (23.5%) N/S N/S HF SNOMED 2008–2018 (6.8) Lack of follow-up or prior HF
Akbilgic et al. (2021) (12) ARIC (USA) Deep learning Undifferentiated 803/14,613 (5.5%) 54 (±5) 7,978 (55%) HF ICD-9-CM 1987–1989 (−) Individuals with HF, those with missing HF data during follow-up, and those with missing or poor-quality ECGs
Dhingra et al. (2025) (13) YNHHS (USA) Deep learning Undifferentiated 9,645/231,285 (4.2%) 57 (−) 130,941 (56.6%) HF/LVEF < 50% ICD-10-CM 2014–2023 (4.5) Prior HF diagnostic codes, abnormal echocardiogram findings (LVEF < 50% or severe diastolic dysfunction), and hospitalization for HF within 3 months after baseline ECG examination
UKB (UK) Deep learning Undifferentiated 46/42,141 (0.1%) 65 (−) 21,795 (51.7%) HF ICD-10-CM 2014–2021 (3.1) Hospitalization records before baseline ECG examination showed diagnostic codes for HF
ELSA-Brasil (BRA) Deep learning Undifferentiated 31/13,454 (0.2%) 51 (−) 7,348 (54.6%) HF C-EEMRR N/S (4.2) With pre-existing HF or baseline echocardiogram showing LVEF < 50%
Lin et al. (2025) (14) TSGH (CHN) Deep learning Asian (–)/1,078,629 60.2 (±14.5) 705,153 (49.8%) HF ICD-9/10-CM N/S (5) Under 30 years old; with pacemakers; without medical records in the system within one year after ECG; with a history of HF, MI, or IS; ECGs with acquisition dates matching the date of diagnosis
Butler et al. (2023) (21) ARIC (USA) Deep learning Undifferentiated 803/14,613 (5.5%) 54 (±5) 7,978 (55%) HF ICD-9-CM 1987–1989 (N/S) Baseline HF; missing ECGs
MESA (USA) Deep learning Undifferentiated 239/6,736 (3.5%) 62.2 (±10.2) 3,555 (52.8%) HF ICD-9/10-CM or symptoms N/S Missing baseline ECGs

Characteristics of included studies.

SUMC, Stanford University Medical Center; ARIC, atherosclerosis risk in communities; YNHHS, Yale New Haven Health System; UKB, UK Biobank; ELSA-Brasil, Brazilian longitudinal study of adult health; TSGH, tri-service general hospital; MESA, multi-ethnic study of atherosclerosis; ECG, electrocardiogram; LVEF, left ventricular ejection fraction; SNOMED, systematized nomenclature of medicine; C-EEMRR, clinical events based on expert medical record review; N/S, not specified.

Table 2

Study Model Type Input Format Preprocessing Steps Validation Strategy
Kaur et al. (2024) (20) CNN-based Raw waveform Filtering, resampling Internal validation (ethnicity subgroups)
Akbilgic et al. (2021) (12) CNN-LSTM hybrid Raw waveform Noise removal, amplitude normalization Train-validation split
Dhingra et al. (2025) (13) CNN (image-based) ECG image Image standardization, lead extraction External validation (UKB, ELSA)
Lin et al. (2025) (14) Multitask DL Raw waveform Bandpass filtering, segmentation Temporal validation
Butler et al. (2023) (21) CNN-based Raw waveform Baseline wander removal, resampling Internal & external (MESA)

Methodological characteristics of AI-ECG models in included studies.

CNN, convolutional neural network; LSTM, long short-term memory; UKB, UK Biobank; ELSA-Brasil, Brazilian longitudinal study of adult health; MESA, multi-ethnic study of atherosclerosis; ECG, electrocardiogram.

All studies employed deep learning as the core model architecture. Two studies (12, 21) utilized the same ARIC cohort (n = 14,613) but reported distinct analyses. Outcome definitions varied: most studies relied on clinical HF diagnoses using ICD codes, while the YNHHS cohort (13) incorporated echocardiographic criteria, specifically left ventricular ejection fraction (LVEF) less than 50%. Exclusion criteria predominantly focused on pre-existing HF, missing ECGs, or incomplete follow-up data. Notably, none of the included studies evaluated clinical utility or cost-effectiveness, and external validation in geographically diverse settings remained limited. The largest cohort (14) derived its data from electronic health records, which may introduce unmeasured confounders, whereas smaller cohorts, such as MESA, comprised community-based populations with detailed phenotyping.

3.3 AI model methodological characteristics

Although all included studies employed deep learning architectures, there was considerable diversity in model design, input representation, and validation strategies, which may influence performance and generalizability (Table 2). Input formats varied between raw signal time-series and image-based ECG representations, which may affect feature extraction robustness. Preprocessing also differed, though most studies applied noise filtering and signal normalization. Only two studies (13, 21) included external validation cohorts, highlighting a general limitation in geographic and clinical generalizability.

3.4 Risk of bias assessment

Most AI-ECG models (80%) demonstrated a low overall risk of bias. The unclear-risk ratings stemmed primarily from ambiguous reporting of uncertainty intervals and the absence of described procedures for handling missing data (Figure 2; Supplementary Table 4).

Figure 2

Bar chart showing risk of bias across five categories: Participants and data sources, Predictors, Outcome, Analysis, and Overall. The green color indicates low risk, orange indicates unclear risk, and red indicates high risk. Most categories show a low risk, with unclear risk present in Outcome, Analysis, and Overall. No high-risk areas are noted. A legend at the bottom explains the color coding.

Risk of bias across all included studies.

3.5 Heterogeneity and sensitivity analysis

Significant heterogeneity was observed across studies (I² = 89%, Cochran's Q = 98.7, p < 0.001). Funnel plots were symmetrical but exhibited additional horizontal scatter (Figure 3), which is consistent with the presence of between-study heterogeneity. This heterogeneity was driven by three primary factors: (1) variability in outcome definitions (e.g., clinical HF vs. HF with LVEF < 50%); (2) diversity in data sources (hospital-based vs. community cohorts with differing baseline risk profiles); and (3) methodological inconsistencies in ECG preprocessing and model hyperparameter optimization. To assess the robustness of the pooled AUROC [0.76 (0.74–0.78)], we conducted a leave-one-out sensitivity analysis. After excluding any single study, the pooled estimates showed minimal changes (range: 0.75–0.77). When the Dhingra (13) (ELSA-Brasil) cohort with the highest AUROC was excluded, the pooled estimate decreased slightly to 0.75 [0.73–0.77]. Upon excluding the UKB cohort with the lowest precision, the confidence interval narrowed to 0.76 [0.74–0.78]. The sensitivity analysis confirmed the robustness of the pooled results.

Figure 3

Funnel plot of AUROC for heart failure prediction shows points distributed within and outside the inverted funnel shape. The x-axis represents AUROC values ranging from 0.65 to 0.85, and the y-axis represents standard error from zero to 0.05. The plot assesses study precision and potential bias.

Funnel plot of all models included in primary meta-analysis. AUROC, area under the receiver operating characteristic curve.
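The heterogeneity and sensitivity checks reported above can be reproduced along the following lines; this sketch assumes `res` is a fitted metafor random-effects object such as the one outlined in Section 2.3, and the specific calls are illustrative rather than our exact analysis script.

```r
# Plausibility check: with Cochran's Q = 98.7 and df = 11 - 1 = 10 cohorts,
# (Q - df) / Q is roughly 0.90, in line with the reported I^2 of 89%.
library(metafor)
(98.7 - 10) / 98.7

# Heterogeneity statistics stored on the fitted random-effects model.
res$QE; res$I2

# Leave-one-out sensitivity analysis: refit the model omitting one cohort at a
# time and inspect the range of back-transformed pooled estimates.
l1o <- leave1out(res, transf = plogis)
range(l1o$estimate)

# Funnel plot of the logit-AUROC estimates against their standard errors.
funnel(res)
```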

3.6 Heart failure outcome definition

The definition of HF, the primary outcome, varied across the included studies. Most studies defined HF based on clinical diagnoses recorded in hospital administrative databases using ICD codes. In contrast, one study (13; YNHHS cohort) applied a more stringent, imaging-based criterion, defining HF as an LVEF of less than 50%. This methodological distinction between broadly defined, clinically coded HF and a more specific, phenotypically defined HF was identified a priori as a major potential source of clinical heterogeneity.

3.7 Model performance

The pooled AUROC across all cohorts was 0.76 (95% CI: 0.74–0.78; p < 0.001), indicating moderate-to-good discriminatory ability for HF prediction (Figure 4). Subgroup analyses revealed nuanced variations: among the ethnic subgroups reported in one study (20), performance was similar, with AUROCs of 0.78 (0.77–0.79) in Asian, 0.79 (0.79–0.80) in Hispanic, and 0.77 (0.77–0.78) in non-Hispanic White participants, suggesting minimal ethnic disparity. Undifferentiated cohorts (e.g., ARIC and MESA) exhibited consistent AUROCs (0.756–0.77) despite differences in sample size and HF prevalence (3.5%–5.5%). Geographic variability highlighted challenges in low-incidence settings [e.g., UKB: AUROC 0.769 (0.670–0.867), HF incidence 0.1%], whereas smaller cohorts with limited events (e.g., ELSA-Brasil: 31 events) achieved higher point estimates [AUROC 0.810 (0.714–0.907)] but with wider confidence intervals, underscoring the impact of sample size and event frequency on precision.

Figure 4

Forest plot showing c-statistic/AUROC for heart failure prediction using AI-ECG models. It includes several studies with varying event counts and confidence intervals. A red diamond indicates the summary estimate with a dashed line for prediction interval. The reference line is at a summary estimate of zero point seven six.

Primary analysis: meta-analysis of c-statistics/AUROC. SUMC, Stanford University Medical Center; ARIC, atherosclerosis risk in communities; YNHHS, Yale New Haven health system; UKB, UK Biobank; ELSA-Brasil, Brazilian longitudinal study of adult health; TSGH, tri-service general hospital; MESA, multi-ethnic study of atherosclerosis; AUROC, area under the receiver operating characteristic curve.

3.8 Comparison with traditional models

In the head-to-head comparison between AI-ECG and traditional FHS-HF/PCP-HF risk scores across 5 cohorts from 3 studies, the pooled AUROC was 0.757 (95% CI: 0.730–0.782) for AI-ECG models vs. 0.742 (95% CI: 0.692–0.787) for traditional models; the overlapping confidence intervals and a non-significant test for subgroup differences (P = 0.575) indicate no consistent statistical superiority for the AI-ECG approach (Figure 5). Despite this overall equivalence, population-specific advantages emerged: the AI-ECG model showed significantly superior performance in the large Asian TSGH cohort (AUROC 0.75 vs. 0.61, P < 0.001) and numerically higher discrimination in the MESA cohort, whereas performance remained comparable in the ARIC and other SUMC sub-cohorts. The wide 95% prediction intervals for both AI-ECG (0.649–0.838) and traditional models (0.612–0.838) reflect considerable uncertainty about their relative performance in new settings. From a clinical perspective, the observed absolute AUROC difference of approximately 0.015, while not statistically significant overall, may still be relevant in population-wide screening, where even modest improvements in discrimination can meaningfully reclassify risk for a substantial number of individuals and potentially enable more targeted preventive interventions.

Figure 5

Comparative forest plot for AI-ECG and FHS-HF/PCP-HF with 95% confidence intervals. Studies listed include Akbilgic 2021, Butler 2023, Lin 2025, and Dhingra 2025. Each study displays metric values for AI-ECG in blue diamonds and FHS-HF/PCP-HF in red circles, with summary estimates and prediction intervals. Horizontal axis represents metric values, ranging from 0.60 to 0.90.

Meta-analysis of AI-ECG model and FHS-HF/PCP-HF model c-statistics/AUROC. SUMC, Stanford University Medical Center; ARIC, atherosclerosis risk in communities; MESA, multi-ethnic study of atherosclerosis; ELSA-Brasil, Brazilian longitudinal study of adult health; AUROC, area under the receiver operating characteristic curve; PCP-HF, pooled cohort equations to prevent HF; FHS, Framingham heart study; AI-ECG, artificial intelligence electrocardiogram.
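A test for subgroup differences of this kind can be framed as a meta-regression with model type as a moderator. The sketch below is a hypothetical illustration using metafor; the AUROC values, standard errors, and grouping variable are placeholders, not the extracted AI-ECG and FHS-HF/PCP-HF data.

```r
# Hypothetical illustration of a subgroup-difference test (cf. P = 0.575 above):
# a mixed-effects meta-regression with model type as the moderator.
library(metafor)

cmp <- data.frame(
  yi    = qlogis(c(0.75, 0.77, 0.78, 0.61, 0.72, 0.74)),  # logit AUROC (placeholders)
  sei   = c(0.04, 0.05, 0.05, 0.03, 0.06, 0.04),          # standard errors (placeholders)
  model = c("AI-ECG", "AI-ECG", "AI-ECG",
            "Traditional", "Traditional", "Traditional")
)

fit <- rma(yi = yi, sei = sei, mods = ~ model, data = cmp, method = "REML")

# QM is the omnibus test of the moderator and QMp its p value, i.e., the test
# for a difference in pooled discrimination between the two model types.
fit$QM
fit$QMp
```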

3.9 Subgroup analysis by heart failure definition

Subgroup analysis based on heart failure (HF) definition revealed consistent performance of AI-ECG models across different diagnostic criteria. The diagnostic code-defined group (7 cohorts, n = 951,654) demonstrated a pooled AUROC of 0.765 (95% CI: 0.743–0.786), indicating robust discriminative ability (Figure 6). Substantial heterogeneity was observed (I² = 89%), attributable to variations in population characteristics, study design, and healthcare settings. The imaging-defined group (LVEF < 50%, 1 cohort, n = 231,285) showed comparable performance, with an AUROC of 0.810 (0.800–0.820). Because the point estimate of the imaging-defined group fell within the prediction interval of the diagnostic code group (0.692–0.826), there is no evidence of a statistically significant difference in model performance between definition types, though this interpretation is limited by the single study in the imaging subgroup.

Figure 6

Subgroup analysis graph of AI-ECG model performance by heart failure definition with studies/cohorts listed on the left. The x-axis represents AUROC with a ninety-five percent confidence interval ranging from 0.60 to 0.90. The graph includes blue circles for diagnostic code definitions, a red triangle for an imaging definition, and green squares for summary estimates. Each point displays an AUROC value with its confidence interval.

Subgroup analysis by heart failure definition: diagnostic code vs. imaging criteria. AUROC, area under the receiver operating characteristic curve.

These findings indicate that AI-ECG models maintain predictive utility regardless of HF definition methodology. The diagnostic code-based approach offers practical advantages for large-scale screening applications, while imaging-based definitions may provide more objective endpoint adjudication. The observed heterogeneity underscores the need for standardized outcome definitions and validation protocols in future multicenter studies. Further research should prioritize inclusion of additional imaging-defined cohorts to strengthen comparative analyses and evaluate definition-specific impacts on model calibration and clinical utility.

4 Discussion and conclusion

4.1 Principal findings and clinical implications

This systematic review and meta-analysis demonstrates that AI-ECG models possess moderate to good discriminatory ability for predicting heart failure (HF), with a pooled AUROC of 0.76 (95% CI: 0.74–0.78). This performance is comparable to established clinical risk scores such as the FHS-HF model (AUROC 0.742), suggesting that AI-ECG holds promise as a non-invasive and cost-effective tool. Its operational simplicity—leveraging ubiquitously available ECG data without the need for invasive blood tests or complex clinical parameters—positions it as an ideal candidate for rapid point-of-care screening and global cardiovascular risk reduction strategies, particularly in resource-limited areas.

4.2 Performance generalizability and subgroup analyses

Critically, the potential of AI-ECG is underscored by its consistent performance across ethnic groups in our subgroup analyses (AUROC range: 0.77–0.79). Notably, while the technology did not demonstrate universal superiority over traditional models, it showed enhanced performance in specific cohorts, such as the large Asian TSGH population. This variability suggests that AI-ECG may offer the greatest incremental value in populations with specific baseline risk profiles or where traditional risk factors are not fully captured. Therefore, AI-ECG should be viewed not as a blanket replacement for established scores, but as a complementary tool with particular utility in certain contexts.

4.3 Sources of heterogeneity and methodological challenges

Significant heterogeneity was observed across the included studies (I2 = 89%), which can be attributed to several methodological factors. First, inconsistent definitions of heart failure outcomes represent a principal source of variability. While the majority of studies relied on ICD codes extracted from electronic health records, others adopted imaging-confirmed criteria such as left ventricular ejection fraction (LVEF). This fundamental discrepancy in endpoint ascertainment not only limits the direct comparability of model performance but also introduces potential misclassification bias, particularly concerning heart failure with preserved ejection fraction (HFpEF).

Additionally, the diversity in ECG data sources and processing methodologies further contributes to the observed heterogeneity. The performance and generalizability of AI-based ECG models are intrinsically tied to the quality and consistency of the input data. In real-world settings, ECG signals are acquired from highly heterogeneous sources, including native digital recordings from standard 12-lead machines, digitized scans of paper printouts, or abbreviated waveforms from wearable monitoring devices. Consequently, disparities in acquisition hardware, sampling frequency, filter settings, and data format may introduce technical confounders that substantially influence model output.

Finally, cohort-specific performance variations underscore this inconsistency. For instance, larger cohorts derived from real-world electronic health record data demonstrated marginally lower AUROC values, potentially attributable to unmeasured confounders or data noise. Notably, the performance in specific subgroups, such as young Black women in one study (20), showed a significant decline. Collectively, these factors highlight the persistent challenges in methodological standardization and the necessity for unified protocols in data acquisition, model validation, and outcome reporting to improve reproducibility.

4.4 Limitations and imperatives for future research

Our analysis has several key limitations that chart a course for future work, including the absence of clinical utility assessments, the inherent “black-box” nature of deep learning models (22), and potential publication bias. To address these gaps, an urgent priority is the design and execution of large-scale, prospective, multi-center trials to determine whether AI-ECG-guided risk stratification actually reduces heart failure incidence or improves patient outcomes compared with standard care, which is a prerequisite for widespread adoption. Beyond establishing efficacy, future efforts must extend into implementation science, focusing on the seamless integration of these algorithms into clinical workflows. This entails developing interoperable systems that embed AI-ECG analysis within electronic health records and provide intuitive decision support to clinicians at the point of care. Furthermore, the sustainable implementation of AI-ECG hinges on demonstrating its long-term value and equity. This mandates rigorous health economic analyses, particularly in low-resource settings where its operational advantages could be most impactful, coupled with proactive measures to ensure fairness and transparency. Independent third-party validation and periodic auditing are indispensable to mitigate performance disparities across ethnic and socioeconomic groups and to prevent the exacerbation of existing health inequities. Adopting privacy-preserving frameworks, such as federated learning, could further facilitate continuous model refinement across diverse institutions.

4.5 Conclusion

In conclusion, AI-ECG models demonstrate moderate-to-good discriminatory ability for predicting heart failure, with performance comparable to traditional risk models. The technology shows particular promise as a non-invasive, cost-effective screening tool, especially in environments where traditional risk factor collection is challenging. Future research should prioritize robust prospective validation, seamless clinical integration, and a steadfast commitment to equitable deployment. Ultimately, the goal is to identify the specific patient populations and clinical scenarios where AI-ECG provides the greatest incremental value, thereby enabling its transition from a research tool to a clinically actionable solution for personalized HF prevention.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

SZ: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing. JJ: Conceptualization, Investigation, Software, Writing – original draft, Writing – review & editing. YL: Conceptualization, Investigation, Software, Writing – original draft, Writing – review & editing. GL: Writing – original draft, Writing – review & editing. SH: Writing – original draft, Writing – review & editing. SW: Writing – original draft, Writing – review & editing. CL: Writing – original draft, Writing – review & editing. HL: Writing – original draft, Writing – review & editing. NL: Writing – original draft, Writing – review & editing. LZ: Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declares that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcvm.2025.1659298/full#supplementary-material

References

  • 1.

    GBD 2017 Causes of Death Collaborators. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the global burden of disease study 2017. Lancet. (2018) 392(10159):1736–88. 10.1016/S0140-6736(18)32203-7. Erratum in: Lancet. (2019) 393(10190):e44. doi: 10.1016/S0140-6736(19)31049-9. Erratum in: Lancet. (2018) 392(10160):2170. doi: 10.1016/S0140-6736(18)32833-2

  • 2.

    Celik A, Ural D, Sahin A, Colluoglu IT, Kanik EA, Ata N, et al. Trends in heart failure between 2016 and 2022 in Türkiye (TRends-HF): a nationwide retrospective cohort study of 85 million individuals across entire population of all ages. Lancet Reg Health Eur. (2023) 33:100723. 10.1016/j.lanepe.2023.100723

  • 3.

    Conrad N, Judge A, Canoy D, Tran J, Pinho-Gomes AC, Millett ERC, et al. Temporal trends and patterns in mortality after incident heart failure: a longitudinal analysis of 86 000 individuals. JAMA Cardiol. (2019) 4(11):1102–11. 10.1001/jamacardio.2019.3593

  • 4.

    McDonagh TA, Metra M, Adamo M, Gardner RS, Baumbach A, Böhm M, et al. 2021 ESC guidelines for the diagnosis and treatment of acute and chronic heart failure: developed by the task force for the diagnosis and treatment of acute and chronic heart failure of the European Society of Cardiology (ESC) with the special contribution of the heart failure association (HFA) of the ESC. Eur J Heart Fail. (2022) 24(1):4–131. 10.1002/ejhf.2333

  • 5.

    Heidenreich PA, Bozkurt B, Aguilar D, Allen LA, Byun JJ, Colvin MM, et al. 2022 AHA/ACC/HFSA guideline for the management of heart failure: a report of the American College of Cardiology/American Heart Association joint committee on clinical practice guidelines. Circulation. (2022) 145(18):e895–1032. 10.1161/CIR.0000000000001063

  • 6.

    Yan I, Börschel CS, Neumann JT, Sprünker NA, Makarova N, Kontto J, et al. High-sensitivity cardiac troponin I levels and prediction of heart failure: results from the BiomarCaRE consortium. JACC Heart Fail. (2020) 8(5):401–11. 10.1016/j.jchf.2019.12.008

  • 7.

    Neumann JT, Twerenbold R, Weimann J, Ballantyne CM, Benjamin EJ, Costanzo S, et al. Prognostic value of cardiovascular biomarkers in the population. JAMA. (2024) 331(22):1898–909. 10.1001/jama.2024.5596

  • 8.

    Khan SS, Ning H, Shah SJ, Yancy CW, Carnethon M, Berry JD, et al. 10-year risk equations for incident heart failure in the general population. J Am Coll Cardiol. (2019) 73(19):2388–97. 10.1016/j.jacc.2019.02.057

  • 9.

    Pandey A, Khan MS, Patel KV, Bhatt DL, Verma S. Predicting and preventing heart failure in type 2 diabetes. Lancet Diabetes Endocrinol. (2023) 11(8):607–24. 10.1016/S2213-8587(23)00128-6

  • 10.

    Khurshid S, Friedman S, Reeder C, Di Achille P, Diamant N, Singh P, et al. ECG-based deep learning and clinical risk factors to predict atrial fibrillation. Circulation. (2022) 145(2):122–33. 10.1161/CIRCULATIONAHA.121.057480

  • 11.

    Jabbour G, Nolin-Lapalme A, Tastet O, Corbin D, Jordà P, Sowa A, et al. Prediction of incident atrial fibrillation using deep learning, clinical models, and polygenic scores. Eur Heart J. (2024) 45(46):4920–34. 10.1093/eurheartj/ehae595

  • 12.

    Akbilgic O, Butler L, Karabayir I, Chang PP, Kitzman DW, Alonso A, et al. ECG-AI: electrocardiographic artificial intelligence model for prediction of heart failure. Eur Heart J Digit Health. (2021) 2(4):626–34. 10.1093/ehjdh/ztab080

  • 13.

    Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Asselbergs FW, Brant LCC, et al. Heart failure risk stratification using artificial intelligence applied to electrocardiogram images: a multinational study. Eur Heart J. (2025) 46(11):1044–53. 10.1093/eurheartj/ehae914

  • 14.

    Lin CH, Liu ZY, Chu PH, Chen JS, Wu HH, Wen MS, et al. A multitask deep learning model utilizing electrocardiograms for major cardiovascular adverse events prediction. NPJ Digit Med. (2025) 8(1):1. 10.1038/s41746-024-01410-3

  • 15.

    Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. PLoS Med. (2021) 18(3):e1003583. 10.1371/journal.pmed.1003583

  • 16.

    Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med. (2014) 11(10):e1001744. 10.1371/journal.pmed.1001744

  • 17.

    Moons KGM, Damen JAA, Kaul T, Hooft L, Andaur Navarro C, Dhiman P, et al. PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. (2025) 388:e082505. 10.1136/bmj-2024-082505

  • 18.

    Debray TP, Damen JA, Snell KI, Ensor J, Hooft L, Reitsma JB, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ. (2017) 356:i6460. 10.1136/bmj.i6460

  • 19.

    Lloyd-Jones DM. Cardiovascular risk prediction: basic concepts, current status, and future directions. Circulation. (2010) 121(15):1768–77. 10.1161/CIRCULATIONAHA.109.849166

  • 20.

    Kaur D, Hughes JW, Rogers AJ, Kang G, Narayan SM, Ashley EA, et al. Race, sex, and age disparities in the performance of ECG deep learning models predicting heart failure. Circ Heart Fail. (2024) 17(1):e010879. 10.1161/CIRCHEARTFAILURE.123.010879

  • 21.

    Butler L, Karabayir I, Kitzman DW, Alonso A, Tison GH, Chen LY, et al. A generalizable electrocardiogram-based artificial intelligence model for 10-year heart failure risk prediction. Cardiovasc Digit Health J. (2023) 4(6):183–90. 10.1016/j.cvdhj.2023.11.003

  • 22.

    Han X, Hu Y, Foschini L, Chinitz L, Jankelson L, Ranganath R. Deep learning models for electrocardiograms are susceptible to adversarial attack. Nat Med. (2020) 26(3):360–3.

Keywords

artificial intelligence, electrocardiogram (ECG), heart failure, deep learning, predictive modelling, meta-analysis

Citation

Zhang S, Jiang J, Luo Y, Liu G, Hu S, Wan S, Luo C, Li H, Li N and Zhao L (2026) Artificial intelligence in electrocardiogram-based prediction of heart failure: a systematic review and meta-analysis. Front. Cardiovasc. Med. 12:1659298. doi: 10.3389/fcvm.2025.1659298

Received

15 September 2025

Revised

15 November 2025

Accepted

03 December 2025

Published

02 January 2026

Volume

12 - 2025

Edited by

Chim Lang, University of Dundee, United Kingdom

Reviewed by

Gian Luigi Nicolosi, San Giorgio Hospital, Italy

Ignatius Ivan, Siloam Hospitals, Indonesia

Copyright

* Correspondence: LinYong Zhao

ORCID LinYong Zhao orcid.org/0000-0003-0884-4657

