ORIGINAL RESEARCH article

Front. Earth Sci.

Sec. Geohazards and Georisks

Volume 13 - 2025 | doi: 10.3389/feart.2025.1642791

This article is part of the Research Topic: Natural Disaster Prediction Based on Experimental and Numerical Methods

Enhancing Landslide Dam Stability Prediction: A Data-Driven Framework Integrating Missing Data Imputation and Optimal Threshold Discrimination

Provisionally accepted
  • 1 Shanxi Vocational University of Engineering Science and Technology, Jinzhong, China
  • 2 China Academy of Building Research, Beijing, China
  • 3 Hebei University of Technology, Beichen District, China
  • 4 Hebei University of Technology, Tianjin, China
  • 5 Hebei Polytechnic of Building Materials, Qinhuangdao, China

The final, formatted version of the article will be published soon.

Accurate prediction of landslide dam stability is crucial for preventing and mitigating threats to downstream communities. However, the development of reliable predictive models is hindered by landslide dam inventories that are incomplete because of missing data. To overcome this challenge, this study proposes a data-driven approach that incorporates missing data imputation to enhance the applicability and accuracy of landslide dam stability predictions. Using a collected landslide dam inventory of 518 cases with a missing-data rate of 25%, several imputation methods, namely generative adversarial imputation nets (GAIN), missForest, multiple imputation by chained equations (MICE), K-nearest neighbors (KNN), and mean/most-frequent (MMF) imputation, were used to estimate the missing values and improve the completeness of the dataset. The imputed datasets were then used to predict landslide dam stability with four machine learning approaches: support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), and logistic regression (LR). Our key innovation lies in coupling GAIN with SVM, enhanced by Youden-index-based threshold optimization for stability classification. The results demonstrate GAIN's superiority among the imputation methods: it achieved the lowest RMSE (0.205) for continuous variables and an accuracy of 66.0% for categorical variables. The GAIN-SVM combination yielded the highest predictive performance (AUC = 0.823), outperforming traditional methods by 15.2%. Threshold optimization with the Youden index further improved classification accuracy by 3.1% to 9.3% for ambiguous cases (predicted probabilities near 0.5), addressing a critical gap in existing models. The framework enables rapid and reliable stability assessments even with incomplete field data, supporting emergency decision-making and timely hazard mitigation in landslide-prone and data-scarce regions.
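As an illustration of the threshold-optimization step mentioned in the abstract, the following is a minimal sketch of selecting a decision threshold by maximizing the Youden index, J = sensitivity + specificity - 1. It is not the authors' implementation: the function name `youden_threshold`, the toy labels and probabilities, and the convention that label 1 is the positive class are all assumptions made for this example.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) that maximizes J = sensitivity + specificity - 1.

    y_true: 0/1 labels (1 is assumed to be the positive class).
    y_prob: predicted probabilities of the positive class.
    A case is predicted positive when its probability is >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_t, best_j = 0.5, -1.0
    # Each distinct predicted probability is a candidate threshold.
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # keep the lowest threshold among ties
            best_t, best_j = t, j
    return best_t, best_j


def accuracy(y_true, y_prob, t):
    """Fraction of cases classified correctly at threshold t."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p >= t) == (y == 1))
    return hits / len(y_true)


# Hypothetical toy data with two ambiguous cases near 0.5.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.25, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]

threshold, j = youden_threshold(y_true, y_prob)
print(threshold, j)
print(accuracy(y_true, y_prob, 0.5), accuracy(y_true, y_prob, threshold))
```

In this toy example the optimized threshold falls below the default cutoff of 0.5, so the positive case with a predicted probability of 0.45 is recovered. This is the kind of near-0.5 case that the abstract describes as ambiguous.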

Keywords: Landslide dam stability, Missing data imputation, Generative adversarial imputation nets, Machine learning, Threshold optimization, Geohazard risk assessment

Received: 07 Jun 2025; Accepted: 10 Jul 2025.

Copyright: © 2025 Li, Zhang, He, Song and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Jun He, Hebei University of Technology, Beichen District, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.