Frontiers | Advances in Fault Tolerance for Large-scale HPC Systems

About this Research Topic

This Research Topic is still accepting articles.

Background

High performance computing (HPC) has played a major role in the modelling and simulation of grand challenge problems across many scientific domains for decades. The significant scientific outcomes that HPC has yielded in these domains have encouraged the community to build ever-larger HPC systems for exploring large-scale problems. The advent of the exascale era, marked by the deployment of the Frontier system in 2022, highlights the recent expansion in HPC capabilities. Such large-scale systems are built through the intricate integration of a large number of different components, including processors, memory units, accelerators, interconnects, storage systems and many others, resulting in highly complex architectures. This scale and complexity invariably lead to component faults, which increase manyfold with system size. It has been shown that in modern large-scale systems, the mean time between component failures can be less than an hour. If these faults are not addressed, they lead to application failures, resulting in wasted resources and, as has been shown, up to 20% higher energy consumption. Thus, the HPC community faces a clear and present challenge: providing sustained, long-running executions for the grand challenge applications for which ever-larger systems continue to be built. Accordingly, fault tolerance in large-scale HPC systems was identified as one of the top 10 exascale research challenges in a US Department of Energy (DOE) report published in 2014.

Providing fault tolerance requires a comprehensive set of research methods that address several challenges. Fault tolerance solutions should incur low performance and energy overheads. Mechanisms that rely on fault prediction must cope with the growing difficulty of making predictions on systems that keep evolving, with a myriad of increasingly complex components. While early efforts predominantly targeted hardware failures, solutions for software-related faults and silent data corruptions (SDCs) have become increasingly essential in recent years, owing to the demands for accuracy in machine learning and allied fields as well as in numerical methods. As systems are built with smaller transistors and higher circuit density, bit flips and SDCs have become commonplace. In addition to these hardware-related causes, software complexity and varying reliability across inputs have also drawn close attention to SDCs.

This Research Topic invites papers that present research methodologies on various aspects of fault tolerance for both hardware- and software-related faults. To gather further insights into effective fault management strategies that scale to exascale environments, we welcome articles addressing, but not limited to, the following themes:
• Checkpointing/recovery optimization techniques such as asynchronous and multi-level checkpointing.
• Proactive fault tolerance techniques, including live migration and just-in-time checkpointing.
• Alternative techniques, including replication strategies and algorithm-based fault tolerance.
• Resilience techniques across diverse platforms such as traditional HPC setups, cloud frameworks, and edge devices.
• Application-specific resilience techniques, particularly for AI applications.
• Best practices for maintenance and usage of failure logs in HPC systems.
• Analyzing failure logs, diagnosing and characterizing failures, correlating failure events.
• Predictive models for component, node, and system failures with sufficient lead times.
• Strategies for the analysis and mitigation of silent data corruptions (SDCs) and other software-related faults, especially in AI applications.

Preference will be given to works that bring several of the above items together towards the development of comprehensive fault tolerance frameworks.

Topic Editor Sathish Vadhiyar has a funded project with Shell India focused on parallel frameworks for AI that involves developing fault tolerance for AI models. All other Topic Editors declare no conflicts of interest.

Article types and fees

This Research Topic accepts the following article types, unless otherwise specified in the Research Topic description:

Brief Research Report
Community Case Study
Conceptual Analysis
Data Report
Editorial
FAIR² Data
Hypothesis and Theory
Methods
Mini Review

Articles that are accepted for publication by our external editors following rigorous peer review incur a publishing fee charged to authors, institutions, or funders.

Keywords: Fault tolerance, checkpointing, soft errors or silent data corruption, replication, failure logs and predictions

Important note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.

Frontiers in High Performance Computing

Parallel and Distributed Software

Manuscripts can be submitted to this Research Topic via the main journal or any other participating journal.

Topic editors

Sathish Vadhiyar

Zizhong Chen

Bogdan Nicolae