As the complexity and scale of scientific, AI, and simulation workloads continue to grow, modern high-performance computing (HPC) systems are evolving into distributed, heterogeneous environments that span traditional data centers, edge devices, and geo-distributed resources. This trend enables unprecedented capabilities for real-time analytics, digital twin systems, and AI-augmented simulations. However, it also introduces new challenges in scalability, fault tolerance, data management, and system-wide resilience. The ability to reliably operate and adapt across heterogeneous hardware, network, and software stacks is critical for future HPC systems.
Key challenges include ensuring fault-tolerance in dynamic, distributed environments; enabling federated learning across edge and HPC nodes while preserving privacy and performance; designing adaptive data reduction techniques to handle massive, real-time data streams; and developing robust frameworks for fault injection testing and resilience evaluation. Moreover, efficient resource management, scalable simulation frameworks, and intelligent data visualization techniques are essential for enabling next-generation applications—ranging from digital twins for scientific discovery to large-scale AI model training and multi-physics simulations.
This Research Topic seeks to explore novel solutions, frameworks, and architectures that enable scalable, distributed, and resilient computing across HPC systems and HPC-Edge Continuums. We aim to foster discussions that bridge theory and practice, enabling fault-tolerant and adaptive computing for emerging workloads, including new scientific applications, federated learning, real-time data analytics, and digital twins. Contributions that address both foundational challenges and real-world implementations are highly encouraged.
We welcome original research articles, reviews, perspectives, and case studies related to scalable, distributed, and resilient HPC systems. Topics of interest include, but are not limited to:
• Scalable resource and job scheduling across HPC systems and HPC-Edge Continuums
• Fault-tolerant architectures, frameworks, and fault injection techniques for resilience evaluation
• Federated learning frameworks integrated with distributed HPC systems
• Adaptive data reduction and compression techniques for large-scale analytics
• Resilient and efficient communication protocols and network optimizations
• Distributed data management, movement, and visualization for extreme-scale systems
• AI/ML-driven approaches for fault detection, system optimization, and dynamic adaptation
• Resilience and scalability challenges in simulation frameworks, digital twins, and geo-distributed computing
• Performance and resilience benchmarking and system modeling for HPC environments
Submissions that combine theoretical innovations with practical system designs, prototypes, or case studies in real-world scientific and engineering applications are particularly encouraged.
Article types and fees
This Research Topic accepts the following article types, unless otherwise specified in the Research Topic description:
Brief Research Report
Community Case Study
Conceptual Analysis
Data Report
Editorial
FAIR² Data
Hypothesis and Theory
Methods
Mini Review
Articles accepted for publication by our external editors following rigorous peer review incur a publishing fee, charged to authors, institutions, or funders.
Important note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.