<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
      <channel>
        <title>Frontiers in High Performance Computing | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/high-performance-computing</link>
        <description>RSS Feed for Frontiers in High Performance Computing | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator, version 1</generator>
        <pubDate>19 Apr 2026 21:27:57 GMT</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1664774</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1664774</link>
        <title><![CDATA[Toward energy-efficiency: CNTD_MERIC approach for energy-aware MPI applications]]></title>
        <pubDate>08 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Kashaf Ad Dooja</author><author>Osman Yasal</author><author>Ondrej Vysocky</author><author>Lubomir Riha</author><author>Daniele Cesarini</author><author>Andrea Bartolini</author>
        <description><![CDATA[Energy efficiency is a major challenge in High-Performance Computing (HPC) systems, limiting their scale, performance, and sustainability. Despite technological and research progress, there is still a lack of software methods to measure and assess the energy efficiency of computing codes at scale. This is also exacerbated by the emergence of newer ISAs in the HPC computing spectrum with non-unified interfaces for power and energy monitoring. In this work, we present CNTD_MERIC, which integrates two state-of-the-art energy monitoring and optimization libraries for HPC systems, COUNTDOWN and MERIC. COUNTDOWN is an energy-aware runtime system for MPI applications. MERIC is a platform-agnostic runtime system and energy measurement library that optimizes energy efficiency by adjusting hardware configurations. CNTD_MERIC combines the benefits of these two approaches with low overhead, resulting in a portable power management runtime system for MPI applications. We evaluated the integrated library on both ARM and x86 compute nodes in the production environment of the IT4Innovations supercomputing center (IT4I). The results show that CNTD_MERIC achieves similar performance to the original COUNTDOWN and MERIC implementations in terms of energy optimization and power/energy measurement, with negligible overheads within −5% to +3% compared to the original COUNTDOWN configurations. We also used CNTD_MERIC for a multi-architecture (x86 and ARM) comparison between Intel Sapphire Rapids and A64FX processors. The results indicate that A64FX achieves significantly lower execution time, reduced energy-to-solution, and lower average power consumption (110–132 vs. 400–590 W), confirming its efficiency for energy-efficient HPC systems.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1778471</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1778471</link>
        <title><![CDATA[Scalable foundation models for numerical simulations on HPC platforms]]></title>
        <pubDate>26 Mar 2026 00:00:00 GMT</pubDate>
        <category>Opinion</category>
        <author>Dali Wang</author><author>Qian Gong</author><author>Zirui Liu</author><author>Xiao Wang</author><author>Qinglei Cao</author><author>Scott Klasky</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1709051</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1709051</link>
        <title><![CDATA[The Score-P performance tools ecosystem]]></title>
        <pubDate>24 Mar 2026 00:00:00 GMT</pubDate>
        <category>Technology and Code</category>
        <author>Christian Feld</author><author>Alexandru Calotoiu</author><author>Gregor Corbin</author><author>Markus Geimer</author><author>Marc-André Hermanns</author><author>Maximilian Knespel</author><author>Bernd Mohr</author><author>Jan André Reuter</author><author>Maximilian Sander</author><author>Pavel Saviankou</author><author>Marc Schlütter</author><author>Robert Schöne</author><author>Sameer S. Shende</author><author>Anke Visser</author><author>Bert Wesarg</author><author>William R. Williams</author><author>Felix Wolf</author><author>Brian J. N. Wylie</author><author>Mikhail Zarubin</author>
        <description><![CDATA[With the first exascale computing systems in production, tuning and scaling HPC applications to fully utilize the available hardware resources has become more important than ever. Thus, there is a strong need for software tools that assist application developers with this task. The Score-P instrumentation and measurement infrastructure plays a major role in filling this gap. Score-P is a community-driven, highly scalable tool suite for profiling and event tracing of massively parallel HPC application codes, and aims to be easy to use. It provides measurement data via common data formats and runtime interfaces for a variety of complementary analysis tools developed by multiple institutions and companies, allowing users to gain insights into the communication, synchronization, input/output, and scaling behavior of their applications, pinpointing performance bottlenecks and their causes. In this article, we provide an overview of the current state of the Score-P infrastructure and its related tool ecosystem: Cube, Extra-P, TAU, Scalasca, and Vampir. In particular, we detail Score-P's current design and architecture, both of which are highly flexible and extensible. Moreover, we describe how Score-P interacts with the analysis tools mentioned above and highlight the major extensions implemented over the past 10+ years to keep pace with the rapidly changing landscape of HPC hardware and parallel application programming interfaces. Furthermore, we discuss emerging challenges, particularly with respect to the ever-growing heterogeneity in both hardware and software, for collecting and analyzing performance data from applications running on future top-tier computing systems.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1714042</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1714042</link>
        <title><![CDATA[Extra-P—Empirical performance modeling made easy]]></title>
        <pubDate>11 Mar 2026 00:00:00 GMT</pubDate>
        <category>Technology and Code</category>
        <author>Alexandru Calotoiu</author><author>Marcin Copik</author><author>Fabian Czappa</author><author>Alexander Geiss</author><author>Gustavo de Morais</author><author>Marcus Ritter</author><author>Sergei Shudler</author><author>Torsten Hoefler</author><author>Felix Wolf</author>
        <description><![CDATA[High-performance computing (HPC) applications face challenges in achieving scalability, with bottlenecks often discovered only late in the development cycle. Performance modeling offers a means to predict and understand scalability, but analytical approaches require deep expertise and are often impractical for large, complex codes. To address this, the Extra-P project provides a user-friendly tool for empirical performance modeling, enabling automated model generation from a small number of carefully selected experiments. This paper presents an overview of Extra-P, its underlying methodology—the Performance Model Normal Form (PMNF)—and its evolution into a mature tool for detecting and analyzing scalability issues. We discuss strategies to reduce experiment costs through parameter selection, sparse modeling, and Gaussian process regression, as well as techniques for mitigating the impact of noise using iterative refinement and deep learning. Furthermore, we highlight novel use cases, including segmented modeling and validation of user expectations, and demonstrate how Extra-P can uncover hidden bottlenecks in real-world applications such as HOMME or MPI libraries. Finally, we outline the software's architecture and future directions, emphasizing the potential for integration with AI-driven methods and adaptation to increasingly heterogeneous hardware.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1771927</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2026.1771927</link>
        <title><![CDATA[OpenMP-annotated code dataset for large language model fine-tuning on parallel programming tasks]]></title>
        <pubDate>23 Feb 2026 00:00:00 GMT</pubDate>
        <category>Data Report</category>
        <author>Nichole Etienne</author><author>Simon Garcia de Gonzalo</author><author>Dorian Arnold</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1638924</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1638924</link>
        <title><![CDATA[Improving I/O phase predictions in FTIO using hybrid wavelet-Fourier analysis]]></title>
        <pubDate>04 Feb 2026 00:00:00 GMT</pubDate>
        <category>Brief Research Report</category>
        <author>Ahmad Tarraf</author><author>Felix Wolf</author>
        <description><![CDATA[With the growing complexity of I/O software stacks and the rise of data-intensive workloads, optimizing I/O performance is essential for enhancing overall system performance on HPC clusters. While many sophisticated I/O management approaches exist that try to alleviate I/O contention, they often rely on models that predict the future I/O behavior of applications. Yet, these models are often created from past execution runs and can be error-prone due to I/O variability. In this work, we propose an enhancement to an existing tool that leverages frequency-based techniques to characterize I/O phases. We explore methods to improve prediction accuracy by incorporating multiple frequency components. Furthermore, by coupling the wavelet transformation with the Fourier transformation, we enhance the precision of our predictions while maintaining a compact and efficient behavioral characterization. We demonstrate our approach using a deep learning benchmark executed on a production cluster.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1763887</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1763887</link>
        <title><![CDATA[Correction: Processor simulation as a tool for performance engineering]]></title>
        <pubDate>08 Jan 2026 00:00:00 GMT</pubDate>
        <category>Correction</category>
        <author>Frontiers Production Office</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1669101</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1669101</link>
        <title><![CDATA[Processor simulation as a tool for performance engineering]]></title>
        <pubDate>02 Dec 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Carlos Falquez</author><author>Shiting Long</author><author>Nam Ho</author><author>Estela Suarez</author><author>Dirk Pleiter</author>
        <description><![CDATA[The diversity of processor architectures used for High-Performance Computing (HPC) applications has increased significantly over the last few years. This trend is expected to continue for different reasons, including the emergence of various instruction set extensions. Examples are the renewed interest in vector instructions like Arm's Scalable Vector Extension (SVE) or RISC-V's RVV. For application developers, research software developers, and performance engineers, the increased diversity and complexity of architectures have led to the following challenges: Limited access to these different processor architectures and more difficult root cause analysis in case of performance issues. To address these challenges, we propose leveraging the much-improved capabilities of processor simulators such as gem5. We enhanced this simulator with a performance analysis framework. We extend available performance counters and introduce new analysis capabilities to track the temporal behaviour of running applications. An algorithm has been implemented to link these statistics to specific regions. The resulting performance profiles allow for the identification of code regions with the potential for optimization. The focus is on observables to monitor quantities that are usually not directly accessible on real hardware. Different algorithms have been implemented to identify potential performance bottlenecks. The framework is evaluated for different types of HPC applications like the molecular-dynamics application GROMACS, Ligra, which implements the breadth-first search (BFS) algorithm, and a kernel from the Lattice QCD solver DD-αAMG.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1638203</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1638203</link>
        <title><![CDATA[Toward a persistent event-streaming system for high-performance computing applications]]></title>
        <pubDate>17 Sep 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Matthieu Dorier</author><author>Amal Gueroudji</author><author>Valérie Hayot-Sasson</author><author>Hai Duc Nguyen</author><author>Seth Ockerman</author><author>Renan Souza</author><author>Tekin Bicer</author><author>Haochen Pan</author><author>Philip Carns</author><author>Kyle Chard</author><author>Ryan Chard</author><author>Maxime Gonthier</author><author>Eliu Huerta</author><author>Ben Lenard</author><author>Bogdan Nicolae</author><author>Parth Patel</author><author>Justin Wozniak</author><author>Ian Foster</author><author>Nageswara S. Rao</author><author>Robert B. Ross</author>
        <description><![CDATA[High-performance computing (HPC) applications have traditionally relied on parallel file systems and file transfer services to manage data movement and storage. Alternative approaches have been proposed that use direct communications between application components, trading persistence and fault tolerance for speed. Event-driven architectures, as popularized in enterprise contexts, present a compelling middle ground, avoiding the performance cost and API constraints of parallel file systems while retaining persistence and offering impedance matching between application components. However, adapting streaming frameworks to HPC workloads requires addressing challenges unique to HPC systems. This paper investigates the potential for a streaming framework designed for HPC infrastructures and use cases. We introduce Mofka, a persistent event-streaming framework designed specifically for HPC environments. Mofka combines the capabilities of a traditional streaming service with optimizations tailored to the HPC context, such as support for massively multicore nodes, efficient scaling for large producer-consumer workflows, RDMA-enabled high-performance network communications, specialized network fabrics with multiple links per node, and efficient handling of large scientific data payloads. Built using the Mochi suite of HPC data service components, Mofka provides a lightweight, modular, and high-performance solution for persistent streaming in HPC systems. We present the architecture of Mofka and evaluate its performance against Kafka and Redpanda using benchmarks on diverse platforms, including Argonne's Polaris and Oak Ridge's Frontier supercomputers, showing up to 8× improvement in throughput in some scenarios. 
We then demonstrate its utility in several real-world applications: a tomographic reconstruction pipeline, a workflow for the discovery of metal-organic frameworks for carbon capture, and the instrumentation of Dask workflows for provenance tracking and performance analysis.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1393936</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1393936</link>
        <title><![CDATA[An analysis of the I/O semantic gaps of HPC storage stacks]]></title>
        <pubDate>11 Aug 2025 00:00:00 GMT</pubDate>
        <category>Hypothesis and Theory</category>
        <author>Sebastian Oeste</author><author>Patrick Höhn</author><author>Michael Kluge</author><author>Julian Kunkel</author>
        <description><![CDATA[Modern high-performance computing (HPC) Input/Output (I/O) systems consist of stacked hardware and software layers that provide interfaces for data access. Depending on application needs, developers usually choose higher layers with richer semantics for ease of use or lower layers for performance. Each I/O interface on a given stack consists of a set of operations and their syntactic definition, as well as a set of semantic properties. To properly function, high-level libraries such as Hierarchical Data Format version 5 (HDF5) need to map their semantics to lower-level Application Programming Interfaces (APIs) such as the Portable Operating System Interface (POSIX). Lower-level storage backends provide different I/O semantics than the layers in the stack above while sometimes implementing the same interface. However, most I/O interfaces do not transport semantic information through their APIs. Ideally, no semantics of an I/O operation should be lost while passing through the I/O stack, allowing lower layers to optimize performance. Unfortunately, there is a lack of general definition and unified taxonomy of I/O semantics. Similarly, system-level APIs offer little support for passing semantics to underlying layers. Thus, passing semantic information between layers is currently not feasible. In this article, we systematically compare I/O interfaces by examining their semantics across the HPC I/O stack. Our primary goal is to provide a taxonomy and comparative analysis, not to propose a new I/O interface or implementation. We propose a general definition of I/O semantics and present a unified classification of I/O semantics based on the categories of concurrent access, persistency, consistency, spatiality, temporality, and mutability. This allows us to compare I/O interfaces in terms of their I/O semantics.
We show that semantic information is lost while traveling through the storage stack, which often prevents the underlying storage backends from making the proper performance and consistency decisions. In other words, each layer acts like a semantic filter for the lower layers. We discuss how higher-level abstractions could propagate their semantics and assumptions down through the lower levels of the I/O stack. As a possible mitigation, we discuss the conceptual design of semantics-aware interfaces to illustrate how such interfaces might address semantic loss—though we do not propose a concrete new implementation.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1570210</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1570210</link>
        <title><![CDATA[FlexNPU: a dataflow-aware flexible deep learning accelerator for energy-efficient edge devices]]></title>
        <pubDate>26 Jun 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Arnab Raha</author><author>Deepak A. Mathaikutty</author><author>Shamik Kundu</author><author>Soumendu K. Ghosh</author>
        <description><![CDATA[This paper introduces FlexNPU, a Flexible Neural Processing Unit, which adopts agile design principles to enable versatile dataflows, enhancing energy efficiency. Unlike conventional convolutional neural network accelerator architectures that adhere to fixed dataflows (such as input, weight, output, or row stationary) to transfer activations and weights between storage and compute units, our design enables adaptable dataflows of any type through configurable software descriptors. Considering that data movement costs considerably outweigh compute costs from an energy perspective, the flexibility in dataflow allows us to optimize the movement per layer for minimal data transfer and energy consumption, a capability unattainable in fixed dataflow architectures. To further enhance throughput and reduce energy consumption in the FlexNPU architecture, we propose a novel sparsity-based acceleration logic that utilizes fine-grained sparsity in both the activation and weight tensors to bypass redundant computations, thus optimizing the convolution engine within the hardware accelerator. Extensive experimental results underscore a significant improvement in the performance and energy efficiency of FlexNPU compared to existing DNN accelerators.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1572844</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1572844</link>
        <title><![CDATA[FPGA innovation research in the Netherlands: present landscape and future outlook]]></title>
        <pubDate>24 Jun 2025 00:00:00 GMT</pubDate>
        <category>Review</category>
        <author>Nikolaos Alachiotis</author><author>Sjoerd van den Belt</author><author>Steven van der Vlugt</author><author>Reinier van der Walle</author><author>Mohsen Safari</author><author>Bruno Endres Forlin</author><author>Tiziano De Matteis</author><author>Zaid Al-Ars</author><author>Roel Jordans</author><author>António J. Sousa de Almeida</author><author>Federico Corradi</author><author>Christiaan Baaij</author><author>Ana-Lucia Varbanescu</author>
        <description><![CDATA[Field programmable gate arrays (FPGAs) have transformed digital design by enabling versatile and customizable solutions that balance performance and power efficiency, rendering them essential for today's diverse computing challenges. Research in the Netherlands in both academia and industry plays a major role in developing innovative FPGA solutions. This survey presents the current landscape of FPGA innovation research in the Netherlands by delving into ongoing projects, advancements, and breakthroughs in the field. Focusing on recent research outcomes (within the past 5 years), we have identified five key research areas: (a) FPGA architecture, (b) FPGA robustness, (c) data center infrastructure and high-performance computing, (d) programming models and tools, and (e) applications. This survey provides in-depth insights beyond a mere snapshot of the current innovation research landscape by highlighting future research directions within each key area; these insights can serve as a foundational resource to inform potential national-level investments in FPGA technology.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1520151</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1520151</link>
        <title><![CDATA[FPGA-accelerated SpeckleNN with SNL for real-time X-ray single-particle imaging]]></title>
        <pubDate>18 Jun 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Abhilasha Dave</author><author>Cong Wang</author><author>James Russell</author><author>Ryan Herbst</author><author>Jana Thayer</author>
        <description><![CDATA[We present the implementation of a specialized version of our previously published unified embedding model, SpeckleNN, for real-time speckle pattern classification in X-ray Single-Particle Imaging (SPI), using the SLAC Neural Network Library (SNL) on an FPGA platform. This hardware realization transitions SpeckleNN from a prototype into a practical edge solution, optimized for running inference near the detector in high-throughput X-ray free-electron laser (XFEL) facilities, such as those found at the Linac Coherent Light Source (LCLS). To address the resource constraints inherent in FPGAs, we developed a more specialized version of SpeckleNN. The original model, which was designed for broader classification across multiple biological samples, comprised ~5.6 million parameters. The new implementation, while reducing the parameter count to 64.6K (a 98.8% reduction), focuses on maintaining the model's essential functionality for real-time operation, achieving an accuracy of 90%. Furthermore, we compressed the latent space from 128 to 50 dimensions. This implementation was demonstrated on the KCU1500 FPGA board, utilizing 71% of available DSPs, 75% of LUTs, and 48% of FFs, with an average power consumption of 9.4 W according to the Vivado post-implementation report. The FPGA performed inference on a single image with a latency of 45.015 microseconds at a 200 MHz clock rate. In comparison, running the same inference on an NVIDIA A100 GPU resulted in an average power consumption of ~73 W and an image processing latency of around 400 microseconds. Our FPGA-accelerated version of SpeckleNN demonstrated significant improvements, achieving an 8.9× speedup and a 7.8× reduction in power consumption compared to the GPU implementation.
Key advancements include model specialization and dynamic weight loading through SNL, which eliminates the need for time-consuming FPGA design re-synthesis, allowing fast and continuous deployment of models (re)trained online. These innovations enable real-time adaptive classification and efficient vetoing of speckle patterns, making SpeckleNN more suited for deployment in XFEL facilities. This implementation has the potential to significantly accelerate SPI experiments and enhance adaptability to evolving experimental conditions.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1550855</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1550855</link>
        <title><![CDATA[Resilient execution of distributed X-ray image analysis workflows]]></title>
        <pubDate>06 Jun 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Hai Duc Nguyen</author><author>Tekin Bicer</author><author>Bogdan Nicolae</author><author>Rajkumar Kettimuthu</author><author>E. A. Huerta</author><author>Ian T. Foster</author>
        <description><![CDATA[Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact—by up to 500×—while maintaining negligible overhead (<3%).]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1537080</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1537080</link>
        <title><![CDATA[A SWIN-based vision transformer for high-fidelity and high-speed imaging experiments at light sources]]></title>
        <pubDate>30 May 2025 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Songyuan Tang</author><author>Tekin Bicer</author><author>Kamel Fezzaa</author><author>Samuel Clark</author>
        <description><![CDATA[Introduction: High-speed x-ray imaging experiments at synchrotron radiation facilities enable the acquisition of spatiotemporal measurements, reaching millions of frames per second. These high data acquisition rates are often prone to noisy measurements, or in the case of slower (but less noisy) rates, the loss of scientifically significant phenomena. Methods: We develop a Shifted Window (SWIN)-based vision transformer to reconstruct high-resolution x-ray image sequences with high fidelity and at a high frame rate and evaluate the underlying algorithmic framework on a high-performance computing (HPC) system. We characterize model parameters that could affect the training scalability, quality of the reconstruction, and running time during the model inference stage, such as the batch size, number of input frames to the model, their composition in terms of low and high-resolution frames, and the model size and architecture. Results: With 3 subsequent low resolution (LR) frames and another 2 high resolution (HR) frames differing in the spatial and temporal resolutions by factors of 4 and 20, respectively, the proposed algorithm achieved an average peak signal-to-noise ratio of 37.40 dB and 35.60 dB. Discussion: Further, the model was trained on the Argonne Leadership Computing Facility's Polaris HPC system using 40 Nvidia A100 GPUs, speeding up the end-to-end training time by about 10× compared to the training with beamline-local computing resources.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1611997</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1611997</link>
        <title><![CDATA[Editorial: Scientific workflows at extreme scales]]></title>
        <pubDate>26 May 2025 00:00:00 GMT</pubDate>
        <category>Editorial</category>
        <author>Anshu Dubey</author><author>Erik Draeger</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1536501</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1536501</link>
        <title><![CDATA[A definition and taxonomy of digital twins: case studies with machine learning and scientific applications]]></title>
        <pubDate>13 Mar 2025 00:00:00 GMT</pubDate>
        <category>Review</category>
        <author>Adam Weingram</author><author>Carolyn Cui</author><author>Stephanie Lin</author><author>Samuel Munoz</author><author>Toby Jacob</author><author>Joshua Viers</author><author>Xiaoyi Lu</author>
        <description><![CDATA[As next-generation scientific instruments and simulations generate ever larger datasets, there is a growing need for high-performance computing (HPC) techniques that can provide timely and accurate analysis. With artificial intelligence (AI) and hardware breakthroughs at the forefront in recent years, interest in using this technology to perform decision-making tasks with continuously evolving real-world datasets has increased. Digital twinning is one method in which virtual replicas of real-world objects are modeled, updated, and interpreted to perform such tasks. However, the interface between AI techniques, digital twins (DT), and HPC technologies has yet to be thoroughly investigated despite the natural synergies between them. This paper explores the interface between digital twins, scientific computing, and machine learning (ML) by presenting a consistent definition for the digital twin, performing a systematic analysis of the literature to build a taxonomy of ML-enhanced digital twins, and discussing case studies from various scientific domains. We identify several promising future research directions, including hybrid assimilation frameworks and physics-informed techniques for improved accuracy. Through this comprehensive analysis, we aim to highlight both the current state-of-the-art and critical paths forward in this rapidly evolving field.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1536471</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1536471</link>
        <title><![CDATA[End-to-end deep learning pipeline for real-time Bragg peak segmentation: from training to large-scale deployment]]></title>
        <pubdate>2025-03-12T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Cong Wang</author><author>Valerio Mariani</author><author>Frédéric Poitevin</author><author>Matthew Avaylon</author><author>Jana Thayer</author>
        <description><![CDATA[X-ray crystallography reconstruction, which transforms discrete X-ray diffraction patterns into three-dimensional molecular structures, relies critically on accurate Bragg peak finding for structure determination. As X-ray free electron laser (XFEL) facilities advance toward MHz data rates (1 million images per second), traditional peak finding algorithms that require manual parameter tuning or exhaustive grid searches across multiple experiments become increasingly impractical. While deep learning approaches offer promising solutions, their deployment in high-throughput environments presents significant challenges in automated dataset labeling, model scalability, edge deployment efficiency, and distributed inference capabilities. We present an end-to-end deep learning pipeline with three key components: (1) a data engine that combines traditional algorithms with our peak matching algorithm to generate high-quality training data at scale, (2) a modular architecture that scales from a few million to hundreds of millions of parameters, enabling us to train large expert-level models offline while deploying smaller, distilled models at the edge, and (3) a decoupled producer-consumer architecture that separates the specialized data source layer from model inference, enabling flexible deployment across diverse computing environments. Using this integrated approach, our pipeline achieves accuracy comparable to traditional methods tuned by human experts while eliminating the need for experiment-specific parameter tuning. Although current throughput requires optimization for MHz facilities, our system's scalable architecture and demonstrated model compression capabilities provide a foundation for future high-throughput XFEL deployments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1303358</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1303358</link>
        <title><![CDATA[Nek5000/RS performance on advanced GPU architectures]]></title>
        <pubdate>2025-02-21T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Misun Min</author><author>Yu-Hsiang Lan</author><author>Paul Fischer</author><author>Thilina Rathnayake</author><author>John Holmen</author>
        <description><![CDATA[The authors explore performance scalability of the open-source thermal-fluids code, NekRS, on the U.S. Department of Energy's leadership computers, Crusher, Frontier, Summit, Perlmutter, and Polaris. Particular attention is given to analyzing performance and time-to-solution at the strong-scale limit for a target efficiency of 80%, which is typical for production runs on the DOE's high-performance computing systems. Several examples of anomalous behavior are also discussed and analyzed.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1520207</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2025.1520207</link>
        <title><![CDATA[Energy-aware operation of HPC systems in Germany]]></title>
        <pubdate>2025-02-19T00:00:00Z</pubdate>
        <category>Review</category>
        <author>Estela Suarez</author><author>Hendryk Bockelmann</author><author>Norbert Eicker</author><author>Jan Eitzinger</author><author>Salem El Sayed</author><author>Thomas Fieseler</author><author>Martin Frank</author><author>Peter Frech</author><author>Pay Giesselmann</author><author>Daniel Hackenberg</author><author>Georg Hager</author><author>Andreas Herten</author><author>Thomas Ilsche</author><author>Bastian Koller</author><author>Erwin Laure</author><author>Cristina Manzano</author><author>Sebastian Oeste</author><author>Michael Ott</author><author>Klaus Reuter</author><author>Ralf Schneider</author><author>Kay Thust</author><author>Benedikt von St. Vieth</author>
        <description><![CDATA[High Performance Computing (HPC) systems are among the most energy-intensive scientific facilities, with electric power consumption reaching and often exceeding 20 Megawatts per installation. Unlike other major scientific infrastructures such as particle accelerators or high-intensity light sources, of which only a few exist worldwide, the number and size of supercomputers are continuously increasing. Even if every new system generation is more energy efficient than the previous one, the overall growth in size of the HPC infrastructure, driven by a rising demand for computational capacity across all scientific disciplines, and especially by Artificial Intelligence (AI) workloads, rapidly drives up the energy demand. This challenge is particularly significant for HPC centers in Germany, where high electricity costs, stringent national energy policies, and a strong commitment to environmental sustainability are key factors. This paper describes various state-of-the-art strategies and innovations employed to enhance the energy efficiency of HPC systems within the national context. Case studies from leading German HPC facilities illustrate the implementation of novel heterogeneous hardware architectures, advanced monitoring infrastructures, high-temperature cooling solutions, energy-aware scheduling, and dynamic power management, among other optimizations. By reviewing best practices and ongoing research, this paper aims to share valuable insight with the global HPC community, motivating the pursuit of more sustainable and energy-efficient HPC architectures and operations.]]></description>
      </item>
      </channel>
    </rss>