<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
      <channel>
        <title>Frontiers in High Performance Computing | Parallel and Distributed Software section | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/high-performance-computing/sections/parallel-and-distributed-software</link>
        <description>RSS Feed for Parallel and Distributed Software section in the Frontiers in High Performance Computing journal | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator,version:1</generator>
        <pubDate>Mon, 06 Apr 2026 08:24:32 +0000</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1473102</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1473102</link>
        <title><![CDATA[Evaluation of work distribution schedulers for heterogeneous architectures and scientific applications]]></title>
        <pubDate>Tue, 10 Dec 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Marc Gonzalez Tallada, Enric Morancho</author>
        <description><![CDATA[This article explores and evaluates variants of state-of-the-art work distribution schemes adapted for scientific applications running on hybrid systems. A hybrid (multi-GPU and multi-CPU) implementation of the NAS Parallel Benchmarks Multi-Zone (NPB-MZ) suite is described to study the different elements that condition the execution of these applications when parallelism is spread over a set of computing units (CUs) of different computational power (e.g., GPUs and CPUs). This article studies the influence of the work distribution schemes on data placement across the devices and the host, which in turn determines the communications between the CUs, and evaluates how the schedulers are affected by the relationship between data placement and communications. We show that only schedulers that are aware of the differing computational power of the CUs and that minimize communications achieve an appropriate work balance and high performance. Only then does the combination of GPUs and CPUs result in an effective parallel implementation that outperforms a non-hybrid multi-GPU implementation. The article describes and evaluates the schedulers static-pcf, Guided, and Clustered Guided, which address the previously mentioned limitations of hybrid systems. We compare them against state-of-the-art static and memorizing dynamic schedulers. Finally, on a system with an AMD EPYC 7742 at 2.25 GHz (64 cores, 2 threads per core, 128 threads) and two AMD Radeon Instinct MI50 GPUs with 32 GB, we observed that hybrid executions achieve speedups from 1.1× to 3.5× with respect to a non-hybrid GPU implementation.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1444337</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1444337</link>
        <title><![CDATA[Parallel and scalable AI in HPC systems for CFD applications and beyond]]></title>
        <pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate>
        <category>Technology and Code</category>
        <author>Rakesh Sarma, Eray Inanc, Marcel Aach, Andreas Lintermann</author>
        <description><![CDATA[This manuscript presents the AI4HPC library, with its architecture and components. The library enables large-scale training of AI models on High-Performance Computing systems. It addresses challenges in handling non-uniform datasets through data manipulation routines, model complexity through specialized ML architectures, scalability through extensive code optimizations that augment performance, HyperParameter Optimization (HPO), and performance monitoring. The scalability of the library is demonstrated by strong-scaling experiments on up to 3,664 Graphics Processing Units (GPUs), resulting in a scaling efficiency of 96%, using the performance on one node as the baseline. Furthermore, code optimizations and communication/computation bottlenecks are discussed for training a neural network on an actuated Turbulent Boundary Layer (TBL) simulation dataset (8.3 TB) on the HPC system JURECA at the Jülich Supercomputing Centre. The distributed training approach significantly influences the accuracy, which can be drastically compromised by varying mini-batch sizes. Therefore, AI4HPC implements learning-rate scaling and adaptive summation algorithms, which are tested and evaluated in this work. For the TBL use case, results scaled up to 64 workers are shown; a further increase in the number of workers causes additional overhead due to too few dataset samples per worker. Finally, the library is applied to the reconstruction of TBL flows with a convolutional autoencoder-based architecture and a diffusion model. In the case of the autoencoder, a modal decomposition shows that the network provides accurate reconstructions of the underlying field and achieves a mean drag prediction error of ≈5%. With the diffusion model, a reconstruction error of ≈4% is achieved when super-resolution is applied to 5-fold coarsened velocity fields. The AI4HPC library is agnostic to the underlying network and can be adapted across various scientific and technical disciplines.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1417040</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1417040</link>
        <title><![CDATA[Runtime support for CPU-GPU high-performance computing on distributed memory platforms]]></title>
        <pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Polykarpos Thomadakis, Nikos Chrisochoides</author>
        <description><![CDATA[Introduction: Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. Methods: This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs. Results: The framework in a distributed memory environment offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It delivers superior performance compared to MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation in a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%. Discussion: This is accomplished by the optimizations at the library level and by creating opportunities to leverage application-specific optimizations like over-decomposition.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1285349</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fhpcp.2024.1285349</link>
        <title><![CDATA[The fast and the capacious: memory-efficient multi-GPU accelerated explicit state space exploration with GPUexplore 3.0]]></title>
        <pubDate>Wed, 13 Mar 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Anton Wijs, Muhammad Osama</author>
        <description><![CDATA[The GPU acceleration of explicit state space exploration, for explicit-state model checking, has been the subject of previous research, but to date, the tools have been limited in their applicability and practical use. Building on this research, we are, to our knowledge, the first to use a novel tree database for GPUs, which allows high-performance, memory-efficient storage of states in the form of binary trees. Besides the tree compression this enables, we also propose two new hashing schemes, compact-cuckoo and compact multiple-functions, which enable the use of Cleary compression to compactly store tree roots. Besides an in-depth discussion of the tree database algorithms, the input language and workflow of our tool, called GPUexplore 3.0, are presented. Finally, we explain how the algorithms can be extended to exploit multiple GPUs residing on the same machine. Experiments show single-GPU processing speeds of up to 144 million states per second, compared to the 20 million states per second achieved by 32-core LTSmin. In the multi-GPU setting, workload and storage distributions are optimal, and performance is frequently even positively impacted when the number of GPUs is increased. Overall, a logarithmic acceleration of up to 1.9× was achieved with four GPUs, compared to one and two GPUs. We believe that a linear speedup can be accomplished with faster P2P communications between the GPUs.]]></description>
      </item>
      </channel>
    </rss>