DATA REPORT article
Front. High Perform. Comput.
Sec. Architecture and Systems
This article is part of the Research Topic "Emerging Trends in Software Tools for Exascale Application Development".
OpenMP-Annotated Code Dataset for Large Language Model Fine-Tuning on Parallel Programming Tasks
Provisionally accepted
1 Emory University, Atlanta, United States
2 Sandia National Laboratories, Albuquerque, United States
High-performance computing (HPC) plays a critical role in scientific discovery, engineering simulation, and data-intensive applications. OpenMP (Open Multi-Processing) is one of the most widely adopted shared-memory parallel programming interfaces, enabling developers to write multi-threaded applications in C, C++, and Fortran. However, correctly implementing OpenMP directives requires significant expertise, as developers must understand parallel programming concepts, data dependencies, and performance optimization strategies.

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in code generation and completion tasks (Jiang et al., 2024; Huynh and Lin, 2025; Zan et al., 2023). These models, trained on vast corpora of source code, can assist developers by generating code snippets, completing partial implementations, and even translating natural language descriptions into executable programs. However, general-purpose code LLMs often struggle with domain-specific parallel programming constructs such as OpenMP pragmas. The scarcity of high-quality, task-specific training data for HPC code represents a significant barrier to developing effective AI assistants for parallel programming (Nichols et al., 2023). While existing code generation benchmarks focus primarily on sequential programming tasks in languages such as Python and Java, there is limited availability of curated datasets specifically targeting parallel programming paradigms.

This data report presents a curated dataset specifically designed for fine-tuning LLMs on OpenMP pragma completion tasks. The dataset contains 77,890 source files comprising over 15 million lines of code extracted from 387 GitHub repositories. Each training sample is structured to teach models the relationship between code context, loop structures, and appropriate OpenMP directives. This dataset addresses a critical gap in available resources for training AI models on parallel programming tasks and provides a foundation for developing intelligent code completion tools for HPC developers.

The primary contributions of this dataset are: (1) a systematically collected corpus of real-world OpenMP code from active HPC projects, (2) structured annotations that isolate OpenMP pragmas and their associated loop contexts, and (3) comprehensive preprocessing and quality filtering to ensure dataset integrity. This resource enables researchers to develop and evaluate LLMs specifically tailored for parallel programming assistance.

Data collection was conducted between June and July 2024 using the GitHub API. Source repositories were identified through a systematic query process targeting HPC-relevant codebases. The selection criteria were designed to ensure code quality and relevance to OpenMP development practices:

• Primary languages: C and C++ (specified as the repository's primary language)
• Repository topics: HPC, OpenMP, parallel-computing, scientific-computing, high-performance-computing, computational-science, proxy-application, mini-app
• Minimum stars: ≥3 (indicating community validation and active use (Borges and Valente, 2018))
• Repository scope: publicly accessible repositories with permissive licenses

The GitHub API query utilized a custom Python script that performed systematic searches combining repository topics and language filters (Gousios and Spinellis, 2012). Authentication was handled through GitHub personal access tokens to enable comprehensive repository access.
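As an illustration of this search step, the following minimal sketch uses the public GitHub search REST API via the requests library; the topic list, star threshold, and pagination mirror the criteria above, while the environment variable, function name, and page size are assumptions rather than details of the authors' actual script.

import os
import time
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]            # personal access token (assumed env var)
HEADERS = {
    "Authorization": f"token {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json",
}

# Topic and language filters mirroring the selection criteria above.
TOPICS = ["openmp", "hpc", "parallel-computing", "scientific-computing",
          "high-performance-computing", "computational-science",
          "proxy-application", "mini-app"]
LANGUAGES = ["C", "C++"]

def search_repositories(topic, language, max_pages=34):
    """Collect repositories for one topic/language pair, page by page."""
    repos = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers=HEADERS,
            params={"q": f"topic:{topic} language:{language} stars:>=3",
                    "per_page": 30, "page": page},
        )
        if resp.status_code == 422:                   # search API caps results at 1,000
            break
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:                                 # no further results
            break
        repos.extend(item["full_name"] for item in items)
        time.sleep(2)                                 # crude rate-limiting safeguard
    return repos

all_repos = set()
for topic in TOPICS:
    for language in LANGUAGES:
        all_repos.update(search_repositories(topic, language))
print(f"Collected {len(all_repos)} unique repositories")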
The search process iterated through pagination (up to 34 pages per query) to exhaustively collect matching repositories, with rate-limiting safeguards to comply with API restrictions. This filtering strategy yielded 387 unique repositories, balancing dataset size with code quality (Cosentino et al., 2016; Munaiah et al., 2017). The minimum star threshold ensured that repositories had achieved some level of community review (Borges and Valente, 2018), while the topic filters specifically targeted HPC applications where OpenMP usage reflects realistic parallel programming patterns.

Identified repositories were cloned locally using the GitPython library to create local copies organized by repository full name (owner/repository structure). This organization preserves provenance information for each source file. Source file extraction focused exclusively on C/C++ implementation and header files, identified by the extensions .c, .cc, .cpp, .cxx, .C, .h, .hh, .hpp, .H, .hxx, .Hxx, .HXX (Allamanis and Sutton, 2013). Recursive directory traversal proved approximately 2-3× faster than glob-based approaches for large directory structures. Files containing invalid path characters (brackets) were excluded to prevent filesystem conflicts.

Initial extraction yielded 105,861 source files totaling 22,653,593 lines of code (0.71 GB). To ensure dataset quality and prevent training bias, a multi-stage preprocessing pipeline was implemented (sketched below). Files containing non-UTF-8 characters were removed by attempting to read each file with UTF-8 encoding; files that could not be decoded were excluded to prevent tokenization issues during model training. Two size constraints were applied:

• Minimum token count: files with fewer than 15 tokens (whitespace-delimited) were excluded, as they typically contained only boilerplate or comments
• File size limit: files exceeding 1 MB were removed, as these typically represented embedded libraries, generated code, or raw data rather than human-authored source code

Duplicate files are prevalent across GitHub repositories due to forking, vendored dependencies, and copied implementations (Markovtsev and Long, 2018). SHA-256 hashes of file contents were computed using a memory-efficient streaming approach, and files with identical content hashes were deduplicated, retaining only the first occurrence. After preprocessing, the dataset comprised 77,890 unique source files with 15,367,210 lines of code (0.49 GB), representing an 18% reduction from the initial collection, primarily due to deduplication.

The core dataset transformation extracts individual OpenMP parallel for constructs and formats them for pragma completion tasks. An automated extraction pipeline implements this transformation. A compiled regular expression (#pragma omp parallel for.*) identified all OpenMP parallel for directives in each source file; the pattern uses multiline matching to handle pragmas spanning multiple lines. For each identified pragma, a bracket-matching algorithm extracted the complete associated loop structure:

1. Locate the opening brace { following the pragma
2. Traverse the code, incrementing a bracket stack counter for each { and decrementing for each }
3. Extract the complete loop body when the counter returns to zero
4. Exclude pragmas where bracket matching failed (typically single-statement loops without braces)

C/C++-style comments (// and /* */) were removed from pragma lines using pattern matching to isolate the actual directive syntax.
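A minimal sketch of the filtering and deduplication stages described above, assuming the cloned repositories sit under a local repos/ directory; the thresholds mirror the values reported in the text, but the directory layout and function names are illustrative.

import hashlib
from pathlib import Path

C_CPP_EXTENSIONS = {".c", ".cc", ".cpp", ".cxx", ".C", ".h", ".hh",
                    ".hpp", ".H", ".hxx", ".Hxx", ".HXX"}
MIN_TOKENS = 15                      # whitespace-delimited tokens
MAX_BYTES = 1_000_000                # 1 MB file-size limit

def file_sha256(path, chunk_size=1 << 20):
    """Memory-efficient streaming hash of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def keep_file(path):
    """Apply the extension, size, encoding, and token-count filters."""
    if path.suffix not in C_CPP_EXTENSIONS or path.stat().st_size > MAX_BYTES:
        return False
    try:
        text = path.read_text(encoding="utf-8")      # reject non-UTF-8 files
    except UnicodeDecodeError:
        return False
    return len(text.split()) >= MIN_TOKENS

seen_hashes = set()
unique_files = []
for path in Path("repos").rglob("*"):                # recursive traversal
    if path.is_file() and keep_file(path):
        digest = file_sha256(path)
        if digest not in seen_hashes:                 # drop exact duplicates
            seen_hashes.add(digest)
            unique_files.append(path)
print(f"{len(unique_files)} unique source files retained")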
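The pragma extraction and brace matching can be sketched as follows; the regular expression follows the one quoted above, while the helper names and edge-case handling are illustrative assumptions.

import re

PRAGMA_RE = re.compile(r"#pragma omp parallel for.*", re.MULTILINE)
COMMENT_RE = re.compile(r"//.*|/\*.*?\*/")            # strips comments on the pragma line

def extract_loop_body(source, start):
    """Return the brace-delimited block beginning at or after `start`."""
    open_idx = source.find("{", start)
    if open_idx == -1:
        return None                                   # single-statement loop: skip
    depth = 0
    for idx in range(open_idx, len(source)):
        if source[idx] == "{":
            depth += 1
        elif source[idx] == "}":
            depth -= 1
            if depth == 0:                            # block closed
                return source[open_idx:idx + 1]
    return None                                       # unbalanced braces: skip
    # Note: braces inside string literals or comments are not handled in this sketch.

def extract_samples(source):
    """Yield (pragma, loop_body) pairs from one source file's text."""
    for match in PRAGMA_RE.finditer(source):
        pragma = COMMENT_RE.sub("", match.group(0)).strip()
        body = extract_loop_body(source, match.end())
        if body is not None:
            yield pragma, body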
Removing these comments prevents the model from learning spurious comment patterns. Each sample includes a configurable amount of preceding code context, enabling the model to learn contextual patterns, such as variable declarations, array definitions, and computational structures, that inform appropriate pragma selection. Extracted samples are annotated with special tokens to clearly delineate components: one pair of boundary tokens marks the start and end of the loop body, and a second pair marks the start and end of the pragma. This tokenization strategy enables the model to distinguish between the code to be generated (the pragma) and the conditioning context (the loop body).

This dataset focuses on extracting #pragma omp parallel for directives and their variants (e.g., parallel for simd, parallel for reduction(...)) as prediction targets. The extraction captures various clauses, including scheduling policies (schedule(static), schedule(dynamic), schedule(guided)), data sharing (private, shared, reduction, firstprivate, lastprivate), synchronization (nowait), loop transformations (collapse), thread control (num_threads), and ordering (ordered). Other OpenMP directive types, such as target (for accelerator offloading), task/taskloop (for task-based parallelism), and the OpenMP 5.0+ loop directive, are not systematically extracted as prediction targets, though they may appear in surrounding code context.

The final dataset is stored in JSON Lines format (.jsonl), with each line representing a single training sample (an illustrative record is sketched below). Each sample contains:

• Source file path: full path to the original source file (enables provenance tracking)
• Pragma directive: the extracted OpenMP pragma directive

This format enables efficient streaming during training and preserves metadata for subsequent analysis or dataset versioning.

This dataset is specifically designed for fine-tuning causal language models on OpenMP pragma completion tasks. The recommended training protocol involves:

1. Tokenization: process the annotated sample field using the target model's tokenizer with padding and truncation (recommended maximum length: 512 tokens)
2. Label preparation: use standard causal language modeling, where input sequences serve as both inputs and labels
3. Training objective: cross-entropy loss on next-token prediction
4. Evaluation task: given context and loop structure, generate the appropriate OpenMP pragma

The dataset enables models to learn:

• Relationships between computational patterns and parallelization strategies
• Appropriate scheduling policies (static, dynamic, guided)
• Data-sharing clauses (private, shared, reduction)
• Loop dependency analysis informing pragma selection

The final OpenMP dataset exhibits the following characteristics (Table 1). The dataset spans multiple C/C++ file types, with the distribution of lines of code revealing the predominance of implementation files:

• Implementation files (.c, .cpp, .cc, .cxx, .C): the majority of LOC, containing the actual OpenMP-annotated compute kernels
• Header files (.h, .hpp, .hh, .H): a lower LOC count, typically containing template implementations and inline functions

This distribution reflects typical C/C++ project structure, where OpenMP pragmas appear predominantly in implementation files containing computational loops.
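For illustration, one JSON Lines record might look like the following; the field names beyond the source path and pragma, the example path and loop, and the boundary tokens (<pragma>, </pragma>, <loop>, </loop>) are placeholders rather than the dataset's actual identifiers.

import json

# Hypothetical record; field names, path, tokens, and code are illustrative.
sample = {
    "source_file": "owner/repository/src/kernels/dot_product.c",
    "pragma": "#pragma omp parallel for reduction(+:s)",
    "sample": (
        "double s = 0.0;\n"                           # preceding code context
        "<pragma>#pragma omp parallel for reduction(+:s)</pragma>\n"
        "<loop>{\n    s += a[i] * b[i];\n}</loop>"
    ),
}

# JSON Lines: one record per line, appended to the dataset file.
with open("openmp_dataset.jsonl", "a", encoding="utf-8") as handle:
    handle.write(json.dumps(sample) + "\n")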
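A minimal sketch of the recommended fine-tuning protocol, using the Hugging Face transformers and datasets libraries; the base model, batch size, and the "sample" field name are assumptions made only to keep the example self-contained.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2"                                   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token             # GPT-2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

dataset = load_dataset("json", data_files="openmp_dataset.jsonl", split="train")

def tokenize(batch):
    # Pad/truncate to the recommended 512-token maximum length.
    return tokenizer(batch["sample"], truncation=True,
                     padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

# Standard causal language modeling: the collator copies inputs into labels
# for next-token prediction with cross-entropy loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="openmp-pragma-model",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()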
The dataset captures diverse OpenMP usage patterns from real-world HPC applications:

• Scheduling policies: examples include static, dynamic, and guided scheduling with various chunk sizes
• Data-sharing clauses: private variables, shared arrays, and reduction operations (e.g., reduction(+:s))
• Nested parallelism: some samples include nested parallel regions (though single-level pragmas dominate)
• Loop variants: both incrementing and decrementing loops with varying iteration patterns

The pragma #pragma omp parallel for schedule(static) appears frequently, representing a common parallelization pattern for regular, independent loop iterations. More complex pragmas with reduction clauses reflect numerical computation patterns (e.g., dot products, norm calculations).

The source repositories span diverse HPC application domains:

• Scientific computing: numerical methods and linear algebra operations (BLAS-level implementations)
• Geospatial analysis: GIS applications with parallel raster processing
• Proxy applications: mini-apps designed to represent computational kernels from large-scale simulations
• Computational libraries: reusable parallel algorithm implementations

This domain diversity ensures the dataset captures varied computational patterns and parallelization strategies rather than overfitting to specific application characteristics.

Several quality features distinguish this dataset:

1. Real-world code: extracted from actively maintained repositories rather than synthetic examples

Any dataset updates should be:

• Deposited as independent versions in the repository
• Documented with collection date ranges and repository counts
• Published as Addendum articles linking to this initial Data Report
Keywords: Code generation, Dataset, Fine-tuning, high-performance computing, Large language models, OpenMP, Parallel programming, pragma completion
Received: 19 Dec 2025; Accepted: 21 Jan 2026.
Copyright: © 2026 Etienne, Garcia de Gonzalo and Arnold. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Nichole Etienne
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
