ORIGINAL RESEARCH article
Front. Med.
Sec. Regulatory Science
Volume 12 - 2025 | doi: 10.3389/fmed.2025.1598979
Automatic extraction of SmPC document for IDMP data model construction using foundation LLM and RAG: a preliminary experiment for pharmaceutical regulatory affairs
Provisionally accepted
- 1 Groupe ProductLife S.A, Courbevoie, France
- 2 Pharma IT, Product Life Group Company, Copenhagen, Denmark
The pharmaceutical industry is undergoing a significant shift from traditional paper-based processes to data-driven approaches. This transition requires the adoption of structured data-exchange standards, such as the IDentification of Medicinal Products (IDMP), to improve harmonization, transparency, and interoperability across global regulatory landscapes. However, transforming unstructured documents, such as Summaries of Product Characteristics (SmPCs), into structured IDMP models presents considerable challenges in data extraction and standardization.
Methods: We investigated the application of foundation Large Language Models (LLMs), namely Claude 3.5 Sonnet and Gemini 1.5 Flash, combined with Retrieval-Augmented Generation (RAG) techniques. For retrieval, we compared embedding models (generalist, specialized, and hybrid) with rule-based approaches. To improve the accuracy of the medicinal product information extracted from SmPC documents, we evaluated multiple prompting strategies.
Results: Claude 3.5 Sonnet significantly outperformed Gemini 1.5 Flash. RAG approaches based on semantic search with embedding models were superior to rule-based methods overall, and the best-performing embedding model depended on the type of information being extracted. Prompts that incorporated context, action, and examples were more effective than those based solely on role and steps. The approach achieved a BERT F1 score of up to 0.98 for the medicinal product section.
Conclusion: Our proposed solution offers accurate and scalable data extraction from SmPCs, significantly advancing the digitalization of pharmaceutical regulatory affairs. It shows that standardization is achievable in practice, which would enhance interoperability and harmonization within the industry. Our method supports regulatory compliance and data management for pharmaceutical companies, helping to streamline regulatory processes and promote consistency.
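To make the retrieval comparison described in the abstract more concrete, the following is a minimal sketch, not the authors' pipeline. It contrasts a rule-based lookup (matching an SmPC section number) with a semantic lookup (ranking chunks by cosine similarity), then assembles a prompt with context, action, and example parts. Everything here is an assumption made for illustration: the sample chunks are invented, `embed` is a bag-of-words stand-in for a real embedding model, and the function names and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter

# Invented SmPC-like chunks for illustration only (not data from the study).
CHUNKS = [
    "1. NAME OF THE MEDICINAL PRODUCT\nExampledrug 500 mg film-coated tablets",
    "2. QUALITATIVE AND QUANTITATIVE COMPOSITION\nEach tablet contains 500 mg of examplamine.",
    "4.1 Therapeutic indications\nExampledrug is indicated for the treatment of mild pain in adults.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def rule_based_retrieve(section_number, chunks):
    """Return the chunks whose heading starts with the given section number."""
    pattern = re.compile(rf"^{re.escape(section_number)}\.?\s")
    return [c for c in chunks if pattern.match(c)]


def build_prompt(context, field):
    """Hypothetical prompt template with context, action, and example parts."""
    return (
        f"Context:\n{context}\n\n"
        f"Action: extract the value of '{field}' from the context "
        "and return it as JSON.\n"
        'Example: {"strength": "500 mg"}\n'
    )


if __name__ == "__main__":
    top = semantic_retrieve("What is the name of the medicinal product?", CHUNKS)
    print(build_prompt(top[0], "medicinal product name"))
```

In a real setting, `embed` would be replaced by a generalist or domain-specific embedding model and the assembled prompt would be sent to an LLM. The sketch only shows where the retrieval method and the prompt structure enter the process.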
Keywords: IDMP, SmPC extraction, pharmaceutical regulatory affairs, LLM, RAG, NLP, data standardization, regulatory compliance
Received: 24 Mar 2025; Accepted: 07 Jul 2025.
Copyright: © 2025 KADI, Abdellatif, Kemajou Njamen and PEREME. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: HOCINE KADI, Groupe ProductLife S.A, Courbevoie, France
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.