AUTHOR=Kadi Hocine, Abdellatif Alaa, Kemajou Njamen Daniel Isaac, Pereme Florian TITLE=Automatic extraction of SmPC document for IDMP data model construction using foundation LLM and RAG: a preliminary experiment for pharmaceutical regulatory affairs JOURNAL=Frontiers in Medicine VOLUME=Volume 12 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2025.1598979 DOI=10.3389/fmed.2025.1598979 ISSN=2296-858X ABSTRACT=Introduction: The pharmaceutical industry is undergoing a significant shift from traditional paper-based processes to data-driven approaches. This transition necessitates the adoption of structured data-exchange standards, such as the IDentification of Medicinal Products (IDMP), to improve harmonization, transparency, and interoperability across global regulatory landscapes. However, transforming unstructured data, such as Summary of Product Characteristics (SmPC) documents, into structured IDMP models presents considerable challenges in data extraction and standardization. Methods: We investigated the application of foundation Large Language Models (LLMs), namely Claude 3.5 Sonnet and Gemini 1.5 Flash, combined with Retrieval-Augmented Generation (RAG) techniques. We utilized various embedding models (generalist, specialized, and hybrid) and rule-based retrieval approaches. To improve the precision of the medicinal product information extracted from the SmPC documents, we evaluated multiple prompting strategies. Results: Our investigation showed that Claude 3.5 Sonnet significantly surpassed Gemini 1.5 Flash in performance. Additionally, RAG-type approaches with semantic search using embedding models were superior to rule-based methods overall. The choice of embedding model was essential and depended on the type of information being extracted. Prompts that incorporated context, action, and examples were more effective than those based solely on role and steps.
The approach achieved a BERT F1 score of up to 0.98 for the medicinal product section. Conclusion: Our findings demonstrate that the proposed LLM-RAG approach enables accurate and scalable extraction of structured data from SmPCs. This supports the digital transformation of regulatory processes by promoting standardization, interoperability, and harmonization. If implemented effectively, the method could help pharmaceutical companies strengthen regulatory compliance, streamline submissions, and improve data consistency.