SYSTEMATIC REVIEW article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1466092

A review on knowledge and information extraction from PDF documents and storage approaches

Provisionally accepted
  • 1International Centre of Insect Physiology and Ecology (ICIPE), Nairobi, Kenya
  • 2University of KwaZulu-Natal, Durban, KwaZulu-Natal, South Africa

The final, formatted version of the article will be published soon.

Automating the extraction of information from Portable Document Format (PDF) documents represents a significant milestone in information extraction, potentially reducing manual labor and facilitating knowledge discovery across diverse domains such as healthcare, law, and biochemistry. However, the reliability of current solutions remains contested, particularly in terms of accuracy, domain adaptability, and the effort required to implement robust systems. This study presents a comprehensive review of existing literature on information extraction from PDF documents, conducted using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology. The review identifies prevailing trends and methodologies in information extraction, including rule-based systems, statistical learning approaches, and neural network-based models, while highlighting their limitations. Challenges include, among others, the rigidity and complexity of rule-based methods, the scarcity of well-annotated, domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models. To address these shortcomings, the study proposes a conceptual framework comprising nine core components: projects manager, documents manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework is intended to enhance the accuracy, domain adaptability, and usability of PDF information extraction systems.

Keywords: natural language processing, large language models, knowledge base, knowledge extraction, knowledge graphs

Received: 17 Jul 2024; Accepted: 18 Aug 2025.

Copyright: © 2025 Salvador, Tonnang and Odindi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Atagong Desconsciences Salvador, International Centre of Insect Physiology and Ecology (ICIPE), Nairobi, 00100, Kenya

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.