ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1585260

NLP-based removal of personally identifiable information from Hungarian electronic health records

Provisionally accepted
András Berzi1*, Ervin Berényi2, Zita Képes1, Barnabás Antal3, Ábrahám Gergely Varga4, Miklós Emri1
  • 1Division of Nuclear Medicine and Translational Imaging, Department of Medical Imaging, Faculty of Medicine, University of Debrecen, Debrecen, Hungary
  • 2Division of Radiology and Imaging Science, Department of Medical Imaging, Faculty of Medicine, University of Debrecen, Debrecen, Hungary
  • 3Organisational units directly managed by the Healthcare Vice-Chancellor, Health Finance Directorate, Chancellery, University of Debrecen, Debrecen, Hungary
  • 4Organisational units directly managed by the Healthcare Vice-Chancellor, Process Regulation Centre of the Clinical Centre, Chancellery, University of Debrecen, Debrecen, Hungary

The final, formatted version of the article will be published soon.

Introduction: Electronic health records (EHRs) in text format are crucial resources for data-driven medical research. Under the General Data Protection Regulation (GDPR), personal data in such records must be anonymized or pseudonymized to safeguard patient confidentiality. Natural language processing has consistently proven effective in automating the de-identification of sensitive information.

Methods: We present spaCy models that recognize personally identifiable information (PII) in a wide range of free-text medical records written in Hungarian, a low-resource language. To develop these models, we compiled a corpus of clinical documents by annotating sensitive information within electronic health records sourced from the University of Debrecen. To simplify the annotation process, we pre-annotated the documents using a rule-based method. The corpus comprises over 15,000 documents and includes more than 90,000 instances of PII. We trained several models using this corpus and also developed a separate validation corpus to assess their performance.

Results: The performance evaluation of the de-identification models on the developed corpora resulted in F1-scores ranging from 0.9697 to 0.9926. On the validation corpora, the F1-scores ranged from 0.9772 to 0.9867, demonstrating that the models can effectively handle previously unseen examples. Our risk analysis revealed that 99.67% of the sensitive information was successfully removed from the validation dataset.

Discussion: The results indicate that, similarly to other state-of-the-art systems, our models are highly effective at identifying PII in clinical texts, so that sensitive information in clinical documents can be protected without sacrificing the quality or usability of the data for research purposes. Despite these positive outcomes, several areas remain to be improved, such as additional testing on diverse datasets, particularly those from other healthcare institutions. With ongoing refinements, these models have the potential to greatly enhance the efficiency of data de-identification processes, ensuring compliance with privacy regulations while promoting the secure sharing of medical data for scientific progress.
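
The abstract mentions rule-based pre-annotation but does not describe the rules themselves. As a rough illustration only, the following Python sketch shows what such a step might look like: regular expressions propose candidate PII spans as (start, end, label) character offsets, which an annotator could then correct. The patterns, the label names (DATE, ID, PHONE), and the function names are all hypothetical and are not taken from the article.

```python
import re

# Hypothetical patterns and labels, for illustration only; the rules used
# in the article are not described in the abstract.
PATTERNS = [
    # Dates written as year.month.day. (e.g. 2024.03.15.)
    ("DATE", re.compile(r"\b\d{4}\.\s?\d{1,2}\.\s?\d{1,2}\.")),
    # Nine-digit identifiers, optionally grouped in threes
    ("ID", re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")),
    # Phone numbers with the +36 country prefix
    ("PHONE", re.compile(r"\+36[ -]?\d{1,2}[ -]?\d{3}[ -]?\d{3,4}\b")),
]


def pre_annotate(text):
    """Return non-overlapping (start, end, label) spans, sorted by start."""
    candidates = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            candidates.append((match.start(), match.end(), label))
    # Prefer earlier matches; among matches starting together, the longer one.
    candidates.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    spans = []
    last_end = 0
    for start, end, label in candidates:
        if start >= last_end:  # skip anything overlapping an accepted span
            spans.append((start, end, label))
            last_end = end
    return spans


def redact(text, spans):
    """Replace each annotated span with a bracketed label placeholder."""
    parts = []
    pos = 0
    for start, end, label in spans:
        parts.append(text[pos:start])
        parts.append(f"[{label}]")
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


if __name__ == "__main__":
    sample = "Felvétel: 2024.03.15. TAJ: 123 456 789, tel: +36 30 123 4567"
    print(redact(sample, pre_annotate(sample)))
    # prints: Felvétel: [DATE] TAJ: [ID], tel: [PHONE]
```

Character-offset spans of this shape are a common input format for training named entity recognition models, which is why the sketch returns offsets rather than redacting directly. Names, addresses, and other free-form PII would not be caught by patterns like these, which is presumably where manual annotation and the trained models come in.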

Keywords: de-identification, Electronic Health Records (EHR), General Data Protection Regulation (GDPR), Hungarian, Low-resource language, Named Entity Recognition (NER), natural language processing (NLP)

Received: 28 Feb 2025; Accepted: 29 Apr 2025.

Copyright: © 2025 Berzi, Berényi, Képes, Antal, Varga and Emri. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: András Berzi, Division of Nuclear Medicine and Translational Imaging, Department of Medical Imaging, Faculty of Medicine, University of Debrecen, Debrecen, Hungary

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.