AUTHOR=Berzi András, Berényi Ervin, Képes Zita, Antal Barnabás, Varga Ábrahám Gergely, Emri Miklós

TITLE=NLP-based removal of personally identifiable information from Hungarian electronic health records

JOURNAL=Frontiers in Artificial Intelligence

VOLUME=Volume 8 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1585260

DOI=10.3389/frai.2025.1585260

ISSN=2624-8212

ABSTRACT=Introduction: Electronic health records (EHR) in text format serve as crucial resources for data-driven medical research. Under the General Data Protection Regulation (GDPR), strict measures are required to ensure that personal data are anonymized or pseudonymized to protect patient confidentiality. Natural language processing has consistently proven effective in automating the de-identification of sensitive information. Methods: We present spaCy models to recognize personally identifiable information (PII) in a wide range of free-text medical records written in Hungarian, a low-resource language. To develop these models, we compiled a corpus of clinical documents by annotating sensitive information within electronic health records sourced from the University of Debrecen. To simplify the annotation process, we pre-annotated the documents using a rule-based method. The corpus comprises over 15,000 documents and includes more than 90,000 instances of PII. We trained several models using this corpus and also developed a separate validation corpus to assess their performance. Results: The performance evaluation of the de-identification models on the developed corpora resulted in F1-scores ranging from 0.9697 to 0.9926. On the validation corpora, the F1-scores ranged from 0.9772 to 0.9867, demonstrating that the models can effectively handle previously unseen examples. Our risk analysis revealed that 99.67% of the sensitive information was successfully removed from the validation dataset. Discussion: The results indicate that, like other state-of-the-art systems, our model is highly effective at identifying PII in clinical texts, so that sensitive information in clinical documents can be protected without sacrificing the quality or usability of the data for research purposes.
Despite these positive outcomes, several areas remain to be improved, such as additional testing on diverse datasets, particularly those from different healthcare institutions. With ongoing refinements, these models have the potential to greatly enhance the efficiency of data de-identification processes, ensuring compliance with privacy regulations while promoting the secure sharing of medical data for scientific progress.