METHODS article

Front. Digit. Health

Sec. Connected Health

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1610228

This article is part of the Research Topic: Unlocking the Potential of Health Data: Interoperability, Security, and Emerging Challenges in AI, LLM, Precision Medicine, and Their Impact on Healthcare and Research

Secure Latent Dirichlet Allocation *

Provisionally accepted
Thijs Veugen1,2*, Vincent Dunning1, Michiel Marcus1, Bart Kamphorst1
  • 1Netherlands Organisation for Applied Scientific Research, Amsterdam, Netherlands
  • 2University of Twente, Enschede, Netherlands

The final, formatted version of the article will be published soon.

Topic modelling refers to a popular set of techniques used to discover hidden topics that occur in a collection of documents. These topics can, for example, be used to categorize documents or label text for further processing. One popular topic modelling technique is Latent Dirichlet Allocation (LDA). In topic modelling scenarios, the documents are often assumed to reside in a single, centralized dataset. However, documents are sometimes held by different parties and contain privacy-sensitive or commercially sensitive information that cannot be shared. We present a novel, decentralized approach to train an LDA model securely without having to share any information about the content of the documents. We preserve the privacy of the individual parties using a combination of privacy-enhancing technologies. In addition to the secure LDA protocol, we introduce two new cryptographic building blocks that are of independent interest: a way to efficiently convert between secret-shared and homomorphically encrypted data, and a method to efficiently draw a random number from a finite set with secret weights. We show that our decentralized, privacy-preserving LDA solution achieves accuracy similar to that of an (insecure) centralized approach. With 1024-bit Paillier keys, a topic model with 5 topics and 3000 words can be trained in around 16 hours. Furthermore, we show that the solution scales linearly in the total number of words and the number of topics.
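
To make the second building block concrete, the following is a minimal cleartext sketch of drawing an element from a finite set according to given weights. It is not the authors' secure protocol: in the setting described above the weights are secret and never visible to any single party, whereas here they are plain integers. The function name draw_weighted_index and the choice of non-negative integer weights are assumptions made purely for illustration.

```python
import random
from itertools import accumulate


def draw_weighted_index(weights, rng=random):
    """Return index i with probability weights[i] / sum(weights).

    Cleartext illustration only (hypothetical helper, not from the article):
    the weights are ordinary non-negative integers visible to the caller.
    """
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be a non-empty list of non-negative integers")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Draw a uniform integer r in [0, total).
    r = rng.randrange(total)

    # Return the first index whose cumulative weight exceeds r.
    for index, cumulative in enumerate(accumulate(weights)):
        if r < cumulative:
            return index
    raise AssertionError("unreachable: r is always below the total weight")


if __name__ == "__main__":
    # Example: three candidate topics with weights 2, 5 and 3.
    print(draw_weighted_index([2, 5, 3]))
```

The sketch uses a uniform draw followed by a comparison against cumulative sums. Whether the article's secure method follows the same structure is not stated in the abstract.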

Keywords: Latent Dirichlet allocation, secure multi-party computation, Shamir secret sharing, Paillier cryptosystem, topic modelling (LDA)

Received: 11 Apr 2025; Accepted: 03 Jul 2025.

Copyright: © 2025 Veugen, Dunning, Marcus and Kamphorst. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Thijs Veugen, Netherlands Organisation for Applied Scientific Research, Amsterdam, Netherlands

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.