ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Technology Implementation

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1608370

This article is part of the Research Topic: Digital Medicine and Artificial Intelligence

TRACE: Applying AI Language Models to Extract Ancestry Information from Curated Biomedical Literature

Provisionally accepted
Alison M Veintimilla1, Cjomyam K. Acharya2, Connie J. Mulligan2, Erika Moore1*, Ruogu Fang2*
  • 1University of Maryland, College Park, College Park, United States
  • 2University of Florida, Gainesville, Florida, United States

The final, formatted version of the article will be published soon.

Ancestry reporting is essential to ensure transparency and proper representation in biomedical studies. However, manually extracting this information from study texts is time-consuming and inefficient. In this paper, we present TRACE (Tool for Researching Ancestry and Cell Extraction), powered by GPT-4 and web-crawling, to automate ancestry identification by detecting cell lines or cultures in texts and tracing their ancestry. TRACE extracts cell lines and primary cultures from research articles and follows web sources to determine their ancestry. We compared TRACE’s outputs to a manually generated database to confirm its performance in identifying and verifying ancestry information. The results reveal an overrepresentation of European/White samples and significant underreporting of ancestry. TRACE enables large-scale, systematic ancestry analysis, making it a valuable resource for researchers and agencies assessing biases in sample selection. As an open-source tool, it facilitates broader use to evaluate and improve ancestry representation in biomedical research.
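
The two-stage flow described in the abstract (find cell lines in a text, then trace each one to an ancestry label and tally the result) can be illustrated with a minimal sketch. This is not the TRACE implementation: simple name matching stands in for the GPT-4 extraction step, a small in-memory table stands in for the web-crawling step, and the cell-line names (LINE-A to LINE-D), their ancestry labels, and all function names are hypothetical placeholders invented for the example.

```python
import re
from collections import Counter

# Hypothetical lookup standing in for the web-crawling step. The names and
# ancestry labels are invented placeholders, not real cell-line data.
TOY_ANCESTRY = {
    "LINE-A": "European/White",
    "LINE-B": "European/White",
    "LINE-C": "African",
    "LINE-D": None,  # the source reports no ancestry for this line
}


def extract_cell_lines(text, known_names):
    """Stand-in for the language-model step: match known names as whole tokens."""
    found = []
    for name in known_names:
        # Reject matches that are part of a longer word or hyphenated token.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, text):
            found.append(name)
    return found


def trace_ancestry(cell_lines, lookup):
    """Map each detected cell line to an ancestry label, or 'Unreported'."""
    return {name: lookup.get(name) or "Unreported" for name in cell_lines}


def summarize(traced):
    """Count how many detected cell lines fall under each ancestry label."""
    return Counter(traced.values())


article = "Cells from LINE-A and LINE-D were cultured; LINE-B served as control."
lines = extract_cell_lines(article, TOY_ANCESTRY)
traced = trace_ancestry(lines, TOY_ANCESTRY)
summary = summarize(traced)
print(summary)
```

Keeping an explicit "Unreported" category, rather than dropping cell lines with no ancestry label, is what lets a tally of this kind show underreporting alongside overrepresentation.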

Keywords: Ancestry Representation, Automated Text Mining, Cell Line Identification, Biomedical Research Equity, cells

Received: 08 Apr 2025; Accepted: 31 Jul 2025.

Copyright: © 2025 Veintimilla, Acharya, Mulligan, Moore and Fang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Erika Moore, University of Maryland, College Park, College Park, United States
Ruogu Fang, University of Florida, Gainesville, 32609, Florida, United States

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.