ORIGINAL RESEARCH article
Front. Neuroinform.
Volume 19 - 2025 | doi: 10.3389/fninf.2025.1609077
Large Language Models Can Extract Metadata for Annotation of Human Neuroimaging Publications
Provisionally accepted
- 1 Department of Psychiatry and Behavioral Health, College of Medicine, The Ohio State University, Columbus, Ohio, United States
- 2 Department of Medical Electronics Engineering, B.M.S. College of Engineering, Bengaluru, India
- 3 Faculty of Information, University of Toronto, Toronto, Ontario, Canada
- 4 Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio, United States
- 5 School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States
- 6 Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bengaluru, India
- 7 Carolina Health Informatics Program, University of North Carolina, Chapel Hill, North Carolina, United States
We show that recent (mid-to-late 2024) commercial large language models (LLMs) can perform good-quality metadata extraction and annotation with very little investigator effort, for several exemplar real-world annotation tasks in the neuroimaging literature. We investigated OpenAI's GPT-4o LLM, which performed comparably to several groups of specially trained and supervised human annotators. The LLM achieves performance similar to the humans, between 0.91 and 0.97, on zero-shot prompts without feedback to the LLM. Reviewing the disagreements between the LLM and gold-standard human annotations, we note that actual LLM errors are comparable to human errors in most cases, and that many of the disagreements are not errors at all. For the specific types of annotations we tested, against carefully reviewed gold-standard correct values, the LLM's performance is usable for metadata annotation at scale. We encourage other research groups to develop and make available more specialized "micro-benchmarks," like the ones we provide here, for testing the annotation performance of both LLMs and more complex agent systems on real-world metadata annotation tasks.
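To illustrate the kind of zero-shot workflow the abstract describes, here is a minimal sketch of building a zero-shot chat prompt that asks an LLM to extract metadata fields from a publication as JSON. The field names and prompt wording are illustrative assumptions, not the authors' actual annotation schema or prompts.

```python
# Illustrative only: these field names are assumptions, not the
# annotation schema used in the study.
FIELDS = ["imaging_modality", "participant_group", "scanner_field_strength"]

def build_zero_shot_messages(article_text, fields=FIELDS):
    """Build a zero-shot chat prompt (no examples, no feedback) asking an
    LLM to extract metadata fields from a neuroimaging publication."""
    system = (
        "You are a metadata annotator for human neuroimaging publications. "
        "Extract the requested fields and answer with a single JSON object. "
        "Use null for any field not stated in the text."
    )
    user = (
        "Fields to extract: " + ", ".join(fields) + "\n\n"
        "Article text:\n" + article_text
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# The messages could then be sent with the OpenAI Python SDK, e.g.
#   client.chat.completions.create(model="gpt-4o", messages=messages)
# and the model's JSON reply parsed with json.loads().
messages = build_zero_shot_messages("We scanned 24 participants at 3T ...")
```

The key property of a zero-shot prompt is that it contains only the task description and the input text, with no worked examples and no corrective feedback between articles.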
Keywords: Large language models, Metadata annotation, Information Extraction, Human Neuroimaging, ontologies, document annotation, text mining
Received: 09 Apr 2025; Accepted: 28 Jul 2025.
Copyright: © 2025 Turner, Appaji, Ar Rakib, Golnari, Rajasekar, Rathnam K V, Sahoo, Wang, Wang and Turner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Matthew D Turner, Department of Psychiatry and Behavioral Health, College of Medicine, The Ohio State University, Columbus, Ohio 43210, United States
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.