ORIGINAL RESEARCH article
Front. Hum. Neurosci.
Sec. Interacting Minds and Brains
Volume 19 - 2025 | doi: 10.3389/fnhum.2025.1633272
This article is part of the Research Topic: Promoting brain health and emotional regulation through prosocial behaviors, social connectedness, and AI.
LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks
Provisionally accepted
- 1Google (United Kingdom), London, United Kingdom
- 2Google, Cambridge, MA, United States
- 3Google, London, United Kingdom
- 4Google DeepMind, London, United Kingdom
- 5Google, New York, United States
- 6Applied Physics Lab, Johns Hopkins University, Maryland, United States
- 7Work done at Google Research via Harvey Nash, London, United Kingdom
- 8Google, Seattle, United States
- 9Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
This paper examines the extent to which large language models (LLMs) can perform tasks that require higher-order theory of mind (ToM): the human ability to reason recursively about multiple mental and emotional states (e.g., "I think that you believe that she knows"). It builds on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs of varying sizes and training paradigms against a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance, respectively, on our ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest an interplay between model size and fine-tuning in higher-order ToM performance, and that the linguistic abilities of large models may support more complex ToM inferences. Given the important role that higher-order ToM plays in group social interaction and relationships, these findings have significant implications for the development of a broad range of social, educational, and assistive LLM applications.
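To make the notion of "order" concrete, the sketch below (illustrative only; the function name, agents, and example sentence are ours and are not drawn from the Multi-Order Theory of Mind Q&A suite) shows how each additional order nests one more mental-state attribution inside the last, so that a 6th-order inference embeds six such attributions before reaching the underlying fact.

```python
# Illustrative sketch (not the paper's test suite): how recursive "orders"
# of theory of mind compose. A 1st-order statement attributes one mental
# state; each additional order wraps another attribution around it.

def nested_tom_statement(agents, verbs, base_fact):
    """Build a higher-order ToM statement.

    `agents` and `verbs` are parallel lists ordered from the outermost
    reasoner inward; `base_fact` is the innermost proposition.
    The order of the resulting statement equals len(agents).
    """
    statement = base_fact
    # Wrap attributions from the innermost agent outward.
    for agent, verb in zip(reversed(agents), reversed(verbs)):
        statement = f"{agent} {verb} that {statement}"
    return statement


if __name__ == "__main__":
    # 3rd-order example: "I think that you believe that she knows that ..."
    print(nested_tom_statement(
        agents=["I", "you", "she"],
        verbs=["think", "believe", "knows"],
        base_fact="the meeting was moved",
    ))
```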
Keywords: Large language models, Theory of Mind, AI, social cognition, Mentalizing
Received: 22 May 2025; Accepted: 21 Oct 2025.
Copyright: © 2025 Street, Siy, Keeling, Baranes, Barnett, McKibben, Kanyere, Agüera y Arcas and Dunbar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Winnie Street, istreet@google.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.