ORIGINAL RESEARCH article
Front. Hum. Neurosci.
Sec. Interacting Minds and Brains
Volume 19 - 2025 | doi: 10.3389/fnhum.2025.1633272
This article is part of the Research Topic: Promoting brain health and emotional regulation through prosocial behaviors, social connectedness, and AI.
LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks
Provisionally accepted
- 1Google (United Kingdom), London, United Kingdom
- 2Google, Cambridge, MA, United States
- 3Google, London, United Kingdom
- 4Google DeepMind, London, United Kingdom
- 5Google, New York, United States
- 6Applied Physics Lab, Johns Hopkins University, Maryland, United States
- 7Work done at Google Research via Harvey Nash, London, United Kingdom
- 8Google, Seattle, United States
- 9Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
This paper examines the extent to which large language models (LLMs) can perform tasks that require higher-order theory of mind (ToM): the human ability to reason recursively about multiple mental and emotional states (e.g., "I think that you believe that she knows"). It builds on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs of varying sizes and training paradigms against a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance, respectively, on our ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest an interplay between model size and fine-tuning in higher-order ToM performance, and that the linguistic abilities of large models may support more complex ToM inferences. Given the important role that higher-order ToM plays in group social interaction and relationships, these findings have significant implications for the development of a broad range of social, educational, and assistive LLM applications.
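To make the notion of "order" concrete, the sketch below (illustrative only; the function name, agents, and example sentence are ours and are not drawn from the Multi-Order Theory of Mind Q&A suite) shows how each additional order nests one more mental-state attribution inside the last, so that a 6th-order inference embeds six such attributions before reaching the underlying fact.

```python
# Illustrative sketch (not the paper's test suite): how recursive "orders"
# of theory of mind compose. A 1st-order statement attributes one mental
# state; each additional order wraps another attribution around it.

def nested_tom_statement(agents, verbs, base_fact):
    """Build a higher-order ToM statement.

    `agents` and `verbs` are parallel lists ordered from the outermost
    reasoner inward; `base_fact` is the innermost proposition.
    The order of the resulting statement equals len(agents).
    """
    statement = base_fact
    # Wrap attributions from the innermost agent outward.
    for agent, verb in zip(reversed(agents), reversed(verbs)):
        statement = f"{agent} {verb} that {statement}"
    return statement


if __name__ == "__main__":
    # 3rd-order example: "I think that you believe that she knows that ..."
    print(nested_tom_statement(
        agents=["I", "you", "she"],
        verbs=["think", "believe", "knows"],
        base_fact="the meeting was moved",
    ))
```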
Keywords: Large language models, Theory of Mind, AI, social cognition, Mentalizing
Received: 22 May 2025; Accepted: 21 Oct 2025.
Copyright: © 2025 Street, Siy, Keeling, Baranes, Barnett, McKibben, Kanyere, Agüera y Arcas and Dunbar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Winnie Street, istreet@google.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.