ORIGINAL RESEARCH article

Front. Hum. Neurosci.

Sec. Interacting Minds and Brains

Volume 19 - 2025 | doi: 10.3389/fnhum.2025.1633272

This article is part of the Research Topic: Promoting brain health and emotional regulation through prosocial behaviors, social connectedness, and AI.

LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks

Provisionally accepted
Winnie Street1*, John Oliver Siy2, Geoff Keeling3, Adrien Baranes4, Benjamin Barnett5, Michael McKibben6, Tatenda Kanyere7, Blaise Agüera y Arcas8, Robin Dunbar9
  • 1Google (United Kingdom), London, United Kingdom
  • 2Google, Cambridge, MA, United States
  • 3Google, London, United Kingdom
  • 4Google DeepMind, London, United Kingdom
  • 5Google, New York, United States
  • 6Applied Physics Lab, Johns Hopkins University, Maryland, United States
  • 7Work done at Google Research via Harvey Nash, London, United Kingdom
  • 8Google, Seattle, United States
  • 9Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom

The final, formatted version of the article will be published soon.

This paper examines the extent to which large language models (LLMs) are able to perform tasks that require higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite, Multi-Order Theory of Mind Q&A, and using it to compare the performance of five LLMs of varying sizes and training paradigms against a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on our ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that there is an interplay between model size and finetuning for higher-order ToM performance, and that the linguistic abilities of large models may support more complex ToM inferences. Given the important role that higher-order ToM plays in group social interaction and relationships, these findings have significant implications for the development of a broad range of social, educational and assistive LLM applications.

Keywords: Large language models, Theory of Mind, AI, social cognition, Mentalizing

Received: 22 May 2025; Accepted: 21 Oct 2025.

Copyright: © 2025 Street, Siy, Keeling, Baranes, Barnett, McKibben, Kanyere, Agüera y Arcas and Dunbar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Winnie Street, istreet@google.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.