
PERSPECTIVE article

Front. Syst. Neurosci., 17 November 2025

Volume 19 - 2025 | https://doi.org/10.3389/fnsys.2025.1683133

This article is part of the Research Topic: Neurobiological foundations of cognition and progress towards Artificial General Intelligence.

Will multimodal large language models ever achieve deep understanding of the world?

  • 1Department of Applied Informatics, Comenius University Bratislava, Bratislava, Slovakia
  • 2Czech Institute of Informatics, Robotics and Cybernetics, Prague, Czechia
  • 3Department of Informatics, University of Hamburg, Hamburg, Germany

Despite impressive performance in various tasks, large language models (LLMs) are subject to the symbol grounding problem, so from the cognitive science perspective, one can argue that they are merely statistics-driven distributional models without a deeper understanding. Modern multimodal versions of LLMs (MLLMs) try to avoid this problem by linking language knowledge with other modalities such as vision (Vision Language Models, VLMs) or action (Vision Language Action models, VLAs), for instance when a robotic agent is acting in the world. If eventually successful, MLLMs could be taken as a pathway to symbol grounding. In this work, we explore the extent to which MLLMs integrated with embodied agents can achieve such grounded understanding through interaction with the physical world. We argue that closing the gap between symbolic tokens, neural representations, and embodied experience will require deeper developmental integration of continuous sensory data, goal-directed behavior, and adaptive neural learning in real-world environments. We raise the concern that MLLMs do not currently achieve a human-like level of deep understanding, largely because their random learning trajectory deviates significantly from human cognitive development. Humans typically acquire knowledge incrementally, building complex concepts upon simpler ones in a structured developmental progression. In contrast, MLLMs are often trained on vast, randomly ordered datasets. This non-developmental approach, which circumvents structured simple-to-complex conceptual scaffolding, inhibits the ability to build a deep and meaningful grounded knowledge base, posing a significant challenge to achieving human-like semantic comprehension.

1 Introduction

With the emergence of AI in the 1950s and the subsequent rise of neural machine learning approaches from the 1990s onward, generative AI has progressed significantly over the last decade, resulting in large language models (LLMs) trained on vast amounts of text (Minaee et al., 2025). These huge language models can be used, when accompanied by appropriate interface modules, for various linguistic tasks (Brown et al., 2020). Since language describes the world, LLMs contain knowledge of the world encoded in neural networks, with all its implications. Language has strong expressive power, yet it is discrete (since words are symbols) and certain world knowledge is hard to describe in words (consider the saying that a picture is worth a thousand words, or the skill of driving a car).

In this paper, we approach the concept of understanding as a graded phenomenon, that is, a matter of degree, which can be tested behaviorally. We look at understanding from the perspective of grounding, which enables a learning system to acquire intrinsic meanings (semantics) autonomously (Harnad, 1990). In a nutshell, deep understanding is assumed to be human-like by definition. LLMs are argued to lack understanding because they are completely ungrounded. Current approaches that combine LLMs with various modalities are assumed to provide shallow understanding, and hence the main question is whether this path can eventually lead to human-like deep understanding.

1.1 Symbol grounding problem

LLMs are subject to the symbol grounding problem (SGP) (Harnad, 1990), because the meanings of the words they generate are not grounded in the world (Huang et al., 2023; Vavrečka and Farkaš, 2014). More specifically, the SGP has been further developed into the vector grounding problem (Mollo and Millière, 2023), because in an LLM, words are not represented as symbols but subsymbolically, as high-dimensional vectors (representing the meaning). Despite this change, the grounding problem remains, because the vector components are not connected to the world either, but only to other symbols.
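To make this point concrete, the following minimal sketch (our own illustration, not taken from the cited works; the toy corpus, window size, and dimensionality are arbitrary assumptions) builds word vectors purely from co-occurrence statistics. Every component of the resulting vectors is defined only relative to other words; nothing in the procedure connects them to perception or action. Real LLMs learn contextual embeddings rather than such static vectors, but the grounding issue is the same.

    # Purely distributional "meaning": word vectors built only from
    # co-occurrence with other words, never from the world (toy example).
    import numpy as np

    corpus = [
        "the dog chased the cat",
        "the cat chased the mouse",
        "the dog ate the food",
        "the cat ate the food",
    ]
    tokens = sorted({w for s in corpus for w in s.split()})
    index = {w: i for i, w in enumerate(tokens)}

    # Symmetric co-occurrence counts within a +/-2 word window.
    C = np.zeros((len(tokens), len(tokens)))
    for s in corpus:
        words = s.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - 2), min(len(words), i + 3)):
                if i != j:
                    C[index[w], index[words[j]]] += 1

    # Positive pointwise mutual information followed by SVD gives dense vectors.
    total = C.sum()
    p_w = C.sum(axis=1, keepdims=True) / total
    pmi = np.log((C / total + 1e-12) / (p_w * p_w.T + 1e-12))
    ppmi = np.maximum(pmi, 0)
    U, S, _ = np.linalg.svd(ppmi)
    vectors = U[:, :3] * S[:3]          # 3-dimensional word vectors

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # "dog" and "cat" end up with similar vectors because they occur in similar
    # verbal contexts, not because the system ever perceived or touched anything.
    print("cosine(dog, cat) =", round(cosine(vectors[index["dog"]], vectors[index["cat"]]), 2))

The similarity between dog and cat here is entirely word-to-word; this relational-only structure is precisely what the vector grounding problem refers to.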

As such, the representations produced by LLMs are generally still decoupled from perceptual and sensorimotor experience, keeping the system domain closed and unable to develop intrinsic meaning or intentionality (Bisk et al., 2020; Incao et al., 2024). This limitation has significant implications for applications that require an understanding of situated meaning, such as embodied AI, human–robot interaction, or multimodal perception. Even when LLMs are augmented with external tools or coupled with sensors and actuators (as in robotics), bridging the gap between symbolic tokens, neural representations, and embodied experience remains a major challenge (Tellex et al., 2020).

Recent research has proposed various strategies to mitigate this problem, such as grounding language in visual perception (Bender and Koller, 2020; Šejnová et al., 2018; Zhang et al., 2023), action (Mao et al., 2019; Szot et al., 2024), or interactive dialogue (Yu et al., 2020; Jokinen and Wilcock, 2024). However, these efforts often fall short of full grounding, since they still rely heavily on pre-trained LLMs with static linguistic priors. This means that language is first developed separately and only later becomes linked to other modalities. Addressing the vector grounding problem may ultimately require architectures that integrate language, perception, and action in a tightly coupled developmental framework, where meaning emerges from ongoing interaction with the world (Barsalou, 2008; Cangelosi and Asada, 2022; Kerzel et al., 2023).

1.2 Turing test and irrationality

LLMs have passed the Turing test and have progressed impressively in various tasks (Jones and Bergen, 2025). They have even been considered candidates for Artificial General Intelligence (Bubeck et al., 2023). However, shortcomings have been identified, labeled “hallucinations” (Banerjee et al., 2024), that also involve errors of reasoning, although newer versions of LLMs are improved compared to their predecessors (Mirzadeh et al., 2025). LLMs still show irrationality, as humans do, but in different ways (Macmillan-Scott and Musolesi, 2024). When LLMs give incorrect answers to reasoning tasks, they are often incorrect in ways that differ from human-like biases. Their errors stem from statistical pattern matching on vast text corpora without true comprehension, rather than from the evolutionarily developed cognitive biases and emotional states that influence human judgment. In addition, LLMs reveal a further layer of irrationality in the significant inconsistency of their responses. This key aspect of AI will likely be investigated in more depth in the near future (Macmillan-Scott and Musolesi, 2025).

Recent research has shown that these inconsistencies can span not just factual hallucinations, but also logical reasoning, moral judgments, and even self-contradiction within the same prompt or across similar prompts (Lin et al., 2021; Perez et al., 2023). Similar problems were also detected in modular multimodal grounding architectures, where an imbalanced training dataset led to logical inconsistencies in counting tasks (Šejnová et al., 2018). These inconsistencies raise questions about the reliability and interpretability of LLM-generated outputs, particularly in high-stakes applications. As such, ongoing research on alignment, calibration, and consistency is crucial to mitigate these shortcomings and to align LLMs more closely with human expectations and rational norms (Ouyang et al., 2022; Ganguli et al., 2023).

Another fundamental problem with LLMs is their vulnerability to adversarial attacks (Wang et al., 2021; Zou et al., 2023), which means that the trained model can easily be fooled by well-crafted input (prompts, in the case of language) into producing irrational or wrong answers. This problem is general and applies to deep models in various domains, be it computer vision, games, or language (Ren et al., 2020). Adversarial machine learning has become a modern research path that aims to overcome this vulnerability, but only time will tell whether it will offer a principled solution (Pelekis et al., 2025).

2 Building AI with grounded knowledge

Based on recent developments in MLLMs (e.g., Allgeuer et al., 2025; Ali et al., 2025), we can see that there exist two approaches, differing in several respects, to building AI systems that could achieve a deeper understanding. We will call them developmental and non-developmental; they are illustrated in Figure 1.


Figure 1. Simplified sketch contrasting two approaches to building AI agents with grounded knowledge. Developmental approach: Thick arrows denote rich (perception-action) interaction between the embodied agent and the multimodal world (represented by the globe). World knowledge (represented by language) is grounded gradually within developmental stages, going from more concrete toward more abstract concepts (shown by decreasing level of gray color), simultaneously with the world interaction. Non-developmental approach: Thick arrow from the world denotes dense information flow (represented mainly by images and videos), whereas the thin arrow denotes limited acting in the world, making the interaction very imbalanced. As for language, symbol grounding occurs in pre-given batches (using an LLM), regardless of their concreteness, with co-occurring perceptual data (from the world).

2.1 Developmental approach

The developmental approach is represented by the way humans acquire intelligence during their ontogeny, as also addressed in cognitive developmental robotics (Asada et al., 2009). This neurobiologically inspired approach draws on empirical literature showing how development progresses in stages, in the spirit of curriculum learning (Parisi et al., 2019), and how abstract conceptual knowledge is built on top of the understanding of concrete concepts in specific and reliable ways (Yee, 2019).

The developmental approach is compatible with theories of grounded cognition such as conceptual metaphor theory (Lakoff and Johnson, 1980), simulation theory (Barsalou, 2008), or the action semantics framework (van Elk et al., 2013). Symbol grounding is enabled by two pathways, as argued in Reinboth and Farkaš (2022). The first pathway is direct and mainly involves action/perception, interoception, and emotions. The second pathway is indirect, being mediated by language in particular. This means, in simplified terms, that concrete words (e.g., dog, bus) are well grounded via the direct pathway, as they typically refer to objects in the world, whereas abstract words (e.g., truth, democracy) rely more on language, since they lack direct world referents. As LLM training data consist of disembodied text, which lacks the sensory, emotional, and proprioceptive inputs that form the basis of interoception, LLMs are fundamentally cut off from the direct pathway. They engage only in the indirect pathway, building complex linguistic associations while missing the foundational direct-grounding pathway.

In developmental robotics, the agent gradually acquires knowledge in a number of modalities (perception, action/proprioception, and language). Hence, linguistic knowledge becomes grounded from the very beginning1 and progresses during development, going hand in hand with the development of other cognitive functions. In addition, each AI agent that acquires knowledge uses its own body (single embodiment), which constrains the brain–body–world relationship.

This embodiment constraint not only provides sensory and motor contingencies, but also helps structure experience in ways that facilitate abstraction (Pfeifer and Bongard, 2007; Cangelosi and Schlesinger, 2015). Furthermore, the importance of social interaction has been widely emphasized, as it enables scaffolding, joint attention, and cultural learning mechanisms that are critical to the acquisition of language and abstract concepts (Tomasello, 1999). The developmental approach argues that truly general intelligence cannot emerge without a grounding in temporally extended experience, sensorimotor interaction, and socially mediated learning processes (Qureshi et al., 2025). This aligns with previous suggestions for the integration of developmental psychology, neuroscience, and robotics to build more human-like AI systems (Lungarella et al., 2003).

2.2 Non-developmental approach

The non-developmental approach is represented by modern generative AI, namely (M)LLMs. In this case, LLMs have been trained on vast linguistic corpora (i.e., including both concrete and abstract words) and hence are subject to the vector grounding problem. All words are treated in the same way, since distributional statistics are computed for all of them, regardless of their natural age of acquisition in humans (abstract words are typically acquired later). The non-developmental paradigm is extensively documented in a recent comprehensive survey by Li et al. (2025a), spanning early modular, perception-driven systems to today's unified, language-centric frameworks.

While we acknowledge that modern MLLMs often employ sophisticated training strategies, including forms of curriculum learning and staged fine-tuning, our term non-developmental refers to a more fundamental distinction. The curricula in MLLMs, however complex, are ultimately designed by humans to optimize learning on a static, pre-collected dataset and rely on passive data association. In contrast, a developmental curriculum is self-generated by the agent's own sensorimotor engagement with its environment, where learning is driven by the real-time consequences of action and exploration. The critical difference is not the presence or absence of a curriculum, but whether that curriculum relies on external passive data or is intrinsically generated through active, embodied experience.
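The toy sketch below, under assumed task dynamics (five tasks where progress on a harder task requires near-mastery of the previous one), is meant only to illustrate where the ordering of experience comes from in the two regimes: the non-developmental learner consumes a pre-collected, randomly ordered batch, whereas the developmental learner chooses its next activity from its own current learning progress. It is not a claim about the performance of any actual MLLM.

    # Externally imposed curriculum (static, shuffled data) vs. a self-generated
    # one (the agent picks what to practice next from its own learning progress).
    # Task dynamics are illustrative assumptions.
    import random

    N_TASKS = 5        # task 0 is the simplest concept, task 4 the most abstract
    STEPS = 400

    def gain(skill, task):
        """Progress on a task is fast only if the simpler prerequisite is mastered."""
        prereq_ok = task == 0 or skill[task - 1] > 0.8
        return 0.02 if prereq_ok else 0.002

    def practice(skill, task):
        skill[task] = min(1.0, skill[task] + gain(skill, task))

    def non_developmental():
        """Consume a pre-collected, randomly ordered batch of experiences."""
        skill = [0.0] * N_TASKS
        for task in [random.randrange(N_TASKS) for _ in range(STEPS)]:
            practice(skill, task)
        return skill

    def developmental():
        """At each step, practice the task with the highest current learning
        progress (computed directly here; a real agent would estimate it from
        its own recent attempts)."""
        skill = [0.0] * N_TASKS
        for _ in range(STEPS):
            task = max(range(N_TASKS),
                       key=lambda t: min(1.0, skill[t] + gain(skill, t)) - skill[t])
            practice(skill, task)
        return skill

    random.seed(0)
    print("non-developmental skills:", [round(s, 2) for s in non_developmental()])
    print("developmental skills:    ", [round(s, 2) for s in developmental()])

In this toy setting, the self-generated curriculum attempts abstract tasks only once their prerequisites are in place, whereas under random ordering the most abstract tasks barely improve; the point is not the numbers but where the ordering of experience originates.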

As a result of this difference, LLMs often fail to differentiate between the groundedness of concrete words versus abstract words, despite evidence that such distinctions are cognitively and neurally meaningful in humans (Pulvermüller, 2013). Moreover, the lack of embodied interaction and temporally extended learning leads to deficits in causal and commonsense reasoning, as LLMs lack experiential continuity and episodic memory (Lake et al., 2017; Bender et al., 2021). Although these models show remarkable linguistic fluency, they remain detached from the interactive multimodal learning that characterizes human intelligence (Zador, 2019). This disconnect highlights the limitations of non-developmental systems in achieving robust generalization and true understanding.

3 State-of-the-art multimodal LLMs

Compared to LLMs, MLLMs process multiple modalities (text, images, audio, video, and structured data). They are trained to integrate knowledge from multiple sources and often perform very well in tasks such as visual question answering, caption generation, and cross-modal retrieval (Ge et al., 2024). Models that integrate only vision and language are called VLMs and are used in tasks without a motor modality (Wu et al., 2024).

Given the “passive nature” of VLMs, more recent models attempt to include the motor modality. This modality is special, as it is tied to a concrete embodiment that defines all degrees of freedom (Vemprala et al., 2024; Li et al., 2024b; Lan et al., 2025). As such, it is not trivial to generate huge collections of data to be used later for learning (Peng et al., 2024). Moreover, motor interaction datasets are often specific to particular robotic platforms and require physical execution, making large-scale, standardized motor data scarce and costly to produce. This creates a bottleneck for training general-purpose embodied agents, in contrast to the relative abundance of text or image datasets. Training data for multimodal systems can be created in simulation (Vavrečka et al., 2021; Li et al., 2024a), but the resulting models need to be fine-tuned for real robots.
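As an illustration of this simulation-to-reality workflow, the sketch below (toy one-dimensional dynamics and made-up parameter ranges, not tied to any specific robotic platform) pretrains a tiny dynamics model on abundant simulated transitions with randomized physics and then fine-tunes it on a handful of real-world transitions while staying close to the pretrained solution.

    # Pretrain a tiny dynamics model on randomized simulated data, then
    # fine-tune it on scarce "real" transitions. All numbers are toy assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    def transitions(a, b, n):
        """Generate n transitions of the 1-D dynamics x' = a*x + b*u + noise."""
        x = rng.uniform(-1, 1, n)
        u = rng.uniform(-1, 1, n)
        x_next = a * x + b * u + 0.01 * rng.standard_normal(n)
        return np.stack([x, u], axis=1), x_next

    # Abundant simulated data with randomized physics (domain randomization).
    X_sim, y_sim = [], []
    for _ in range(200):
        a, b = rng.uniform(0.7, 1.1), rng.uniform(0.1, 0.5)
        Xs, ys = transitions(a, b, 50)
        X_sim.append(Xs)
        y_sim.append(ys)
    X_sim, y_sim = np.vstack(X_sim), np.concatenate(y_sim)

    # Scarce data from the (unknown) physical robot.
    A_REAL, B_REAL = 1.05, 0.45
    X_real, y_real = transitions(A_REAL, B_REAL, 10)

    # Pretrain on simulation, then fine-tune on real data while penalizing
    # deviation from the pretrained weights (ridge toward the sim solution).
    w_sim, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)
    lam = 1.0
    w_ft = np.linalg.solve(X_real.T @ X_real + lam * np.eye(2),
                           X_real.T @ y_real + lam * w_sim)

    X_test, y_test = transitions(A_REAL, B_REAL, 1000)
    for name, w in [("sim only", w_sim), ("fine-tuned", w_ft)]:
        print(name, "test error:", round(float(np.mean((X_test @ w - y_test) ** 2)), 4))

The randomized pretraining gives a reasonable average model, and a few real interactions suffice to correct its bias; in practice, of course, real robot dynamics are far higher-dimensional and the gap is correspondingly harder to close.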

This recent category of MLLMs is represented by Vision Language Action models (VLAs) (Kim et al., 2025), which can be described as interactive systems or agents that take advantage of VLMs or MLLMs as components to perceive, reason, and act within a given environment (Šejnová et al., 2024). The newest VLA models (Zhao et al., 2025) are capable of reasoning about possible future states by adopting chain-of-thought reasoning (Zhao et al., 2024; Chen et al., 2025), wherein the model predicts helpful intermediate representations before choosing actions. These models integrate high-level planning with low-level control and are frequently designed to operate under partially observable conditions using dynamic world models. VLA systems are often embodied in humanoid robots and tested in real-world tasks, ranging from manipulation and navigation to goal-driven interaction (NVIDIA et al., 2025; Team et al., 2025). For a detailed review of VLA capabilities, see Din et al. (2025).

The best-known VLA models include Google's RT-2 (Brohan et al., 2023), which combines VLMs with robotic control policies, and PaLM-E (Driess et al., 2023), a general-purpose embodied MLLM that processes visual, linguistic, and proprioceptive inputs. DeepMind's Gato MLLM (Reed et al., 2022) also stands out as an apparently universal agent capable of playing Atari games, captioning images, and controlling a robot arm using the same neural architecture. Other efforts like SayCan (Ahn et al., 2022) and VIMA (Jiang et al., 2023) emphasize instruction-following through natural language grounding, enabling flexible and scalable robotic behaviors. These models demonstrate the increasing capability of VLA systems to bridge perception, cognition, and action, paving the way toward more general and adaptive embodied agents. A very recent effort, NVIDIA's Cosmos-Reason1, directly tackles this grounding problem by building models specialized in physical commonsense and embodied reasoning (Azzolini et al., 2025). The approach employs a structured two-stage training process (supervised fine-tuning and reinforcement learning) on newly curated data sets that are explicitly designed around detailed ontologies of principles of the physical world.

MLLMs integrate multiple sensory inputs and excel at perception-language tasks, but they follow a non-developmental approach, relying on massive pre-training datasets without embodied interaction, which leads to the vector grounding problem, where meaning is derived only from other symbols. In contrast, VLA models add the motor modality, enabling agents to perceive, reason, and act in real-world environments, making them more suitable for addressing the SGP. Although VLA systems (e.g., RT-2, PaLM-E, Gato, GR00T, or Pi0) begin to incorporate elements of a developmental approach through embodiment and interaction, they still often lack the gradual, stage-wise learning and experiential grounding characteristic of human development. As constructive feedback, Marshall and Barron (2025) suggest that instead of focusing on scaling up the transformer models behind (M)LLMs, researchers should look at biological evolution and the elegant solutions it has produced to create truly autonomous agents. By learning from the efficiency and robustness of natural systems, the field of robotics can move toward the development of truly intelligent machines.

An alternative approach to creating human-like representations is based on explicit world models: internal predictive simulations of an environment that allow an agent to reason about causality and to plan by imagining the consequences of possible actions (Ha and Schmidhuber, 2018). The primary distinction from the VLA architecture is functional and internal: while a VLA is defined by its ability to produce motor commands, a world model is an internal component used for deliberative planning by simulating what-if scenarios before acting (LeCun, 2022). Recent advances have demonstrated that MLLMs are capable of learning implicit world models directly from internet-scale video data, enabling them to generate novel, interactive, and controllable environments from a single image prompt, thus capturing a deeper understanding of environmental dynamics than a purely reactive VLA (Reid et al., 2024). On the other hand, the MLLM's understanding is confined to statistical correlations rather than a genuine grasp of cause and effect, critically limiting its ability to plan over the long term or to generalize to truly novel situations.
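The planning role of a world model can be illustrated with a deliberately simple sketch (a one-dimensional toy environment of our own; real world models are learned, high-dimensional, and uncertain): the agent scores candidate action sequences by rolling them out inside its internal model and only then commits to an action, which is the essence of planning by imagining what-if scenarios.

    # Acting with an internal world model: imagine the consequences of candidate
    # action sequences before committing. Environment and reward are toy assumptions.
    import random

    GOAL, START = 7, 0
    ACTIONS = (-1, 0, +1)

    def world_model(state, action):
        """Internal predictive model: the state the agent expects after an action."""
        return max(0, min(10, state + action))

    def imagined_return(state, plan):
        """Score a candidate plan by rolling it out inside the model only."""
        total = 0
        for action in plan:
            state = world_model(state, action)
            total -= abs(GOAL - state)        # staying close to the goal is better
        return total

    def plan_next_action(state, horizon=5, n_candidates=128):
        """Random-shooting planning: imagine many futures, keep the first action
        of the best imagined one."""
        candidates = [[random.choice(ACTIONS) for _ in range(horizon)]
                      for _ in range(n_candidates)]
        best = max(candidates, key=lambda plan: imagined_return(state, plan))
        return best[0]

    random.seed(1)
    state = START
    for _ in range(15):
        if state == GOAL:
            break
        action = plan_next_action(state)
        state = world_model(state, action)    # here the model happens to be exact
    print(f"final state: {state} (goal: {GOAL})")

The crucial design choice is that deliberation happens entirely inside the model; a purely reactive policy, by contrast, maps the current observation to an action without ever simulating alternatives.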

4 Discussion

Our paper asks the fundamental question of whether MLLMs have the potential to ever achieve understanding of the world. In other words: using the above-mentioned non-developmental approaches, can MLLMs in principle learn causal relations based on common sense about the world? Examples of MLLMs typically focus on capturing correlations among modalities, but correlation does not imply causation, and many causal effects are hidden from observation. Zhu et al. (2020) identify five core domains of cognitive AI with human-like common sense (functionality, physics, intent, causality, and utility). They argue that the next generation of AI must embrace human-like common sense to solve novel tasks.

Although MLLMs show impressive pattern recognition across modalities, they struggle to infer latent variables, predict unseen consequences of actions, or distinguish cause from coincidence, skills critical to robust common sense (Li et al., 2025b). As highlighted by Pearl and Mackenzie (2018), causal reasoning requires interventions and counterfactual thinking, both of which are still missing in current MLLMs due to their lack of embodiment and interactive experience. Models trained passively on data cannot easily form causal mental models or simulate hypothetical scenarios. In contrast, systems with active learning, embodied exploration, and developmental grounding (as proposed in cognitive robotics) are better suited to acquire and generalize causal knowledge, because these properties support imitation of the developmental approach. Therefore, without mechanisms for intervention and iterative feedback, non-developmental MLLMs are limited in achieving true causal understanding and thus fall short of capturing the deeper structure of human-like common sense. It is acknowledged that there exist MLLMs trained with reinforcement learning from human feedback (RLHF), as an early attempt to include human knowledge, but in those RLHF systems the feedback is typically given at a single point in time rather than continually.
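A toy simulation can make the observation/intervention gap explicit. In the sketch below (an assumed structural model with a hidden common cause; all numbers are arbitrary), a learner that only fits passive data attributes a strong effect to X, whereas an agent that can intervene, i.e., set X itself, recovers the much smaller true causal effect.

    # Why passive observation is not enough for causal knowledge: a hidden common
    # cause makes X and Y strongly correlated even though intervening on X barely
    # changes Y. Purely illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000
    TRUE_EFFECT = 0.1                     # actual causal effect of X on Y

    def observe(n):
        """Passive data: a hidden cause Z drives both X and Y."""
        z = rng.standard_normal(n)
        x = z + 0.1 * rng.standard_normal(n)
        y = TRUE_EFFECT * x + 2.0 * z + 0.1 * rng.standard_normal(n)
        return x, y

    def intervene(n, x_value):
        """Interventional data: X is set by the agent, cutting the link from Z."""
        z = rng.standard_normal(n)
        x = np.full(n, x_value)
        return TRUE_EFFECT * x + 2.0 * z + 0.1 * rng.standard_normal(n)

    # A purely distributional learner sees a strong X-Y association ...
    x, y = observe(N)
    slope = np.cov(x, y)[0, 1] / np.var(x)
    print("association learned from passive data:", round(slope, 2))   # ~2.1

    # ... whereas acting on the world (do(X=1) vs. do(X=0)) reveals the true effect.
    effect = intervene(N, 1.0).mean() - intervene(N, 0.0).mean()
    print("effect estimated from intervention:   ", round(effect, 2))  # ~0.1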

Despite processing various input modalities, MLLMs lack a nonverbal world model – an internal, structured representation of the physical and social world that exists independently of language. Their understanding is rooted in statistical associations across data modalities, but these models do not build or manipulate mental simulations of the world the way humans do. Consequently, current MLLMs cannot truly reason without language. Their reasoning, perception, and decision-making processes are tightly coupled to textual representations. Unlike humans, who can form visual or sensorimotor imaginations and reason through spatial or embodied experience, MLLMs rely on verbal structures even for tasks that seem inherently non-verbal. This language dependency limits their ability for intuitive physical reasoning, spatial understanding, or mental imagery and makes their cognition fundamentally symbol-based rather than grounded in perceptual-motor reality.

These limitations of MLLMs are closely related to the hypothesis of linguistic relativity, which asserts that language influences cognition (Ottenheimer, 2009). The softer version suggests that language shapes thought and perception (linguistic relativism). MLLMs are compatible with the stronger version of linguistic relativity (linguistic determinism), since they rely substantially on language-based representations to reason about and understand the world. Their “cognition” is hence constrained by the linguistic information they are trained on. The absence of a non-verbal world model in MLLMs restricts their ability to form flexible, language-independent concepts or an intuitive understanding of the world.

The limitations of MLLMs are also evident when viewed through the lens of mental imagery and theories of mental representation. According to depictive theories (e.g., Kosslyn, 1994), mental imagery involves spatial, quasi-perceptual representations that resemble visual experiences, whereas descriptive theories (e.g., Pylyshyn, 2003) argue that mental representations are symbolic and language-like. MLLMs operate solely within the descriptive domain, relying on text-based processing and lacking the capacity for depictive, image-like mental representations. As a result, they are unable to simulate mental imagery or reason spatially in a perceptual sense, limiting their ability to perform tasks that humans solve through visualization, such as imagining physical transformations or mentally rotating objects.

On the other hand, given that progress in MLLMs is very fast, improved models may appear in the forthcoming years, narrowing the gap toward understanding. On that path, it will be inevitable to go beyond learning statistical correlations in order to reveal causal mechanisms. This will clearly involve a stronger role for the motor modality, which facilitates this process in the context of affecting the world. Perhaps well-functioning MLLMs will not be body-agnostic but rather tied to a concrete body; this remains an open question. At the same time, well-trained nonlinguistic modules will have to be included to take care of all reasoning that does not require language. Last but not least, sophisticated control mechanisms may be required to carefully link modalities together, loosely resembling human development (following growing complexity).

In summary, it remains to be seen whether these fundamental problems of MLLMs can be overcome if we assume that progress will occur in nonlinguistic AI models (supporting nonlinguistic cognition and reasoning) that will be integrated with LLMs. However, it seems that stage-wise world-based grounded integration of modalities is critical for embodied acquisition of common sense, and hence deep understanding of the world (Feng et al., 2025; Szot et al., 2025).

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

IF: Project administration, Supervision, Methodology, Writing – original draft, Funding acquisition, Visualization, Investigation, Conceptualization, Writing – review & editing. MV: Conceptualization, Writing – original draft, Writing – review & editing. SW: Funding acquisition, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This research was supported by the Horizon Europe project TERAIS, grant agreement no. 10107933. IF was also partially supported by the Slovak Grant Agency for Science (VEGA), project 1/0373/23. MV was also supported by the Czech Science Foundation (GACR), project 23-04080L.

Conflict of interest

The authors declare that the research was conducted in the absence of commercial or financial relationships that could be construed as a potential conflict of interest.

Correction note

This article has been corrected with minor changes. These changes do not impact the scientific content of the article.

Generative AI statement

The author(s) declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^In this context, we can ignore the difference that humans are typically exposed to spoken language, whereas LLMs work with text, since from the current perspective, speech-to-text technology (and vice versa) is becoming a solved problem.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., et al. (2022). “Do as I can, not as I say: Grounding language in robotic affordances,” in 5th Conference on Robot Learning (CoRL) (Auckland: PMLR).

Google Scholar

Ali, H., Allgeuer, P., and Wermter, S. (2025). Comparing apples to oranges: LLM-powered multimodal intention prediction in an object categorization task. Soc. Robot. 15563, 292–306. doi: 10.1007/978-981-96-3525-2_25

Crossref Full Text | Google Scholar

Allgeuer, P., Ahrens, K., and Wermter, S. (2025). “Unconstrained open vocabulary image classification: Zero-shot transfer from text to image via CLIP inversion,” in Proceedings of the Winter Conference on Applications of Computer Vision (WACV) (Tucson, AZ: IEEE), 8206–8217.

Google Scholar

Asada, M., Hosoda, K., Kuniyoshi, Y., Ishiguro, H., Inui, T., Yoshikawa, Y., et al. (2009). Cognitive developmental robotics: a survey. IEEE Trans. Autonom. Mental Dev. 1, 12–34. doi: 10.1109/TAMD.2009.2021702

Crossref Full Text | Google Scholar

Azzolini, A., Bai, J., Cao, J., Chattopadhyay, P., Chen, H., Cui, Y., et al. (2025). Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv [preprint] arXiv:2503.15558. doi: 10.48550/arXiv.2503.15558

Crossref Full Text | Google Scholar

Banerjee, S., Agarwal, A., and Singla, S. (2024). LLMs will always hallucinate, and we need to live with this. arXiv [preprint] arXiv:2409.05746. doi: 10.1007/978-3-031-99965-9_39

Crossref Full Text | Google Scholar

Barsalou, L. W. (2008). Grounded cognition. Annual Rev. Psychol. 59, 617–645. doi: 10.1146/annurev.psych.59.103006.093639

PubMed Abstract | Crossref Full Text | Google Scholar

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). “On the dangers of stochastic parrots: Can language models be too big?,” in ACM Conference on Fairness, Accountability, and Transparency (New York: ACM), 610–623. doi: 10.1145/3442188.3445922

Crossref Full Text | Google Scholar

Bender, E. M., and Koller, A. (2020). “Climbing towards NLU: on meaning, form, and understanding in the age of data,” in Annual Meeting of the Association for Computational Linguistics (ACL), 5185–5198.

Google Scholar

Bisk, Y., Holtzman, A., Thomason, J., Andreas, J., Bengio, Y., Chai, J. Y., et al. (2020). Experience grounds language. arXiv [preprint] arXiv:2004.10151. doi: 10.48550/arXiv.2004.10151

Google Scholar

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv [preprint] arXiv:2307.15818. doi: 10.48550/arXiv.2307.15818

Crossref Full Text | Google Scholar

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, 33.

Google Scholar

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., et al. (2023). Sparks of artificial general intelligence: early experiments with GPT-4. arXiv [preprint] arXiv:2303.12712. doi: 10.48550/arXiv.2303.12712

Crossref Full Text | Google Scholar

Cangelosi, A., and Asada, M. (2022). Cognitive Robotics. Cambridge, MA: MIT Press.

Google Scholar

Cangelosi, A., and Schlesinger, M. (2015). Developmental Robotics: From Babies to Robots. Cambridge, MA: MIT Press.

Google Scholar

Chen, W., Belkhale, S., Mirchandani, S., Mees, O., Driess, D., Pertsch, K., et al. (2025). Training strategies for efficient embodied reasoning. arXiv [preprint] arXiv:2505.08243. doi: 10.48550/arXiv.2505.08243

Crossref Full Text | Google Scholar

Din, M. U., Akram, W., Saoud, L. S., Rosell, J., and Hussain, I. (2025). Vision language action models in robotic manipulation: a systematic review. arXiv:2507.10672. doi: 10.48550/arXiv.2507.10672

Crossref Full Text | Google Scholar

Driess, D., Xia, F., Gharbi, M., Toshev, A., Chowdhery, A., Ichter, B., et al. (2023). PaLM-E: An embodied multimodal language model. arXiv [preprint] arXiv:2303.03378. doi: 10.48550/arXiv.2303.03378

Crossref Full Text | Google Scholar

Feng, T., Wang, X., Jiang, Y.-G., and Zhu, W. (2025). Embodied AI: From LLMs to world models. arXiv [preprint] arXiv:2509.20021. doi: 10.36227/techrxiv.175977432.27129012/v1

Crossref Full Text | Google Scholar

Ganguli, D., Joseph, N., Askell, A., Bai, Y., Lukošiūtė, K., Chen, A., et al. (2023). The capacity for moral self-correction in large language models. arXiv [preprint] arXiv:2302.07459. doi: 10.48550/arXiv.2302.07459

Crossref Full Text | Google Scholar

Ge, Z., Huang, H., Zhou, M., Li, J., Wang, G., Tang, S., et al. (2024). WorldGPT: Empowering LLM as Multimodal World Model. Dublin: ACM Multimedia.

Google Scholar

Ha, D., and Schmidhuber, J. (2018). World models. arXiv [preprint] arXiv:1803.10122. doi: 10.48550/arXiv.1803.10122

Crossref Full Text | Google Scholar

Harnad, S. (1990). The symbol grounding problem. Physica D 42, 335–346. doi: 10.1016/0167-2789(90)90087-6

Crossref Full Text | Google Scholar

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., et al. (2023). “Language is not all you need: Aligning perception with language models,” in Advances in Neural Information Processing Systems. New Orleans, LA: Curran Associates, Inc.

Google Scholar

Incao, S., Mazzola, C., Belgiovine, G., and Sciutti, A. (2024). “A roadmap for embodied and social grounding in LLMs,” in Robophilosophy Conference. Amsterdam: IOS Press.

Google Scholar

Jiang, Y., Cai, J., Martín-Martín, R., and Fei-Fei, L. (2023). VIMA: General robot manipulation with multimodal prompts. arXiv [preprint] arXiv:2210.03094. doi: 10.48550/arXiv.2210.03094

Crossref Full Text | Google Scholar

Jokinen, K., and Wilcock, G. (2024). “The need for grounding in LLM-based dialogue systems,” in Proceedings of the 1st Workshop on Grounding in LLMs (Torino, Italia: ACL), 20–25.

Google Scholar

Jones, C. R., and Bergen, B. K. (2025). Large language models pass the Turing test. arXiv [preprint] arXiv:2503.23674. doi: 10.48550/arXiv.2503.23674

Crossref Full Text | Google Scholar

Kerzel, M., Allgeuer, P., Strahl, E., Frick, N., Habekost, J.-G., Eppe, M., et al. (2023). NICOL: A neuro-inspired collaborative semi-humanoid robot that bridges social interaction and reliable manipulation. IEEE Access 11, 123531–123542. doi: 10.1109/ACCESS.2023.3329370

Crossref Full Text | Google Scholar

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., et al. (2025). “OpenVLA: An open-source vision-language-action model,” in 8th Conference on Robot Learning, 2679–2713.

Google Scholar

Kosslyn, S. M. (1994). Image and Brain: The Resolution of the Imagery Debate. Cambridge, MA: MIT Press.

Google Scholar

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behav. Brain Sci. 40:e253. doi: 10.1017/S0140525X16001837

PubMed Abstract | Crossref Full Text | Google Scholar

Lakoff, G., and Johnson, M. (1980). Metaphors We Live By. Chicago: Chicago University Press.

Google Scholar

Lan, G., Qu, K., Zurbrügg, R., Chen, C., Mower, C. E., Bou-Ammar, H., et al. (2025). Experience is the best teacher: Grounding VLMs for robotics through self-generated memory. arXiv [preprint] arXiv:2507.16713. doi: 10.48550/arXiv.2507.16713

Crossref Full Text | Google Scholar

LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Available online at: https://openreview.net/pdf?id=BZ5a1r-kVsf (Accessed October 15, 2025).

Google Scholar

Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., et al. (2024a). BEHAVIOR-1K: a human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation. arXiv [preprint] arXiv:2403.09227. doi: 10.48550/arXiv.2403.09227

Crossref Full Text | Google Scholar

Li, X., Zhang, M., Geng, Y., Geng, H., Long, Y., Shen, Y., et al. (2024b). ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation. arXiv [preprint] arXiv:2312.16217. doi: 10.48550/arXiv.2312.16217

Crossref Full Text | Google Scholar

Li, Y., Liu, Z., Li, Z., Zhang, X., Xu, Z., Chen, X., et al. (2025a). Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv [preprint] arXiv:2505.04921. doi: 10.48550/arXiv.2505.04921

Crossref Full Text | Google Scholar

Li, Z., Wang, H., Liu, D., Zhang, C., Ma, A., Long, J., et al. (2025b). Multimodal causal reasoning benchmark: challenging vision large language models to discern causal links across modalities. arXiv [preprint] arXiv:2408.08105. doi: 10.18653/v1/2025.findings-acl.288

Crossref Full Text | Google Scholar

Lin, S., Hilton, J., and Evans, O. (2021). TruthfulQA: measuring how models mimic human falsehoods. arXiv [preprint] arXiv:2109.07958. doi: 10.18653/v1/2022.acl-long.229

Crossref Full Text | Google Scholar

Lungarella, M., Metta, G., Pfeifer, R., and Sandini, G. (2003). Developmental robotics: a survey. Connect. Sci. 15, 151–190. doi: 10.1080/09540090310001655110

Crossref Full Text | Google Scholar

Macmillan-Scott, O., and Musolesi, M. (2024). (Ir)rationality and cognitive biases in large language models. Royal Soc. Open Sci. 11:240255. doi: 10.1098/rsos.240255

PubMed Abstract | Crossref Full Text | Google Scholar

Macmillan-Scott, O., and Musolesi, M. (2025). (Ir)rationality in AI: state of the art, research challenges and open questions. Artif. Intellig. Rev. 58. doi: 10.1007/s10462-025-11341-4

Crossref Full Text | Google Scholar

Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., and Wu, J. (2019). “The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision,” in International Conference on Learning Representations (New Orleans, LA: OpenReview).

Google Scholar

Marshall, J., and Barron, A. (2025). Are transformers truly foundational for robotics? NPJ Robotics 3:9. doi: 10.1038/s44182-025-00025-4

Crossref Full Text | Google Scholar

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., et al. (2025). Large language models: A survey. arXiv [preprint] arXiv:2402.06196. doi: 10.48550/arXiv.2402.06196

Crossref Full Text | Google Scholar

Mirzadeh, S. I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., and Farajtabar, M. (2025). “GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models,” in International Conference on Learning Representations (Singapore: OpenReview).

Google Scholar

Mollo, D. C., and Millière, R. (2023). The vector grounding problem. arXiv [preprint] arXiv:2304.01481. doi: 10.48550/arXiv.2304.01481

Crossref Full Text | Google Scholar

NVIDIA, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., et al. (2025). GR00T N1: An open foundation model for generalist humanoid robots. arXiv [preprint] arXiv:2503.14734. doi: 10.48550/arXiv.2503.14734

Crossref Full Text | Google Scholar

Ottenheimer, H. (2009). The Anthropology of Language: An Introduction to Linguistic Anthropology. Wadsworth: Cengage Learning.

Google Scholar

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv [preprint] arXiv:2203.02155. doi: 10.48550/arXiv.2203.02155

Crossref Full Text | Google Scholar

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: a review. Neural Netw. 113:54–71. doi: 10.1016/j.neunet.2019.01.012

PubMed Abstract | Crossref Full Text | Google Scholar

Pearl, J., and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. New York City: Basic Books.

Google Scholar

Pelekis, S., Koutroubas, T., Blika, A., Berdelis, A., Karakolis, E., Ntanos, C., et al. (2025). Adversarial machine learning: a review of methods, tools, and critical industry sectors. Artif. Intellig. Rev. 58. doi: 10.1007/s10462-025-11147-4

Crossref Full Text | Google Scholar

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., et al. (2024). “Grounding multimodal large language models to the world,” in International Conference on Learning Representations (Vienna: OpenReview).

Google Scholar

Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., et al. (2023). Discovering language model behaviors with model-written evaluations. arXiv [preprint] arXiv:2212.09251. doi: 10.48550/arXiv.2212.09251

Crossref Full Text | Google Scholar

Pfeifer, R., and Bongard, J. (2007). How the Body Shapes the Way We Think: A New View of Intelligence. Cambridge, MA: MIT Press.

Google Scholar

Pulvermüller, F. (2013). How neurons make meaning: Brain mechanisms for embodied and abstract-symbolic semantics. Trends Cognit. Sci. 17, 458–470. doi: 10.1016/j.tics.2013.06.004

PubMed Abstract | Crossref Full Text | Google Scholar

Pylyshyn, Z. W. (2003). Seeing and Visualizing: It's Not What You Think. Cambridge, MA: MIT Press. doi: 10.7551/mitpress/6137.001.0001

Crossref Full Text | Google Scholar

Qureshi, R., Sapkota, R., Shah, A., Muneer, A., Zafar, A., Vayani, A., et al. (2025). Thinking beyond tokens: from brain-inspired intelligence to cognitive foundations for artificial general intelligence and its societal impact. arXiv [preprint] arXiv:2507.00951. doi: 10.48550/arXiv.2507.00951

Crossref Full Text | Google Scholar

Reed, S., Koerding, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., et al. (2022). A generalist agent. arXiv [preprint] arXiv:2205.06175. doi: 10.48550/arXiv.2205.06175

Crossref Full Text | Google Scholar

Reid, T., Grigsby, J., Stooke, A. Z., Hafner, D., Fischbacher, T., Morgan, D. S., et al. (2024). Genie: Generative interactive environments. arXiv [preprint] arXiv:2402.15391. doi: 10.48550/arXiv.2402.15391

Crossref Full Text | Google Scholar

Reinboth, T., and Farkaš, I. (2022). Ultimate grounding of abstract concepts: A graded account. J. Cognit. 5:21. doi: 10.5334/joc.214

PubMed Abstract | Crossref Full Text | Google Scholar

Ren, K., Zheng, T., Qin, Z., and Liu, X. (2020). Adversarial attacks and defenses in deep learning. Engineering 6, 346–360. doi: 10.1016/j.eng.2019.12.012

Crossref Full Text | Google Scholar

Šejnová, G., Tesař, M., and Vavrečka, M. (2018). Compositional models for VQA: Can neural module networks really count? Procedia Comp. Sci. 145, 481–487. doi: 10.1016/j.procs.2018.11.110

Crossref Full Text | Google Scholar

Šejnová, G., Vavrečka, M., and Štepánová, K. (2024). “Bridging language, vision and action: Multimodal VAEs in robotic manipulation tasks,” in IEEE/RSJ International Conference on Robots and Systems (IROS) (Abu Dhabi: IEEE), 12522–12528.

Google Scholar

Szot, A., Mazoure, B., Agrawal, H., Hjelm, R. D., Kira, Z., and Toshev, A. (2024). Grounding multimodal large language models in actions. arXiv [preprint] arXiv:2406.07904. doi: 10.48550/arXiv.2406.07904

Crossref Full Text | Google Scholar

Szot, A., Mazoure, B., Attia, O., Timofeev, A., Agrawal, H., Hjelm, D., et al. (2025). From multimodal LLMs to generalist embodied agents: methods and lessons. arXiv [preprint] arXiv:2412.08442. doi: 10.1109/CVPR52734.2025.00995

Crossref Full Text | Google Scholar

Team, G. R., Abeyruwan, S., and Zhou, Y. (2025). Gemini robotics: bringing AI into the physical world. arXiv [preprint] arXiv:2503.20020.

Google Scholar

Tellex, S., Knepper, R. A., Li, A., Rus, D., and Roy, N. (2020). Asking for help using inverse semantics. Int. J. Robot. Res. 39, 74–92.

Google Scholar

Tomasello, M. (1999). The Cultural Origins of Human Cognition. Cambridge, MA: Harvard University Press.

Google Scholar

van Elk, M., van Schie, H., and Bekkering, H. (2013). Action semantics: A unifying conceptual framework for the selective use of multimodal and modality-specific object knowledge. Phys. Life Rev. 11, 220–250. doi: 10.1016/j.plrev.2013.11.005

PubMed Abstract | Crossref Full Text | Google Scholar

Vavrečka, M., and Farkaš, I. (2014). A multimodal connectionist architecture for unsupervised grounding of spatial language. Cognit. Comput. 6, 101–112. doi: 10.1007/s12559-013-9212-5

Crossref Full Text | Google Scholar

Vavrečka, M., Sokovnin, N., Mejdrechová, M., and Šejnová, G. (2021). “myGym: Modular toolkit for visuomotor robotic tasks,” in IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI) (Washington, DC: IEEE), 834–839. doi: 10.1109/ICTAI52525.2021.00046

Crossref Full Text | Google Scholar

Vemprala, S. H., Bonatti, R., Bucker, A., and Kapoor, A. (2024). ChatGPT for robotics: Design principles and model abilities. IEEE Access 12, 55682–55696. doi: 10.1109/ACCESS.2024.3387941

Crossref Full Text | Google Scholar

Wang, B., Xu, C., Wang, S., Gan, Z., Cheng, Y., Gao, J., et al. (2021). “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” in 35th Conference on Neural Information Processing Systems (NeurIPS) (Curran Associates, Inc.).

Google Scholar

Wu, J., Zhong, M., Xing, S., Lai, Z., Liu, Z., Chen, Z., et al. (2024). VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. arXiv:2406.08394. doi: 10.48550/arXiv.2406.08394

Crossref Full Text | Google Scholar

Yee, E. (2019). Abstraction and concepts: when, how, where, what and why? Lang. Cognit. Neurosci. 34, 1257–1265. doi: 10.1080/23273798.2019.1660797

Crossref Full Text | Google Scholar

Yu, T., Lin, X. V., Yang, Z., Yavuz, S., Li, P. P., and Li, X. V. (2020). “GraPPa: Grammar-augmented pre-training for table semantic parsing,” in Conference of the Association for Computational Linguistics (ACL), 1339–1352.

Google Scholar

Zador, A. M. (2019). A critique of pure learning and what artificial neural networks can learn from animal brains. Nat. Commun. 10:3770. doi: 10.1038/s41467-019-11786-6

Crossref Full Text | Google Scholar

Zhang, Y., Liu, X., Chen, Z., Yu, J., and Chai, J. (2023). “Grounding visual illusions in language: Do vision-language models perceive illusions like humans?,” in Conference on Empirical Methods in Natural Language Processing (Singapore: ACL), 2571–2585.

Google Scholar

Zhao, Q., Lu, Y., Kim, M. J., Fu, Z., Zhang, Z., Wu, Y., et al. (2025). CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. arXiv:2503.22020. doi: 10.1109/CVPR52734.2025.00166

Crossref Full Text | Google Scholar

Zhao, X., Li, M., Lu, W., Weber, C., Lee, J. H., Chu, K., et al. (2024). “Enhancing zero-shot chain-of-thought reasoning in large language models through logic,” in Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING) (Torino: ELRA and ICCL), 6144–6166.

Google Scholar

Zhu, Y., Gao, T., Fan, L., Huang, S., Edmonds, M., Liu, H., et al. (2020). Dark, beyond deep: A paradigm shift to cognitive AI with humanlike common sense. Engineering 6, 310–345. doi: 10.1016/j.eng.2020.01.011

Crossref Full Text | Google Scholar

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043. doi: 10.48550/arXiv.2307.15043

Crossref Full Text | Google Scholar

Keywords: symbol grounding problem, embodied cognition, large language model, modalities, integration, development

Citation: Farkaš I, Vavrečka M and Wermter S (2025) Will multimodal large language models ever achieve deep understanding of the world? Front. Syst. Neurosci. 19:1683133. doi: 10.3389/fnsys.2025.1683133

Received: 10 August 2025; Accepted: 17 October 2025;
Published: 17 November 2025; Corrected: 24 November 2025.

Edited by:

Yan Mark Yufik, Virtual Structures Research Inc., United States

Reviewed by:

Yufen Wei, Bangor University, United Kingdom
Yuyan Xue, University of Cambridge, United Kingdom

Copyright © 2025 Farkaš, Vavrečka and Wermter. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Igor Farkaš, igor.farkas@fmph.uniba.sk
