REVIEW article
Front. Robot. AI
Sec. Computational Intelligence in Robotics
Volume 12 - 2025 | doi: 10.3389/frobt.2025.1668910
This article is part of the Research Topic: Embodied Intelligence for Adaptive and Lifelong Robotic Systems
A Review of Embodied Intelligence Systems: A Three-Layer Framework Integrating Multimodal Perception, World Modeling, and Structured Strategies
Provisionally accepted
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
Embodied intelligent systems build upon the foundations of behavioral robotics and classical cognitive architectures. They integrate multimodal perception, world modeling, and adaptive control to support closed-loop interaction in dynamic and uncertain environments. Recent breakthroughs in Multimodal Large Models (MLMs) and World Models (WMs) are profoundly transforming this field, providing the tools to achieve its long-envisioned capabilities of semantic understanding and robust generalization. Targeting the central challenge of how modern MLMs and WMs jointly advance embodied intelligence, this review provides a comprehensive overview across key dimensions, including multimodal perception, cross-modal alignment, adaptive decision-making, and Sim-to-Real transfer. Furthermore, we systematize these components into a three-stage theoretical framework termed "Dynamic Perception–Task Adaptation (DP-TA)". This framework integrates multimodal perception modeling, causally driven world-state prediction, and semantically guided strategy optimization, establishing a closed "perception–modeling–decision" loop. To support this, we introduce a "Feature-Conditioned Modal Alignment (F-CMA)" mechanism that enhances cross-modal fusion under task constraints.
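The abstract does not specify the internals of F-CMA, but the idea of task-conditioned cross-modal fusion can be illustrated schematically. The sketch below is purely hypothetical (the function names, the choice of cosine similarity as the relevance score, and the softmax weighting are all our assumptions, not the authors' method): each modality's feature vector is weighted by its similarity to a task embedding, and the weights are softmax-normalized before fusion.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def task_conditioned_fuse(vision_feat, language_feat, task_feat):
    """Hypothetical sketch of feature-conditioned modal alignment:
    score each modality by its task-conditioned relevance (cosine
    similarity to a task embedding), softmax-normalize the scores,
    and return the weighted combination of the modality features."""
    scores = [cosine(vision_feat, task_feat),
              cosine(language_feat, task_feat)]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]  # softmax over modality relevance
    return [alphas[0] * v + alphas[1] * l
            for v, l in zip(vision_feat, language_feat)]

# When the task embedding aligns with the vision feature, the fused
# vector is dominated by the vision modality.
fused = task_conditioned_fuse([1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [1.0, 0.0, 0.0])
```

Under this toy weighting, a task embedding pointing along the vision feature yields a larger vision weight, so the fused vector leans toward the visual modality; any real F-CMA implementation would learn such conditioning rather than hard-code it.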
Keywords: embodied AI, multimodal learning, World Models, Cross-Modal Learning, reinforcement learning, sim-to-real transfer
Received: 18 Jul 2025; Accepted: 21 Oct 2025.
Copyright: © 2025 Zhang, Tian and Xiong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: YunWei Zhang, zhangyunwei72@gmail.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.