
REVIEW article

Front. Robot. AI

Sec. Computational Intelligence in Robotics

Volume 12 - 2025 | doi: 10.3389/frobt.2025.1668910

This article is part of the Research Topic: Embodied Intelligence for Adaptive and Lifelong Robotic Systems

A Review of Embodied Intelligence Systems: A Three-Layer Framework Integrating Multimodal Perception, World Modeling, and Structured Strategies

Provisionally accepted
  • Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China

The final, formatted version of the article will be published soon.

Embodied intelligent systems build on the foundations of behavioral robotics and classical cognitive architectures. They integrate multimodal perception, world modeling, and adaptive control to support closed-loop interaction in dynamic and uncertain environments. Recent advances in Multimodal Large Models (MLMs) and World Models (WMs) are reshaping the field by providing tools for two of its long-envisioned capabilities: semantic understanding and robust generalization. Focusing on how modern MLMs and WMs jointly advance embodied intelligence, this review surveys key dimensions including multimodal perception, cross-modal alignment, adaptive decision-making, and Sim-to-Real transfer. We further organize these components into a three-stage theoretical framework termed "Dynamic Perception–Task Adaptation (DP-TA)". The framework integrates multimodal perception modeling, causally driven world state prediction, and semantically guided strategy optimization, forming a complete "perception–modeling–decision" loop. To support the framework, we introduce a "Feature-Conditioned Modal Alignment (F-CMA)" mechanism to enhance cross-modal fusion under task constraints.
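As a rough illustration of the "perception–modeling–decision" loop named in the abstract, the sketch below wires three toy stages together in a one-dimensional setting. It is not the authors' DP-TA framework or their F-CMA mechanism, neither of which is specified on this page. Every name in it (`Observation`, `perceive`, `predict`, `decide`, `run_loop`), the fixed-weight fusion, the additive dynamics, and the noise-free sensors are assumptions chosen only to keep the example short.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Observation:
    """Hypothetical two-modality reading of a 1-D position."""
    camera: float    # assumed vision-based position estimate
    odometry: float  # assumed proprioceptive position estimate


def perceive(obs: Observation, w_camera: float = 0.5) -> float:
    """Perception stage: fuse both modalities with a fixed-weight average."""
    return w_camera * obs.camera + (1.0 - w_camera) * obs.odometry


def predict(state: float, action: float) -> float:
    """World-model stage: toy dynamics where the action shifts the state."""
    return state + action


def decide(state: float, goal: float, actions: Sequence[float]) -> float:
    """Decision stage: pick the action whose predicted outcome is nearest the goal."""
    return min(actions, key=lambda a: abs(predict(state, a) - goal))


def run_loop(start: float, goal: float, steps: int = 10) -> float:
    """Run the closed loop and return the final true state."""
    actions = (-1.0, 0.0, 1.0)
    true_state = start
    for _ in range(steps):
        # Assumption: both sensors are noise-free, so they report the true state.
        obs = Observation(camera=true_state, odometry=true_state)
        estimate = perceive(obs)
        action = decide(estimate, goal, actions)
        # Assumption: the environment behaves exactly like the world model.
        true_state = predict(true_state, action)
    return true_state
```

Calling `run_loop(0.0, 3.0)` steps the toy agent toward the goal one unit at a time and then holds position. In a real system each stage would be a learned model, and the world model would only approximate the environment.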

Keywords: embodied AI, multimodal learning, world models, cross-modal learning, reinforcement learning, sim-to-real transfer

Received: 18 Jul 2025; Accepted: 21 Oct 2025.

Copyright: © 2025 Zhang, Tian and Xiong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: YunWei Zhang, zhangyunwei72@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.