- Department of Robotics Engineering, Worcester Polytechnic Institute, Worcester, MA, United States
Multimodal perception is essential for enabling robots to understand and interact with complex environments and human users by integrating diverse sensory data, such as vision, language, and tactile information. This capability plays a crucial role in decision-making in dynamic, complex environments. This survey provides a comprehensive review of advancements in multimodal perception and its integration with decision-making in robotics from 2004 to 2024. We systematically summarize existing multimodal perception-driven decision-making (MPDDM) frameworks, highlighting their advantages in dynamic environments and the methodologies employed in human-robot interaction (HRI). Beyond reviewing these frameworks, we analyze key challenges in multimodal perception and decision-making, focusing on technical integration and sensor noise, adaptation, domain generalization, and safety and robustness. Finally, we outline future research directions, emphasizing the need for adaptive multimodal fusion techniques, more efficient learning paradigms, and human-trusted decision-making frameworks to advance the HRI field.
1 Introduction
The integration of robots into diverse domains such as healthcare, industrial manufacturing, transportation, and domestic environments has accelerated dramatically in recent years. Across these applications, robots serve various purposes—from providing companionship and assistance to enabling complex collaborations with human users. Despite this diversity of contexts and functions, a fundamental requirement remains consistent: robots must interact appropriately with humans in their specific operational environments. Effective human-robot interaction (HRI) depends critically on a robot’s ability to accurately perceive and understand human users’ status, intentions, and preferences, as well as the surrounding environment. This perception must then inform appropriate decision-making and action planning to achieve specific interaction goals. Consequently, the integration of multimodal perception and decision-making has emerged as a cornerstone of modern HRI research.
However, achieving accurate perception and robust decision-making in HRI remains a significant challenge due to the inherent complexity, dynamism, and variability of human behavior (Amiri et al., 2020), individual preferences, habits, capabilities (Ji et al., 2020), and environments (Diab and Demiris, 2024). Recent advancements in multimodal perception models, such as those leveraging deep learning and large-scale vision-language frameworks (Lu et al., 2024; OpenAI, 2023), coupled with increased computational power, have significantly enhanced robotic capabilities in these areas (Kim et al., 2024; Zhou et al., 2025). These developments have enabled robots to process and fuse data from multiple sensory modalities—such as vision, speech, touch, and proprioception—to form a more comprehensive understanding of their environment and human counterparts. Despite these advancements, the integration of multimodal perception with decision-making frameworks remains an open and actively researched problem, particularly in the context of embodied intelligence for HRI.
While several surveys have explored aspects of HRI, such as multimodal perception (Wang and Feng, 2024), human behavior modeling (Robinson et al., 2023; Reimann et al., 2024), and industrial applications (Duan et al., 2024; Jahanmahin et al., 2022; Bonci et al., 2021), there is a notable gap in the literature. Existing reviews often focus on specific domains, such as manufacturing (Duan et al., 2024; Wang and Feng, 2024; Jahanmahin et al., 2022; Bonci et al., 2021), or narrow aspects of HRI, such as vision (Robinson et al., 2023) or dialogue management (Reimann et al., 2024). To our knowledge, no comprehensive survey has systematically examined the interplay between multimodal perception and decision-making across diverse application domains, including healthcare, manufacturing, and transportation. This gap motivates our work.
In this survey, we present a comprehensive review of over two decades of research on Multimodal Perception-Driven Decision-Making (MPDDM) methods in embodied intelligence for HRI. Our primary objective is to analyze how these systems leverage multimodal perception to enable more efficient and accurate decision-making. Specifically, we systematically examine: (1) the sources and types of multimodal sensing data, (2) methodologies for data fusion and perception, (3) decision-making frameworks, and (4) architectures that integrate perception and decision-making. Through this analysis, we identify key challenges and limitations in current approaches and propose potential directions for future research.
Our contributions are threefold:
1. Comprehensive Cross-Domain Coverage: Unlike existing surveys that focus on specific domains, our work synthesizes HRI research across diverse application areas, including industrial manufacturing, healthcare, domestic settings, and transportation, among others. This cross-domain perspective provides HRI researchers with a holistic understanding of the current state of technology and methodology in multimodal perception and decision-making, potentially enabling cross-pollination of ideas between domains.
2. Focus on Multimodal Perception: While many reviews emphasize single modalities (e.g., vision or speech), our survey highlights the growing importance of multimodal perception in robotics and HRI. We explore how integrating multiple sensory modalities can enhance perception and decision-making.
3. Integration of Perception and Decision-Making: Our review not only examines multimodal perception but also discusses decision-making frameworks and their integration with perception. This dual focus offers valuable insights for researchers seeking to understand the interplay between these critical components in HRI systems.
By addressing these aspects, our survey aims to serve as a foundational resource for researchers and practitioners in the HRI community, facilitating the development of more robust and context-aware robotic systems.
This survey is structured as follows: Section 2 introduces the study selection process, including database searching, search strategies, and filtering criteria. Section 3 presents the survey findings, discussing the role of multimodal perception in decision-making, strategies for multimodal sensing data fusion, the MPDDM framework, and decision-making methods explored in previous research. Section 4 highlights key challenges, limitations, and potential future research directions in MPDDM within the HRI domain.
2 Methodology
2.1 Search and selection strategy
To ensure a comprehensive and systematic review, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009). Figure 1 shows the process of identification, screening, eligibility, and inclusion in this survey. Our search strategy incorporated multiple electronic databases, including Google Scholar, SpringerLink, Web of Science, IEEE Xplore, ScienceDirect, ACM Digital Library, and Scopus, to identify relevant literature on multimodal perception-driven decision-making in human-robot interaction (HRI).
We constructed Boolean search queries based on key terms and their variations to maximize relevant results. The primary search query used in all databases was: (“multimodal perception” OR “multi-modal perception” OR “multisensory perception”) AND “human-robot interaction” AND (“decision-making” OR “decision making”). To refine our search, we restricted results to studies published between 2004 and 2024.
Using this search query, we obtained 502 hits from Google Scholar, 30 hits from SpringerLink, two hits from Web of Science, one hit from IEEE Xplore, 43 hits from ACM Digital Library, 30 hits from ScienceDirect, and three hits from Scopus. After removing duplicates, 511 articles remained for screening. Upon reviewing the article abstracts and their language of publication, we excluded 233 articles for the following reasons: (1) non-English language, or (2) lacking key research elements for this survey (multimodal perception, human-robot interaction, and decision-making). Thus, 278 articles proceeded to the eligibility review stage. From these, we selected 66 studies that met the following inclusion criteria: (1) detailed work on integrating multimodal perception and decision-making, specifically how multimodal perception aids robots in decision-making for human-robot interaction, (2) inclusion of technical implementation details, including multimodal fusion techniques and perception-driven decision-making methodologies, and (3) concrete case studies or experimental data demonstrating practical human-robot interaction applications.
3 Results
To systematically analyze the 66 selected papers following the PRISMA guidelines, we categorized and synthesized each study based on its application domain, multimodal data types, data fusion techniques, and decision-making approaches by leveraging multimodal perception. Specifically, for each paper, we (1) provide a concise summary of its application, (2) identify the types of multimodal data utilized (e.g., vision, audio, language information), (3) classify and analyze the data fusion techniques, distinguishing between Model-Agnostic and Model-Based approaches (see Section 3.3 for details), and (4) examine the decision-making strategies employed (see Section 3.5 for further discussion), with a focus on how multimodal data contributes to improved decision-making performance. A detailed breakdown of each study is presented in Table 3.
3.1 Application domains of MPDDM in HRI
Multimodal Perception-Driven Decision-Making (MPDDM) plays a crucial role in various HRI applications. By integrating multimodal perception techniques with decision-making frameworks, robots can operate in dynamic and complex environments with improved adaptability, reliability, robustness, and efficiency. Based on the reviewed literature, MPDDM applications in HRI can be categorized into four primary domains: social and assistive robotics, navigation and mobile robotics, industrial collaboration robotics, and general-purpose robotics with high-level task planning and reasoning. Furthermore, the MPDDM application domains mentioned above exhibit distinct strengths and challenges. Table 1 provides a structured comparison of these domains, highlighting their key advantages as well as limitations and practical challenges, to serve as a reference for future research and application design.
3.1.1 Social and assistive robotics
Social and assistive robotics are extensively employed in social services, primarily for social interaction, emotion recognition, speech-based dialogue, assistive healthcare, and rehabilitation robotics. These systems aim to enhance user experience and engagement in HRI and to provide companionship, care, and/or assistance. For instance, previous work designed proactive social robots capable of responding to human emotions (Al-Qaderi and Rad, 2018a), situational states (Vauf et al., 2016), and spatial cues (Ch et al., 2022). Similarly, Tang et al. (2015) developed a companion robot based on a multimodal communication architecture for the elderly. In the field of medical assistance and rehabilitation, researchers have explored the potential of MPDDM; for example, Yuan et al. (2024) developed a social robotic framework based on the Pepper robot (Pandey and Gelin, 2018) for assisting persons with Alzheimer’s dementia in executing self-care tasks, aiming to enhance their ability to complete daily routines. Additionally, Qin et al. (2023) designed a domestic service interactive robot system, integrating touch, speech, electromyographic gestures, visual gestures, and haptic information, explicitly targeting individuals with declined expressive abilities.
3.1.2 Navigation and mobile robotics
Autonomous navigation and mobile robotics leverage robotic autonomy and pre-acquired environmental knowledge to facilitate human convenience. For instance, autonomous mobile robots utilizing multimodal perception for obstacle avoidance and navigation (Zhang Y. et al., 2024; Sha, 2024; Wang, 2023; Chen et al., 2020; Roh, 2022; Xie and Dames, 2023) have been extensively studied. These studies have experimentally demonstrated that multimodal perception enhances model robustness, compensating for missing sensory modalities in dynamic environments. Meanwhile, other studies, such as (Panigrahi et al., 2023; Song D. et al., 2024; Siva and Zhang, 2022), focus on socially aware navigation. These works integrate vision, speech, and social signal analysis to enable robots to predict pedestrian trajectories, facilitating socially adaptive and human-friendly navigation strategies.
3.1.3 Industrial collaborative robotics
Industrial collaborative robotics primarily aim to enhance worker efficiency and reduce labor costs by integrating collaborative robots into manufacturing processes. This field includes classic human-robot collaborative assembly tasks (Ji et al., 2020; Forlini et al., 2024; Li et al., 2021; Belcamino et al., 2024) and palletizing robots (Baptista et al., 2024). Other research focuses on leveraging multimodal information to understand and manipulate objects. For instance, Zhang X. et al. (2023) and Lu et al. (2023) investigate multimodal attribute learning, where robots combine visual, auditory, and haptic data to classify and recognize object properties. Once object attributes are successfully identified, the next challenge is to determine how to grasp and manipulate these objects in dynamic environments. Many researchers adopt Markov Decision Processes (MDPs), such as Amiri et al. (2018) and Zhang et al. (2021), or reinforcement learning-based models, such as Balakuntala et al. (2021), to dynamically update robotic actions based on multimodal sensory feedback. More recently, end-to-end learning models, such as Zhang Z. et al. (2023), have been explored for policy generation in manipulation tasks.
3.1.4 General-purpose robotics with high-level task planning and reasoning
Unlike domain-specific applications, some research focuses on general task planning and decision reasoning across different HRI scenarios. Here, we examine how MPDDM can enable high-level planning beyond single-modal approaches. Traditional task-planning methods in robotics rely heavily on single-modal decision systems. However, in real-world environments, robots encounter uncertainties, dynamic human interactions, and ambiguous sensory inputs, making single-modal task planning insufficient. To address this, previous studies have integrated multimodal sensing data, such as visual, auditory, linguistic, and proprioceptive data, to enhance robotic task planning and situational reasoning. For instance, Forlini et al. (2024), Zhang Z. et al. (2023), Mei et al. (2024), and Song Y. et al. (2024) leverage GPT/VLM models for semantic task parsing, enabling robots to utilize large-scale vision-language models (VLMs) for end-to-end dynamic task planning. Furthermore, these systems incorporate error correction mechanisms, which allow real-time task adjustments during execution. Beyond predefined task planning, robots operating in unstructured environments must develop situational awareness (Diab and Demiris, 2024), so that they can adapt their tasks based on real-time environmental states (Amiri et al., 2020; Zhang X. et al., 2024).
3.2 Justification of multimodal perception
3.2.1 Multimodal perception
Multimodal perception refers to the study of methods for processing heterogeneous and interconnected data, encompassing both raw signals (e.g., speech, language, images) and abstract concepts (e.g., emotions). By integrating different modalities, humans can better perceive and interpret environmental information. Multimodal perception can be categorized into six primary types: language, vision, touch, acoustic, physiological, and mobile (Liang et al., 2022). Over the past decade, the rapid advancement of deep learning and embodied intelligence has significantly propelled the progress of multimodal perception-driven decision-making. In particular, the emergence of large-scale foundation models such as ChatGPT and Vision-Language-Action (VLA) frameworks has led to a new peak in multimodal development. Currently, most multimodal research focuses on vision and language integration, as researchers aim to enable robots to communicate and interpret the world similarly to humans, leveraging both linguistic reasoning and visual observation to interact with their surroundings. Such cross-modal integration enhances a robot’s ability to comprehend complex scenarios and improves system robustness in the absence of certain sensory inputs. For example, in autonomous driving, if a robot loses radar data in a complex environment, it must still navigate safely using alternative sensory inputs (e.g., cameras) (Grigorescu et al., 2020). Thus, understanding the cognitive processes involved in multimodal data fusion is essential for the future of embodied artificial intelligence.
3.2.2 Advantages of multimodal perception
Single-modal perception (e.g., vision-only, speech-only, or touch-only) has played a role in early research applications but remains significantly limited in real-world, complex human-robot interaction scenarios (Huang et al., 2021). The limitations can be delineated as follows: (1) Limited Information: A single sensor provides a restricted perceptual dimension (Wang et al., 2024), making it challenging to capture global or deep semantic information. (2) Poor Robustness: Single-modal systems are highly susceptible to noise, lighting changes, occlusions, or hardware failures, leading to performance degradation (Wang et al., 2024). (3) Lack of Accuracy and Generalization: In complex, dynamic environments, single-modal algorithms struggle to maintain high accuracy or adapt quickly (Huang et al., 2021). (4) Inability to Capture Multifaceted Human/Environment Information: Human language, emotions, intentions, and actions involve multiple signals, which a single modality alone cannot fully comprehend (Su et al., 2023).
Due to these limitations, researchers have increasingly focused on multimodal perception (Duncan et al., 2024) in recent years, aiming to integrate information from different types of sensors to handle complex, dynamic HRI scenarios. By integrating different modalities such as text/speech, vision, audio, touch, and physiological signals, multimodal perception offers advantages such as information complementarity and enhanced robustness. For instance, Al-Qaderi and Rad (2018a) and Churamani et al. (2020) demonstrated that combining auditory cues with visual inputs improved the accuracy of recognizing personal emotion and location compared to vision-only detection. Similarly, Granata et al. (2012) leveraged vision and speech to enhance the accuracy and robustness of human detection and interaction in complex scenarios. Furthermore, Wang (2023) and Khandelwal et al. (2017) found that multisensory data from both RGB-D cameras and LiDAR sensors mitigated the instability of visual-only systems and improved the robustness of navigation. Beyond robustness, multimodal perception also enhances contextual and semantic comprehension in HRI scenarios. Just like humans, who rely on the integration of multiple modalities (e.g., hearing, vision, smell) to better interpret their surroundings, multimodal perception enables robots to achieve a more comprehensive, accurate understanding of environmental states. For example, Zhang et al. (2021) enabled robots to explore and describe objects in the environment as humans do, using audio, haptics, and vision; this approach improved object description accuracy by 50% compared to vision-only exploration.
In summary, multimodal perception not only addresses the inherent weaknesses of unimodal perception but also broadens the scope of MPDDM applications, paving the way for richer human-robot collaboration. However, alongside these advantages, multimodal information also introduces challenges, such as the complexity of data fusion, multimodal representation learning—how to utilize multimodal information effectively, alignment—how to model connections across modalities to ensure accurate understanding and integration, and reasoning—how different modalities interact to influence the decision-making process. These challenges and methods will be explored in Sections 3.3 and 3.4.
3.3 Multimodal sensing data fusion strategies
Multimodal data fusion represents the cornerstone of effective perception-driven decision-making in human-robot interaction systems. This process involves systematically combining information streams from diverse sensing modalities to form a comprehensive representation of the environment. Successful fusion strategies enable robots to overcome the limitations of individual sensors, enhance perceptual robustness in challenging conditions, and develop a more complete understanding of complex human-robot interaction scenarios. In this section, we examine the primary approaches to multimodal data fusion following the framework established by Baltrušaitis et al. (2018). The fusion methodologies can be broadly categorized into two fundamental classes: model-agnostic methods and model-based methods. Model-agnostic approaches offer flexibility across different learning paradigms, while model-based techniques integrate fusion mechanisms directly within the learning architecture.
Model-agnostic methods typically operate at distinct stages of the perception pipeline, with fusion occurring at the data level (early fusion), feature level (intermediate fusion), decision level (late fusion), or through hybrid combinations spanning multiple processing stages. Meanwhile, model-based methods leverage the inherent capabilities of neural networks, kernel methods, or probabilistic graphical models to learn optimal fusion strategies during the training process. The following subsections detail these approaches, examining their theoretical foundations, implementation considerations, and relative advantages in various HRI contexts.
3.3.1 Model-agnostic methods
Early Fusion (Data-Level): In early fusion, raw or minimally processed data from different modalities are combined into a single input representation at the earliest stage. For example, in a long-term social-interaction bartending task, Rossi et al. (2024) enhanced the robot’s natural interactive operations by incorporating speech and facial expressions. Similarly, Nan et al. (2019) employed early fusion of RGB and depth images by aligning them based on time frames to improve elderly action recognition. The early-fusion method allows the subsequent model (or pipeline) to learn cross-modal correlations directly from the original data, but it may become challenging to handle large discrepancies or noise across modalities.
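To make the idea concrete, the following minimal sketch (with assumed image sizes and normalization choices, not taken from any cited system) stacks time-aligned RGB and depth frames into one input tensor before any modality-specific processing:

```python
# Minimal early-fusion sketch: combine aligned RGB and depth frames at the data
# level. Shapes and normalization are illustrative assumptions.
import numpy as np

def early_fuse(rgb_frame: np.ndarray, depth_frame: np.ndarray) -> np.ndarray:
    """Stack an HxWx3 RGB frame and an HxW depth frame into an HxWx4 input."""
    assert rgb_frame.shape[:2] == depth_frame.shape[:2], "frames must be time-aligned and registered"
    rgb = rgb_frame.astype(np.float32) / 255.0                 # normalize pixel range
    depth = depth_frame.astype(np.float32)
    depth = depth[..., np.newaxis] / (depth.max() + 1e-6)      # normalize depth range
    return np.concatenate([rgb, depth], axis=-1)               # single fused input tensor

fused = early_fuse(np.zeros((480, 640, 3), dtype=np.uint8),
                   np.ones((480, 640), dtype=np.float32))
print(fused.shape)  # (480, 640, 4)
```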
Intermediate Fusion (Feature-Level): Features are first extracted independently from each modality, and then these feature representations are fused. This approach balances flexibility and complexity—each modality can be processed separately with tailored feature extraction techniques, and the combined feature space typically captures richer, modality-specific information before final decision-making. Ji et al. (2020) demonstrated that feature-level fusion of vision, depth, and inertial sensors enables reliable perception by capturing information about humans, robots, and the environment in industrial human-robot collaboration (HRC). Schmidt-Rohr et al. (2008a) improved the accuracy of “person of interest” recognition and ensured stable autonomous navigation by converting raw sensory inputs (speech, human activity from RGB-D, and LiDAR) into probability distributions, enhancing dynamic confidence across feature levels. Scicluna et al. (2024) showed that aligning 2D feature bounding boxes from RGB with LiDAR depth features prevented false positive detections from leading to incorrect decisions. Similarly, studies such as Banerjee et al. (2018), Zhao et al. (2024), and Deng et al. (2024) enhanced perception capabilities by employing various fusion strategies, including weighting, concatenation, and heuristic-based algorithms.
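A minimal sketch of feature-level fusion is given below; the per-modality encoders and fusion weights are placeholders assumed for illustration, not drawn from any of the works above:

```python
# Feature-level (intermediate) fusion sketch: each modality is encoded
# separately, then the feature vectors are weighted and concatenated before a
# shared decision head. Encoders here are trivial stand-ins for real models.
import numpy as np

def encode_vision(frame: np.ndarray) -> np.ndarray:
    return frame.mean(axis=(0, 1))                     # stand-in for CNN features

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    return np.array([waveform.mean(), waveform.std()]) # stand-in spectral statistics

def intermediate_fuse(frame: np.ndarray, waveform: np.ndarray,
                      w_vision: float = 0.7, w_audio: float = 0.3) -> np.ndarray:
    f_vision = w_vision * encode_vision(frame)
    f_audio = w_audio * encode_audio(waveform)
    return np.concatenate([f_vision, f_audio])         # joint feature for decision-making

joint_feature = intermediate_fuse(np.random.rand(480, 640, 3), np.random.randn(16000))
print(joint_feature.shape)  # (5,)
```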
Late Fusion (Decision-Level): Late fusion focuses on merging the outputs (i.e., decisions or predictions) from multiple models or classifiers. Each modality is modeled separately, and the final results are combined (e.g., by voting, averaging, or a learned ensemble). For example, Siqueira et al. (2018) made separate predictions with emotion and language models, then evaluated the recognition results using a decision framework to resolve emotional mismatches. Similarly, Granata et al. (2013) merged information extracted from four detectors using weighted criteria based on the field of view, reducing the instability of motion prediction when sensor data was incomplete. Late fusion often offers greater robustness if one modality performs poorly, but it may miss certain cross-modal interactions that arise earlier in the data or feature space.
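The sketch below illustrates decision-level fusion under assumed per-modality confidences (the values are illustrative, not from any cited system): each modality outputs class probabilities, and a confidence-weighted average produces the final decision.

```python
# Late (decision-level) fusion sketch: combine per-modality predictions with a
# confidence-weighted average; confidences are assumed values.
import numpy as np

def late_fuse(per_modality_probs: dict, confidences: dict) -> int:
    total = sum(confidences.values())
    fused = sum((confidences[m] / total) * p for m, p in per_modality_probs.items())
    return int(np.argmax(fused))                       # final class decision

decision = late_fuse(
    {"vision": np.array([0.2, 0.8]), "speech": np.array([0.6, 0.4])},
    {"vision": 0.9, "speech": 0.5},                    # e.g., speech is noisy in this scene
)
print(decision)  # 1: vision dominates because it is weighted more heavily
```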
Hybrid Fusion (Combining Multiple Stages): Hybrid fusion integrates multiple fusion strategies, for instance combining early or intermediate fusion with late fusion. By doing so, it aims to leverage the best of both worlds: capturing cross-modal correlations and ensuring robust final decisions. For example, Sha (2024) performed feature-level fusion by assigning weights to RGB object detection, depth cameras, and ultrasonic sensors, then integrated obstacle detection and path planning module outputs at the decision level. Similarly, Dean-Leon et al. (2017) applied fusion at multiple stages, including the signal level, feature level, and symbolic representation, to enhance collision avoidance, compliance, and grasping strategies. Overall, hybrid fusion approaches aim to leverage the advantages of different fusion stages and integrate them effectively.
3.3.2 Model-based fusion
In this category, the fusion process is driven by a learned model—often nonlinear—such as probabilistic methods, kernel-based methods, neural networks (e.g., CNNs, Transformers), or graph-based models. These approaches learn how to integrate or attend to relevant information across modalities through training, allowing more adaptive and potentially more powerful multimodal representations. For example, Dağlarlı (2020) utilized probabilistic reasoning and attention mechanisms to integrate multimodal perception data, and Zhang et al. (2021) dynamically constructed a partially observable Markov decision process (POMDP) that integrates information from different sensory modalities and actions to compute the optimal policy.
In the domain of neural network approaches, Yu (2021) employed CNNs and GANs for gesture/facial synthesis and a hybrid classifier for emotion recognition. Wang (2023) utilized an attention mechanism to integrate visual and temporal multimodal features, and Yas et al. (2024) extracted RGB and depth data and fused them with skeletal features using a context attention mechanism. Similarly, Al-Qaderi and Rad (2018b) used spiking neural networks (SNNs) to process feature vectors from different modalities, including RGB (FERET), RGB-D (TIDIGITS), and RGB-D (3D body and depth).
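As a minimal, framework-agnostic sketch of how attention can fuse modalities (dimensions and inputs are assumptions, not a reimplementation of the cited systems), the snippet below lets visual tokens attend over audio tokens so that the fused representation emphasizes the most relevant cross-modal features:

```python
# Cross-modal attention sketch: visual tokens query audio tokens; the attended
# audio features are concatenated to the visual ones. All shapes are illustrative.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual_tokens: np.ndarray,   # shape (Nv, d)
                          audio_tokens: np.ndarray     # shape (Na, d)
                          ) -> np.ndarray:
    d = visual_tokens.shape[-1]
    scores = visual_tokens @ audio_tokens.T / np.sqrt(d)    # (Nv, Na) similarity
    weights = softmax(scores, axis=-1)                      # attention over audio tokens
    attended_audio = weights @ audio_tokens                 # (Nv, d) audio summary per visual token
    return np.concatenate([visual_tokens, attended_audio], axis=-1)  # (Nv, 2d) fused

fused = cross_modal_attention(np.random.randn(5, 16), np.random.randn(8, 16))
print(fused.shape)  # (5, 32)
```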
Other learning-based approaches include reinforcement learning and graph-based methods. Cuayáhuitl (2020) applied deep reinforcement learning (DQN) to fuse visual perception and speech interaction, and Ferreira et al. (2012) employed a graph-based Bayesian hierarchical model to fuse visual and auditory perception. Similarly, Ivaldi et al. (2013) integrated vision, audition, and proprioception using graph-based incremental learning and sensorimotor loops.
More recently, large language models (LLMs) have been leveraged for multimodal fusion. Menezes (2024) leveraged the Generative Image-to-text Transformer (GIT) and GPT-4 for cross-modal alignment of visual, textual, and auditory inputs, Forlini et al. (2024) utilized GPT-4V for feature extraction and decision-making based on visual and contextual inputs, and Ly et al. (2024) employed an LLM-based planner to generate action sequences by integrating recognition results with motion feasibility.
In summary, multimodal data fusion can be achieved through various strategies—from a simple early fusion of raw data to advanced hybrid and model-based methods that dynamically learn cross-modal interactions. Table 2 summarizes the main fusion strategies discussed above, along with their key advantages and limitations, to provide a concise reference for readers. These approaches provide the foundation for robust and context-rich perception in HRI. In the next section, we explore how these fused representations are integrated into decision-making architectures, enabling robots to leverage the full potential of multimodal inputs for more intelligent and adaptive behavior.
3.4 Integration architectures for multimodal perception and decision-making
Multimodal perception can be integrated into decision-making processes for HRI through various architectural frameworks, ranging from conventional linear pipelines to more advanced adaptive models incorporating feedback loops and end-to-end learning. The selection of an appropriate architecture is contingent on multiple factors, including real-time processing constraints, system complexity, and the degree of adaptability required for a given task. For instance, simple feedforward pipelines may be suitable for low-latency applications (Vauf et al., 2016). In contrast, end-to-end frameworks or feedback architectures are often preferred in dynamic and uncertain environments where continuous adaptation is necessary (Mei et al., 2024; Forlini et al., 2024). Therefore, in this section, we analyze the rationale behind the selection of each architectural approach by synthesizing insights from selected papers and empirical findings. We discuss how different architectures align with specific HRI tasks and the trade-offs they present in performance and adaptability. In multimodal perception-driven decision systems, both academia and industry commonly use five types of high-level architectures to integrate information and execute action decisions for robotics: pipeline architecture, feedback-loop architecture, modular architecture, end-to-end architecture, and hybrid architecture. The first four basic types are shown in Figure 2, which illustrates the workflow of each approach.

Figure 2. Four basic integration architectures for multimodal perception and decision-making in human-robot interaction. (a) Pipeline architecture. (b) Feedback architecture. (c) Modular architecture. (d) End to end architecture.
3.4.1 Pipeline architecture
The pipeline processing architecture allows multiple modalities to be processed simultaneously, reducing processing latency. By handling different sensory inputs in parallel and integrating them through a coordination layer, the system can output multimodal results in real-time, feeding them directly into the planning and decision-making module, as illustrated in Figure 2 (see Subfigure a). This architecture is particularly advantageous in real-time interactive scenarios, where synchronized multimodal processing ensures fast and adaptive responses. Vauf et al. (2016) implemented a multi-channel parallel processing architecture to detect whether a person intends to initiate interaction with the robot, which enables social companion robots to respond to human behavior more naturally in real-time. They enabled the robot to acquire and process multiple sensory inputs in parallel, integrating data from different modalities: a laser scanner with a 270° field of view, updated every 80 ms (12.5 Hz) and mounted at the base of the Kompaï robot (capturing spatial position and distance); a Kinect sensor providing RGB video (30 Hz) for skeleton tracking and facial detection and depth images (30 Hz) for enhanced skeleton tracking; and a microphone array (8 Hz) for sound source localization and voice activity detection. All features were temporally aligned using an 80 ms (12.5 Hz) baseline, with data from different modalities fused via temporal synchronization and feature concatenation. The unified multimodal representation was then fed into a classifier (e.g., an SVM or neural network) to recognize interaction intent. They claimed this approach allows the robot to robustly and efficiently infer user engagement, ensuring real-time, natural, and adaptive responses in HRI scenarios.
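A compact sketch of this pattern follows; the sensor functions, rates, and decision rule are hypothetical placeholders for the parallel processing, temporal alignment, and classification stages described above:

```python
# Pipeline-architecture sketch: process modality streams in parallel, align and
# concatenate their features, and pass the result straight to a decision module.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def laser_features(t: float) -> np.ndarray:   # stand-in for spatial/distance cues
    return np.array([0.4, 0.5])

def kinect_features(t: float) -> np.ndarray:  # stand-in for skeleton/face cues
    return np.array([np.sin(t), np.cos(t)])

def audio_features(t: float) -> np.ndarray:   # stand-in for voice-activity cues
    return np.array([0.1])

def decide(feature_vector: np.ndarray) -> str:
    # Placeholder for a trained classifier (e.g., an SVM) on the fused representation.
    return "engage" if feature_vector.sum() > 1.0 else "idle"

def pipeline_step(t: float) -> str:
    with ThreadPoolExecutor() as pool:        # handle modalities in parallel
        futures = [pool.submit(f, t) for f in (laser_features, kinect_features, audio_features)]
        fused = np.concatenate([f.result() for f in futures])  # coordination layer: align + concatenate
    return decide(fused)

print(pipeline_step(0.08))  # -> "engage"
```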
3.4.2 Feedback-loop architecture
As illustrated in Figure 2 (see Subfigure b), this architecture incorporates feedback mechanisms where outputs from later stages (e.g., decision-making) influence earlier stages (e.g., perception or sensing). This approach enables adaptive and context-aware behavior, improving robustness by allowing the system to refine its perception based on decision outcomes. Such an architecture is particularly well-suited for real-time multimodal perception scenarios in complex human-robot interaction. For example, social robots continuously perceive user emotions and dynamically adjust their dialogue strategies (Ch et al., 2022), and collaborative robots refine grasping operations in real-time using force and visual feedback (Ji et al., 2020). Zhang X. et al. (2023) introduce a pipeline for robotic interaction and perception—the Multimodal Embodied Attribute Learning (MEAL) framework. MEAL enables robots to perceive object attributes—such as color, weight, and empty—through sequential multimodal exploratory behaviors (e.g., observing, lifting, and shaking). The framework is built on a Partially Observable Markov Decision Process (POMDP) for object attribute recognition, structured as follows: (1) Action Selection: the robot selects the next exploratory action (e.g., look, grasp, shake) based on the current environmental state and its belief about the object’s attributes. (2) Information Acquisition: the robot executes the chosen action and collects new sensory data across multiple modalities (e.g., visual, auditory, and tactile features). (3) Belief Update: the system integrates new observations and user feedback (e.g., confirming or correcting attribute recognition) to update the POMDP belief state; in ONLINE-MEAL scenarios, newly collected data (features and labels) are also added to the training set to improve future perception models. (4) Decision Re-evaluation: after updating its belief or model, the system reassesses whether further exploration is needed or whether it can confidently report results, forming a closed-loop perception-decision-feedback cycle. Similarly, Zhang Y. et al. (2024) employ multi-source perception (RGB-D, QR codes, wheel odometry, etc.) to obtain the robot’s current state and detect obstacles ahead. The system continuously feeds obstacle location and distance information to the “Safe Manipulation” module in real time. Based on the relative distance and direction between the obstacle and the robot, this module dynamically adjusts the robot’s speed or triggers braking to ensure safe operation.
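The closed perception-decision-feedback loop described above can be sketched as follows; the candidate attributes, observation accuracies, and action-selection rule are assumptions for illustration, not values from the MEAL implementation:

```python
# Closed-loop sketch: select an exploratory action, acquire a noisy observation,
# update the belief, and re-evaluate until the belief is confident enough to report.
import numpy as np

P_CORRECT = {"look": 0.7, "shake": 0.85}      # assumed observation accuracies per action
rng = np.random.default_rng(0)
true_attribute = 1                            # hidden ground truth among 3 candidates
belief = np.ones(3) / 3                       # uniform initial belief

def update_belief(belief: np.ndarray, observed: int, accuracy: float) -> np.ndarray:
    likelihood = np.full_like(belief, (1 - accuracy) / (len(belief) - 1))
    likelihood[observed] = accuracy
    posterior = likelihood * belief
    return posterior / posterior.sum()

for _ in range(50):                           # decision re-evaluation loop (capped)
    if belief.max() >= 0.9:                   # confident enough to report
        break
    action = "look" if belief.max() < 0.6 else "shake"   # cheap action first
    accuracy = P_CORRECT[action]
    observed = true_attribute if rng.random() < accuracy else int(rng.integers(3))
    belief = update_belief(belief, observed, accuracy)    # feedback refines perception

print("report attribute", int(belief.argmax()), "confidence", round(float(belief.max()), 2))
```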
3.4.3 Modular architecture
The system is divided into independent, self-contained modules, each responsible for a specific function (e.g., sensing, perception, decision-making). This architecture is well-suited for scalable robotic systems, particularly industrial collaborative robots. It consists of independent modules, such as a vision detection module (for object localization), a force/tactile sensing module (to ensure safe interaction with humans or objects), a motion planning module (for generating robotic arm trajectories), and a high-level decision-making module (for task allocation and anomaly handling). Each module can be individually upgraded or replaced without affecting the overall system framework. A typical implementation involves clear interfaces or communication protocols between modules, such as topics or services in ROS (Robot Operating System). For instance, Khandelwal et al. (2017) developed a modular and hierarchical general-purpose platform that integrates various independent modules, including mapping, robot actions, task planning, navigation, perception, and multi-robot coordination. Each module has a well-defined function and can be replaced without affecting the overall system. For example, the perception module uses a Kinect camera for human and object detection and LiDAR for environment perception. During real-world operation, the robot continuously perceives its surroundings, updating its knowledge state (via knowledge representation and reasoning nodes) and executing actions based on high-level planning. The key advantages of this architecture include ease of maintenance, upgradability, and flexibility for expanding to more complex tasks. However, a potential drawback is the added system overhead due to module coordination.
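As a rough illustration of module decoupling through named interfaces (a lightweight stand-in for ROS topics; the module internals and topic names are hypothetical), consider the following sketch:

```python
# Modular-architecture sketch: modules only communicate via named topics, so any
# module can be swapped without touching the others.
from collections import defaultdict
from typing import Callable

class Bus:
    """Minimal publish/subscribe bus, a stand-in for ROS topics/services."""
    def __init__(self) -> None:
        self.subscribers = defaultdict(list)
    def subscribe(self, topic: str, callback: Callable) -> None:
        self.subscribers[topic].append(callback)
    def publish(self, topic: str, message) -> None:
        for callback in self.subscribers[topic]:
            callback(message)

bus = Bus()

def perception_module(frame) -> None:
    # Placeholder detector; could be replaced by any other perception module.
    bus.publish("detections", {"person_at": (1.0, 2.0)})

def planning_module(detections) -> None:
    bus.publish("nav_goal", detections["person_at"])     # turn a detection into a goal

bus.subscribe("detections", planning_module)
bus.subscribe("nav_goal", lambda goal: print("navigate to", goal))
perception_module(frame=None)                            # -> navigate to (1.0, 2.0)
```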
3.4.4 End-to-end architecture
End-to-end methods typically involve designing a unified neural network that directly maps sensor inputs to decision-making or action, without requiring manually engineered intermediate steps. Ly et al. (2024) introduced an “end-to-end” high-level task planning architecture, which processes natural language instructions from humans and integrates visual perception and action feasibility verification. The system combines user input (U), visual observations (O), and feasibility scores (F) into a multimodal context, which is then fed into a fine-tuned Mistral 7B model to automatically generate and execute robot skill sequences (e.g., pick object, move to location, place object), directly mapping to atomic operations from the robot’s existing skill library. Additionally, the framework incorporates failure recovery and a human-in-the-loop mechanism. If the visual perception or feasibility detection module fails, the LLM prompts the user for guidance on handling the failure. The user can then provide new descriptions, suggest alternative objects, or manually reposition objects. The LLM subsequently generates a revised action sequence to ensure task completion. The advantage of this architecture is that it eliminates the need for complex feature engineering and pipeline construction. However, its drawbacks include lower interpretability and higher requirements for hardware and algorithms.
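A hedged sketch of this pattern is shown below. The `query_llm` function and the skill names are placeholders (no real model API is called); the point is how instruction, observations, and feasibility scores are packed into one prompt that yields an executable skill sequence:

```python
# End-to-end planning sketch: user instruction (U), observations (O), and
# feasibility scores (F) form one multimodal prompt; the model returns a skill
# sequence drawn from the robot's skill library. query_llm is a placeholder.
import json

def query_llm(prompt: str) -> str:
    # Placeholder for a fine-tuned language-model call (e.g., a local LLM endpoint).
    return json.dumps([{"skill": "pick", "object": "cup"},
                       {"skill": "place", "location": "table"}])

def plan(user_instruction: str, observations: dict, feasibility: dict) -> list:
    prompt = (f"Instruction: {user_instruction}\n"
              f"Observations: {json.dumps(observations)}\n"
              f"Feasibility: {json.dumps(feasibility)}\n"
              "Return a JSON list of skills from the robot's skill library.")
    return json.loads(query_llm(prompt))

steps = plan("put the cup on the table",
             {"objects": ["cup", "bottle"]},
             {"cup": 0.92, "bottle": 0.40})
print(steps)
```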
3.4.5 Hybrid architecture
The hybrid architecture combines elements from multiple architectural paradigms to leverage their respective strengths. For example, it may incorporate aspects of parallel processing, pipeline structures, and feedback loops. Effectively integrating these mechanisms at different stages or levels can balance real-time performance and flexibility. One example is the brain-inspired multimodal perception system proposed by Al-Qaderi and Rad (2018a), which follows a hybrid architecture approach: the system processes different modalities—vision (RGB and depth cameras) and audio (microphones)—in parallel through Dedicated Processing Units (DPUs), each responsible for specialized feature extraction and classification. Within each modality, information flows through a fixed-sequence pipeline, such as facial detection, skeletal tracking, feature extraction, and classification. Moreover, at a higher level, the system integrates spiking neural networks (SNNs) for temporal binding and top-down influences (e.g., using QR codes to infer possible identities), enabling a feedback loop to refine lower-level processing dynamically. This approach allows the system to handle multi-source inputs in parallel, while selectively activating or constraining lower layers based on intermediate recognition results. By doing so, it reduces unnecessary computation and improves response speed. Such a hybrid framework has demonstrated high adaptability and flexibility in dynamic HRI scenarios.
3.5 Decision-making methodologies
In the previous subsection, we examined how multimodal perception provides rich context for decision-making by integrating vision, audio, tactile, and other sensory inputs. However, perception alone does not complete the cycle: once the environment is understood, the robot must decide how to act in response (see Section 3.4 for integration architectures). Decision-making is a fundamental capability in intelligent systems, enabling robots and AI agents to infer contextual information by perceiving the environment and then generate appropriate actions. For MPDDM, the system integrates information from multiple sensory modalities—such as vision, audio, touch, and language—to enhance robustness and adaptability in dynamic environments. With the rapid advancement of deep learning, symbolic reasoning, and probabilistic models, decision-making methods have shifted toward more adaptive, learning-based paradigms. Table 3 summarizes the decision-making methods used by MPDDM systems. However, selecting the appropriate decision-making framework depends on the goal of the task, the level of uncertainty in the environment, and the training data. In this subsection, we summarize the decision-making methodologies that have been studied in HRI from seven perspectives. Under each perspective, we present each method’s distinct strengths and trade-offs in handling environmental uncertainty, data requirements, and the complexity of collaboration:
3.5.1 Learning-based paradigm
Learning-based decision-making treats the robot (or agent) as a system that acquires policies or value functions from data. Common examples in HRI include supervised learning approaches (e.g., classification, regression), reinforcement learning (RL) for interactive tasks, and imitation learning from human demonstrations. For example, a robot can learn to map sensor inputs to discrete actions (e.g., “stop,” “go,” “turn”) based on labeled training sets (Pequeño-Zurro et al., 2022). Similarly, in a collaborative assembly scenario, the robot explores different action strategies, receiving reward signals based on successful or failed assembly interactions (Chen et al., 2025). Alternatively, the robot can observe a human performing a skill and imitate it, adapting its behavior accordingly. For example, Churamani et al. (2020) designed an emotion-driven human-robot interaction system using neural network fusion. Their MCCNN (Multi-Channel Convolutional Neural Network) model consists of two independent channels for facial expression recognition and speech emotion recognition, which are then combined into a unified emotional representation. Reinforcement learning (RL) is then employed to train the robot on negotiation strategies in the ultimatum game. Similarly, Lu et al. (2023) proposed a vision-language interactive grasping robot, leveraging a transformer-based cross-modal attention mechanism. This system integrates vision, text-based representations, and point cloud processing to enable precise object localization and interactive grasping. Likewise, Al-Qaderi and Rad (2018b) utilized network-based fusion via a spiking neural network (SNN) to process feature vectors from multiple modalities. This approach enhances multimodal perception for social robots, enabling dynamic and reliable human recognition by selecting the most robust identification method based on the available sensory data. The core advantages of learning-based decision-making include adaptability to new tasks and improvement with more data. However, a key drawback is the potentially large data requirement.
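For intuition, the toy tabular Q-learning sketch below shows the reward-driven update at the heart of RL-based decision-making; the states, actions, and environment are illustrative inventions, not any cited system:

```python
# Tabular Q-learning sketch: learn a state->action policy from reward feedback.
import numpy as np

n_states, n_actions = 4, 3                  # e.g., coarse interaction states x {stop, go, turn}
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(state: int, action: int):
    # Toy environment: action 1 in the last state yields reward.
    reward = 1.0 if (state == n_states - 1 and action == 1) else 0.0
    return (state + 1) % n_states, reward

state = 0
for _ in range(2000):
    action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Standard Q-learning temporal-difference update.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.argmax(axis=1))                     # learned greedy action per state
```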
3.5.2 Problem formulation based methods
This approach primarily abstracts decision-making as a mathematical model, such as Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), or game theory. Each formulation represents the agent’s states, actions, rewards, and uncertainties. For example, Amiri et al. (2018) employ MOMDPs to integrate multimodal data streams, modeling fully and partially observable state variables with updates from multimodal sensory feedback to optimize future decisions. Amiri et al. (2020) focus on learning and reasoning for robot sequential decision-making under uncertainty, using a POMDP planner that leverages sensor data and contextual knowledge as priors to determine the optimal action for proactive HRI. Similarly, Zhang et al. (2021) apply a dynamically constructed POMDP to fuse information from different sensory modalities and actions to compute the best policy. The decision-making process is driven by the POMDP framework, which refines the robot’s belief state using multimodal sensory inputs and determines the optimal course of action. This approach’s advantage lies in its clear mathematical framework for problem representation. However, a key drawback is that large state spaces often lead to high computational complexity.
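To illustrate the fully observable case, the sketch below runs value iteration on a toy MDP (the transition and reward tables are invented for illustration); POMDP solvers extend the same idea to belief states:

```python
# Value-iteration sketch for a toy MDP: given P(s'|s,a) and R(s,a), compute the
# optimal value function V and the greedy policy.
import numpy as np

n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))       # P[s, a, s'] transition probabilities
P[0, 0, 1] = 1.0
P[0, 1, 2] = 1.0
P[1, :, 2] = 1.0
P[2, :, 2] = 1.0                                    # state 2 is absorbing
R = np.array([[0.0, -1.0],                          # R[s, a] immediate rewards
              [1.0,  1.0],
              [0.0,  0.0]])
gamma = 0.95

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)                         # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)
print(np.round(V, 3), policy)                       # optimal values and greedy policy
```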
3.5.3 Symbolic/logic-based approaches
This approach relies on symbolic representations (rules, logic programs, knowledge bases) to plan or reason about actions. Examples include rule-based expert systems, automated planning such as STRIPS or HTN, and knowledge representation with Answer Set Programming (ASP) or Prolog. For example, Diab and Demiris (2024) designed an assistive HRI robot for daily life scenarios (kitchen tasks), integrating knowledge-based reasoning to support human-robot collaboration. The system utilizes object detection, spatial awareness, and environmental state assessment (e.g., detecting clutter) through a ROS-integrated version of YOLO trained on the COCO dataset. To enable model-based fusion, the framework constructs a Knowledge Graph (KG) that integrates semantic labels, relationships, and properties derived from multimodal data. By combining object detection and task goal identification, the robot understands the scene context and dynamically adapts its actions to align with human preferences, ensuring more intuitive and effective collaboration. This decision-making approach has the advantages of high-level interpretability, accurate capture of domain knowledge, and logically rigorous reasoning. The drawbacks include poor robustness to noise and potential fragility if the rules are incomplete.
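A toy rule-based sketch follows; the facts and rules are invented placeholders meant only to show how symbolic conditions derived from multimodal perception can drive an action choice:

```python
# Rule-based decision sketch: symbolic facts from perception are matched against
# ordered rules; the first satisfied rule determines the action.
facts = {"clutter_on_counter": True, "goal": "prepare_tea", "kettle_visible": True}

rules = [
    (lambda f: f["clutter_on_counter"], "clear_counter"),
    (lambda f: f["goal"] == "prepare_tea" and f["kettle_visible"], "grasp_kettle"),
    (lambda f: True, "ask_user"),                    # fallback rule
]

def decide(facts: dict) -> str:
    for condition, action in rules:
        if condition(facts):
            return action
    return "idle"

print(decide(facts))  # -> "clear_counter": the clutter rule fires first
```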
3.5.4 Probabilistic methods
This approach primarily uses Bayesian networks, HMMs, factor graphs, or other PGM-based methods to model stochastic processes, particularly uncertainty in human states or the environment. The system continuously updates probabilities as new observations arrive. For example, Dağlarlı (2020) employs Bayesian networks for cognitive perception, enabling robots to interact with humans and navigate dynamic environments as personal assistants. Similarly, Zhou and Wachs (2019) utilize HMMs to integrate EEG, EMG, body posture, and acoustic features, allowing early intent recognition for predictive decision-making. Likewise, Aly (2014) applies CHMM-driven decision-making to synthesize synchronized gestures and prosody for naturalistic robot behavior. The key advantage of this approach is its principled handling of uncertainty, while the main drawback is the potentially high computational cost for large state spaces.
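The discrete Bayesian filter below (equivalent to one HMM forward step, with toy transition and observation matrices) illustrates the probability update these methods perform as new multimodal observations arrive:

```python
# Bayesian-filter sketch (discrete HMM forward step): propagate the belief over
# hidden intent states and reweight it by each observation's likelihood.
import numpy as np

T = np.array([[0.8, 0.2],          # transition model over intents {reach, rest}
              [0.3, 0.7]])
E = np.array([[0.9, 0.1],          # observation likelihoods per intent
              [0.2, 0.8]])         # columns: observation in {arm_moving, arm_still}

def forward_step(belief: np.ndarray, observation: int) -> np.ndarray:
    predicted = T.T @ belief                   # predict with the transition model
    updated = E[:, observation] * predicted    # weight by the observation likelihood
    return updated / updated.sum()             # normalize to a posterior

belief = np.array([0.5, 0.5])
for obs in (0, 0, 1):                          # arm_moving, arm_moving, arm_still
    belief = forward_step(belief, obs)
print(np.round(belief, 3))                     # posterior over {reach, rest}
```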
3.5.5 Search-based planning
Search-based planning algorithms (e.g., A*, D*, MCTS) compute a plan or policy by searching the state or action space. For example, Khandelwal et al. (2017) combine probabilistic reasoning and planning (CORPP) to infer missing or ambiguous information, thereby reducing errors in understanding and navigation. The advantage of this approach is efficient exploration and planning in structured environments, while its main drawback is that it does not handle partial observability or complex uncertainty as effectively as POMDPs.
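As a minimal example of this family, the sketch below runs A* with a Manhattan heuristic on a toy occupancy grid (the map and unit step costs are illustrative):

```python
# A* sketch on a 4-connected grid: search the state space for a shortest
# obstacle-free path from start to goal.
import heapq

def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]                # (priority, cost, node, path)
    visited = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                heapq.heappush(frontier, (cost + 1 + h((r, c)), cost + 1, (r, c), path + [(r, c)]))
    return None                                               # no path found

grid = [[0, 0, 0],      # 0 = free, 1 = obstacle
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))
```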
3.5.6 Generative AI decision-making
Generative AI-based decision-making leverages LLMs or VLMs to generate or refine robot actions. The system can dynamically generate responses and action plans by querying a pretrained GPT-like model with a prompt such as “Given the environment state, what is the next best action?” The MuModaR framework of Menezes (2024) integrates multimodal inputs using GIT and GPT-4 for cross-modal alignment of visual, textual, and auditory inputs, enabling real-time feedback-driven decision-making. Ly et al. (2024) employ an LLM-based planner to integrate recognition results with motion feasibility, allowing a mobile manipulator (Toyota HSR) to generate action sequences based on user commands. Forlini et al. (2024) use GPT-4V to process visual and context-aware inputs, enabling accurate identification of components and their assembly states. Song D. et al. (2024) developed a socially aware robot navigation system leveraging a VLM-based approach (GPT-4V) to allow adaptive and socially aware decision-making (e.g., recognizing a stop gesture). The advantages of this approach include great flexibility and potential zero/few-shot learning capabilities. While large language models are often perceived as less interpretable than classical rule-based systems, recent techniques such as chain-of-thought prompting (Lu et al., 2024) can improve their transparency and reasoning interpretability. Furthermore, although these models can pose computational challenges in latency-sensitive scenarios, real-time constraints may not be critical in many offline decision-making contexts.
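A hedged sketch of prompt-based action selection is shown below; `call_model` and the action names are hypothetical placeholders, and the key point is validating the model's reply against a known action set before execution:

```python
# Prompt-based decision sketch: serialize the fused perception state into a
# prompt, query a generative model, and validate the reply before acting.
VALID_ACTIONS = {"approach_user", "wait", "hand_over_object", "ask_clarification"}

def call_model(prompt: str) -> str:
    # Placeholder for a GPT/VLM query; no real API is invoked here.
    return "hand_over_object"

def next_action(state: dict) -> str:
    prompt = (f"Environment state: {state}. "
              f"Choose exactly one action from {sorted(VALID_ACTIONS)}.")
    reply = call_model(prompt).strip()
    return reply if reply in VALID_ACTIONS else "ask_clarification"   # safe fallback

print(next_action({"user_gesture": "open_palm", "object_in_gripper": True}))
```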
3.5.7 Hybrid approaches
Hybrid approaches combine two or more of the methods above—for instance, using symbolic rules with a deep RL agent, or combining MDP planning with an LLM to handle high-level language instructions. This strategy can exploit complementary strengths, e.g., robust uncertainty handling with interpretability. For example, Zhang X. et al. (2024) developed a robot planning system for open-world environments, leveraging a pre-trained VLM and the Planning Domain Definition Language (PDDL) to generate actions. When an action fails (e.g., grasping a cup fails or an object drops), the system updates the world state and replans accordingly. However, such integration can be complex.
4 Discussion
Building on the previous sections, we established the necessity of multimodality, highlighted the advantages of multimodal perception in dynamic environments, and summarized how multiple modalities can be fused and integrated into subsequent decision-making processes. We also systematically reviewed the major decision-making methodologies for MPDDM commonly employed in HRI. This section revisits these findings from a broader perspective, focusing on the current challenges and limitations of existing systems and research efforts. Specifically, we examine four key areas: technical integration and sensor noise, domain generalization, adaptation, and safety and robustness. Finally, based on these observations, we propose several future research directions that could guide subsequent investigations and applications in this evolving field.
4.1 Current challenges and limitations
4.1.1 Technical integration and sensor noise
In multimodal perception-driven decision-making (MPDDM) systems, the technical integration of sensors and computational modules remains a significant challenge. First, the need to fuse and align data from multiple modalities (e.g., vision, LiDAR, audio) can introduce high computational complexity—the system must handle large-scale data (see (Zhang X. et al., 2023; Al-Qaderi and Rad, 2018a; Zhang Y. et al., 2024; Vauf et al., 2016; Granata et al., 2012; Tang et al., 2015; Sha, 2024; Ji et al., 2020; Wang, 2023; Belcamino et al., 2024)) while ensuring real-time performance. Studies indicate that unimodal perception (using only LiDAR or RGB cameras) is less effective in socially rich or dynamic HRI scenarios (Panigrahi et al., 2023; Zhang Y. et al., 2024; Amiri et al., 2020), and highlight the necessity of incorporating multiple sensors (Cai et al., 2024) for robust situational awareness. However, integrating multiple modalities demands careful calibration, time-stamping, and data synchronization (Sha, 2024; Forlini et al., 2024).
A second challenge relates to computational complexity and real-time constraints. As multiple modalities scale, so do the demands on both memory and processing power, especially when advanced deep learning is used for sensor fusion (Dağlarlı, 2020; Amiri et al., 2018; Granata et al., 2012). This inevitably leads to a significant issue—the need to sacrifice some degree of precision. This trade-off is one of the key reasons why some studies argue that it is difficult to achieve an accurate representation of the real world. For instance, systems that combine raw images, depth maps, and social signals (e.g., speech or gesture data) can overwhelm onboard hardware if not carefully designed (Sha, 2024; Forlini et al., 2024; Ferreira et al., 2012). Consequently, many works struggle to maintain low computational overhead while preserving runtime flexibility and robust decision-making (Forlini et al., 2024; Khandelwal et al., 2017; Zhang et al., 2021; Li et al., 2021). In addition, some works claimed that handling partial observability (e.g., in a mixed-observability MDP) further intensifies the complexity (Amiri et al., 2018; Ji et al., 2020).
Finally, sensor noise and environmental uncertainties remain a pervasive obstacle to reliable MPDDM. Vision modules may suffer inaccuracies from changing illumination or strong reflections, and LiDAR scans can be corrupted by cluttered or reflective surfaces (Zhang Y. et al., 2024; Sha, 2024; Scicluna et al., 2024). Tactile or audio channels can face similar distortions when interacting closely with humans (e.g., voice overlapping in a crowded environment (Zhang Z. et al., 2023; Qin et al., 2023), or haptic signals drowned by mechanical vibration (Forlini et al., 2024)). While some systems attempt to incorporate uncertainty modeling or real-time sensor re-calibration (Li et al., 2021; Al-Qaderi and Rad, 2018b), guaranteeing seamless operation in the presence of sensor noise and incomplete data still remains an open technical challenge.
4.1.2 Domain generalization
Another critical issue for MPDDM in HRI is domain generalization, i.e., whether a trained or engineered system can maintain effectiveness when deployed in new tasks or different application contexts (Zhang X. et al., 2023; Wang, 2023; Baptista et al., 2024; Zhou and Wachs, 2019). For example, in personal-assistive robots, user demographics and cultural factors significantly affect language or gesture recognition modules (Aly, 2014). Systems that are meticulously tuned to one environment or set of objects often fail to generalize in a new industrial or social setting (Tang et al., 2015; Amiri et al., 2018), leading to increased development costs each time the context changes.
4.1.3 Adaptation
Adaptation is critical in HRI. Despite promising methods for continual learning, many HRI systems still exhibit limited adaptation to changes in user needs or environmental conditions. Trained models may fail when confronted with new geometries, lighting setups, or human behaviors (Amiri et al., 2020; Baptista et al., 2024; Churamani et al., 2020; Yuan et al., 2024). Beyond physical changes, the social nature of HRI demands that systems also account for shifting user preferences, habits, cultural norms, and collaborative task requirements, which can evolve over time (Zhang Y. et al., 2024; Granata et al., 2012). Therefore, models optimized for a single environment, a short deployment, or one user profile often become inadequate once real-world conditions diverge from those observed during training. Addressing this challenge calls for robust continual learning strategies that can fuse on-the-fly sensor data with real-time learning and inference while preserving previously acquired knowledge. Developing such flexible adaptation mechanisms remains a key research direction and challenge.
4.1.4 Safety and robustness
Finally, ensuring safety and robustness in real-world HRI scenarios is paramount. Many MPDDM systems must handle close-range human interaction, often in dynamic, unpredictable environments (Menezes, 2024; Granata et al., 2012; Yas et al., 2024). Sensing inaccuracies (e.g., uncertain human motion trajectories or ambiguous gestures) amplify the difficulty of guaranteeing safe robot operation (Tang et al., 2015), particularly when the robot must execute complex manipulation or navigation tasks (Forlini et al., 2024). Although advanced approaches leverage multi-layer sensor fusion and failover mechanisms, long-term deployment can still face drift and sensor misalignment (Menezes, 2024; Granata et al., 2012), leading to cumulative errors over time.
Moreover, real-time failure recovery is frequently overlooked. Some strategies perform global re-planning upon any anomaly, but this can be computationally expensive or slow (Ly et al., 2024). Other works rely on users to intervene manually. The challenge is thus twofold: designing motion-level re-planning or fallback strategies without incurring excessive latency, and building a dialogue or feedback loop allowing humans to provide corrective input (Menezes, 2024; Ly et al., 2024). All these works aim to create a system that not only meets the real-time demands of dynamic HRI but also maintains robust performance and safe collaborative interactions.
4.2 Future research directions
4.2.1 Advancing learning-based approaches
Future work should pursue more efficient learning paradigms—ranging from generative AI to reinforcement learning (RL)—to cope with multimodal perception in dynamic HRI. Several works point to semi-supervised or weakly supervised techniques that reduce reliance on extensive labeled data, thus lowering the cost of large-scale multimodal data curation (Cai et al., 2024; Panigrahi et al., 2023; Zhang X. et al., 2023; Sha, 2024). Furthermore, robust generative models could help unify multiple input streams (e.g., vision, audio, haptics) while automatically aligning them in latent representations (Dağlarlı, 2020; Menezes, 2024; Vaufreydaz et al., 2016). Equally important is the push to incorporate advanced attention mechanisms and semantic reasoning to better capture cross-modal signals (Dağlarlı, 2020; Ji et al., 2020; Yasar et al., 2024; Aly, 2014), leading to more context-aware and “cognitive” HRI systems. In parallel, scaling up sensor coverage while keeping memory overhead tractable remains a challenge that future work must address by optimizing sensor fusion and feature extraction (Vaufreydaz et al., 2016; Belcamino et al., 2024; Scicluna et al., 2024).
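As one concrete instance of the cross-modal attention mechanisms mentioned above, the sketch below lets visual region features attend over language token embeddings using standard multi-head attention (queries from one modality, keys and values from the other). The dimensions, module names, and the use of PyTorch are assumptions made for illustration, not a specific architecture from the cited works.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Vision features attend over language tokens (illustrative sketch)."""

    def __init__(self, vis_dim=256, lang_dim=512, d_model=256, n_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.lang_proj = nn.Linear(lang_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_feats, lang_feats):
        # vis_feats:  (B, N_regions, vis_dim)   e.g., detected object features
        # lang_feats: (B, N_tokens,  lang_dim)  e.g., instruction embeddings
        q = self.vis_proj(vis_feats)
        kv = self.lang_proj(lang_feats)
        fused, attn_weights = self.attn(q, kv, kv)
        return self.norm(fused + q), attn_weights  # residual + layer norm

# Usage with random tensors standing in for real encoder outputs.
model = CrossModalAttentionFusion()
vis = torch.randn(2, 10, 256)
lang = torch.randn(2, 16, 512)
out, weights = model(vis, lang)
print(out.shape, weights.shape)  # (2, 10, 256), (2, 10, 16)
```

The attention weights returned here also expose which instruction tokens influenced each visual region, which is one route toward the interpretability discussed in the next subsection.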
4.2.2 Improving explainability and human trust
Ensuring that AI-driven decisions are interpretable is vital for fostering user acceptance and trust in HRI (Mathur et al., 2025). Although many multimodal models yield high accuracy, they often behave as “black boxes,” making it unclear why a system chooses a particular action or how it handles ambiguous inputs (Tang et al., 2015; Qin et al., 2023). Work in (Churamani et al., 2022; Li et al., 2021; Zhang Z. et al., 2023) highlights the importance of building affective and dialogue-based interactions with the user, so the robot can clarify uncertainties or explain the reasoning behind its decisions. Additionally, to promote safer collaboration, Ly et al. (2024) and Lu et al. (2023) propose integrated motion-level re-planning frameworks that combine transparency about failure causes (e.g., “object not in view”) with real-time user feedback. Future research might incorporate social cues such as facial expressions, body posture, or personal preferences (Vaufreydaz et al., 2016; Churamani et al., 2020; Al-Qaderi and Rad, 2018b; Ferreira et al., 2012; Yuan et al., 2024), improving the system’s ability to provide human-readable justifications and adapt its behavior accordingly. Ultimately, bridging model decisions and intuitive explanations can foster deeper user trust in situations that demand joint decision-making.
4.2.3 Scalable multi-robot collaboration
Another promising direction concerns scalable multi-robot systems, where tasks span collaborative assembly, multi-robot coordination, or large-scale monitoring (Khandelwal et al., 2017; Zhang et al., 2021). While single-robot multimodal perception has progressed substantially, simultaneously coordinating multiple robots under uncertain or partially observable conditions remains underexplored. Key open questions center on robust joint perception—sharing or transferring learned policies, sensorimotor features, and knowledge across heterogeneous platforms (Zhang et al., 2021). In parallel, the complexities of real-world scheduling, path planning, and dynamic role assignment amplify in multi-robot teams, as partial failures in one platform can cascade. Interweaving user interactions—e.g., a human operator or supervisor who provides on-demand clarifications—poses further integration challenges (Khandelwal et al., 2017). Addressing these issues could enable more flexible, self-organized teams of robots that better adapt to large-scale tasks and diverse users.
4.2.4 Long-term autonomy and continual learning
Finally, long-term autonomy in dynamic human environments demands that a robot continuously refine its models and maintain stable performance over lengthy deployments (Zhang Y. et al., 2024; Granata et al., 2012; Amiri et al., 2018; Forlini et al., 2024). Systems must confront persistent changes in environment geometry, lighting conditions, or occupant behavior, which can degrade originally trained models (Amiri et al., 2018). Continual learning approaches that leverage streaming sensor data could keep the robot’s perception and action policies up to date, though care must be taken to avoid catastrophic forgetting (Forlini et al., 2024). Equally important is capturing evolving user preferences, social context, and task requirements (Zhang Y. et al., 2024; Granata et al., 2012). Achieving robust online updates for these modules will require balancing data quality (potentially from incomplete or noisy real-world streams) with computational efficiency, as pointed out by Amiri et al. (2018) and Forlini et al. (2024). Future work may combine online transfer learning, policy-gradient RL, and environment mapping methods to sustain consistent performance in long-duration, continuously changing settings.
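The trade-off between noisy streaming data and stable long-term performance can also be handled by filtering low-quality samples and anchoring online updates to the currently deployed weights. The sketch below is a simplified L2-anchored update step, a lightweight stand-in for regularization-based continual learning rather than a method proposed in the cited works; the quality score, threshold, and regularization strength are illustrative assumptions.

```python
import copy
import torch

def online_update(model, anchor, loss_fn, opt, x, y, quality,
                  q_min=0.7, reg_lambda=1e-2):
    """One streaming update that (a) skips low-quality samples and
    (b) penalizes drift from a frozen copy of the deployed weights.
    Simplified stand-in for regularization-based continual learning."""
    if quality < q_min:          # e.g., low detector confidence, occlusion
        return None

    opt.zero_grad()
    loss = loss_fn(model(x), y)
    # L2 anchor keeps online updates close to the previously deployed model.
    drift = sum((p - p0).pow(2).sum()
                for p, p0 in zip(model.parameters(), anchor.parameters()))
    (loss + reg_lambda * drift).backward()
    opt.step()
    return loss.item()

# Usage: freeze a copy of the deployed model as the anchor.
model = torch.nn.Linear(8, 2)
anchor = copy.deepcopy(model).requires_grad_(False)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 8), torch.randn(4, 2)
online_update(model, anchor, torch.nn.functional.mse_loss, opt, x, y, quality=0.9)
```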
Overall, addressing these four broad directions—advanced learning paradigms, explainability and human trust, scalable multi-robot collaboration, and long-term autonomy—holds the potential to push MPDDM-based HRI toward more natural, capable, adaptable, and user-aligned systems in real-world practice.
5 Conclusion
In this survey, we have examined how multimodal perception can enrich decision-making in human-robot interaction across several application perspectives, demonstrating the value of multimodality for improved decision-making. By synthesizing insights from the existing literature, we showed that leveraging multiple sensory modalities not only increases robustness against sensor failures and environmental uncertainties but also provides richer context for understanding human states and intentions. Consequently, effectively fusing different data modalities into decision models that handle partial observability, real-time constraints, and evolving user behavior has emerged as a critical direction toward natural, robust, and safe human-robot interaction.
Despite these promising developments, several challenges remain. Real-world deployments still grapple with sensor noise, synchronization overhead, and the substantial computational burden of processing large-scale multimodal data in real time. Moreover, generalizing systems beyond controlled laboratory conditions poses considerable difficulties—especially when robots operate in diverse settings with varied user profiles, tasks, and cultural norms. Safety and trustworthiness also demand deeper investigation; while fusion-based models achieve higher accuracy, they can be opaque, making it difficult for end users/researchers to understand how a robot arrives at particular choices.
Looking ahead, the ongoing progress of learning-based methods and large-scale foundational models is poised to broaden the horizons of what multimodal perception and decision-making can accomplish. By striking a careful balance among computational efficiency, explainability, and responsiveness, future research can produce truly adaptive, socially aware robots that seamlessly integrate into daily life. Ultimately, overcoming these human-centered challenges will bring us closer to robots capable of robustly perceiving complex scenarios, inferring user intentions and needs, and collaborating safely and intelligently across a wide range of domains.
Author contributions
WZ: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration, Resources, Visualization, Writing – original draft, Writing – review and editing. KG: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Writing – review and editing. FY: Conceptualization, Data curation, Investigation, Methodology, Project administration, Supervision, Visualization, Writing – original draft, Writing – review and editing.
Funding
The author(s) declare that no financial support was received for the research and/or publication of this article.
Acknowledgments
The authors acknowledge the use of ChatGPT for editing and polishing the manuscript.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that Generative AI was used in the creation of this manuscript. ChatGPT was used for editing the authors’ own text to polish the manuscript and improve its readability.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Al-Qaderi, M. K., and Rad, A. B. (2018a). A brain-inspired multi-modal perceptual system for social robots: an experimental realization. IEEE Access 6, 35402–35424. doi:10.1109/access.2018.2851841
Al-Qaderi, M. K., and Rad, A. B. (2018b). A multi-modal person recognition system for social robots. Appl. Sci. 8, 387. doi:10.3390/app8030387
Aly, A. (2014). Towards an interactive human-robot relationship: Developing a customized robot Behavior to human profile. Ph.D. Thesis. Palaiseau, France: ENSTA ParisTech.
Amiri, S., Wei, S., Zhang, S., Sinapov, J., Thomason, J., and Stone, P. (2018). “Robot behavioral exploration and multi-modal perception using dynamically constructed controllers,” in 2018 AAAI spring symposium Series.
Amiri, S., Shirazi, M. S., and Zhang, S. (2020). Learning and reasoning for robot sequential decision making under uncertainty. Proc. AAAI Conf. Artif. Intell. 34, 2726–2733. doi:10.1609/aaai.v34i03.5659
Balakuntala, M. V., Kaur, U., Ma, X., Wachs, J., and Voyles, R. M. (2021). “Learning multimodal contact-rich skills from demonstrations without reward engineering,” in 2021 IEEE international conference on robotics and automation (ICRA) (IEEE), 4679–4685.
Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: a survey and taxonomy. IEEE Trans. pattern analysis Mach. Intell. 41, 423–443. doi:10.1109/TPAMI.2018.2798607
Banerjee, S., Silva, A., Feigh, K., and Chernova, S. (2018). Effects of interruptibility-aware robot behavior. arXiv Prepr. arXiv:1804.06383. doi:10.48550/arXiv.1804.06383
Baptista, J., Castro, A., Gomes, M., Amaral, P., Santos, V., Silva, F., et al. (2024). Human–robot collaborative manufacturing cell with learning-based interaction abilities. Robotics 13, 107. doi:10.3390/robotics13070107
Belcamino, V., Kilina, M., Lastrico, L., Carfì, A., and Mastrogiovanni, F. (2024). A modular framework for flexible planning in human-robot collaboration. arXiv Prepr. arXiv:2406.04907, 2303–2310. doi:10.1109/ro-man60168.2024.10731451
Bonci, A., Cen Cheng, P. D., Indri, M., Nabissi, G., and Sibona, F. (2021). Human-robot perception in industrial environments: a survey. Sensors 21, 1571. doi:10.3390/s21051571
Cai, Y., Zhao, Y., Zhang, W., and Xing, N. (2024). The multi-modal robot perception, language information, and environment prediction model based on deep learning. J. Organ. End User Comput. (JOEUC) 36, 1–21. doi:10.4018/joeuc.349987
Churamani, N., Barros, P., Gunes, H., and Wermter, S. (2022). Affect-driven learning of robot behaviour for collaborative human-robot interactions. Front. Robotics AI 9, 717193. doi:10.3389/frobt.2022.717193
Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., et al. (2020). “Soundspaces: audio-visual navigation in 3d environments,” in Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16 (Springer), 17–36.
Chen, B., Tong, X., Wan, J., Wang, L., Duan, X., Wang, Z., et al. (2025). Knowledge sharing-enabled low-code program for collaborative robots in mix-model assembly. J. Industrial Inf. Integration 45, 100824. doi:10.1016/j.jii.2025.100824
Churamani, N., Barros, P., Gunes, H., and Wermter, S. (2020). Affect-driven modelling of robot personality for collaborative human-robot interactions. arXiv Prepr. arXiv:2010.07221. doi:10.48550/arXiv.2010.07221
Cohen-Lhyver, B., Argentieri, S., and Gas, B. (2018). The head turning modulation system: an active multimodal paradigm for intrinsically motivated exploration of unknown environments. Front. Neurorobotics 12, 60. doi:10.3389/fnbot.2018.00060
Cuayáhuitl, H. (2020). A data-efficient deep learning approach for deployable multimodal social robots. Neurocomputing 396, 587–598. doi:10.1016/j.neucom.2018.09.104
Dağlarlı, E. (2020). A cognitive integrated multi-modal perception mechanism and dynamic world modeling for social robot assistants. J. Cognitive Syst. 5, 46–50.
Dağlarlı, E. (2023). Design of the integrated cognitive perception model for developing situation-awareness of an autonomous smart agent. Balkan J. Electr. Comput. Eng. 11, 283–292. doi:10.17694/bajece.1310607
Dean-Leon, E., Pierce, B., Bergner, F., Mittendorfer, P., Ramirez-Amaro, K., Burger, W., et al. (2017). “Tomm: tactile omnidirectional mobile manipulator,” in 2017 IEEE international conference on robotics and automation (ICRA) (IEEE), 2441–2447.
Deng, S., Kosloski, E. E., Patel, S., Barnett, Z. A., Nan, Y., Kaplan, A., et al. (2024). Hear me, see me, understand me: audio-visual autism behavior recognition. arXiv Prepr. arXiv:2406.02554. doi:10.1109/TMM.2024.3521838
Diab, M., and Demiris, Y. (2024). A framework for trust-related knowledge transfer in human–robot interaction. Aut. Agents Multi-Agent Syst. 38, 24. doi:10.1007/s10458-024-09653-w
Duan, J., Zhuang, L., Zhang, Q., Zhou, Y., and Qin, J. (2024). Multimodal perception-fusion-control and human–robot collaboration in manufacturing: a review. Int. J. Adv. Manuf. Technol. 132, 1071–1093. doi:10.1007/s00170-024-13385-2
Duncan, J. A., Alambeigi, F., and Pryor, M. W. (2024). A survey of multimodal perception methods for human-robot interaction in social environments. ACM Trans. Human-Robot Interact. 13, 1–50. doi:10.1145/3657030
Ferreira, J. F., Tsiourti, C., and Dias, J. (2012). Learning emergent behaviours for a hierarchical bayesian framework for active robotic perception. Cogn. Process. 13, 155–159. doi:10.1007/s10339-012-0481-9
Forlini, M., Babcinschi, M., Palmieri, G., and Neto, P. (2024). D-rmgpt: robot-assisted collaborative tasks driven by large multimodal models. arXiv Prepr. arXiv:2408.11761. doi:10.48550/arXiv.2408.11761
Granata, C., Bidaud, P., Chetouani, M., and Melchior, N. (2012). “Multimodal human detection and fuzzy decisional engine for interactive behaviors of a mobile robot,” in 2012 IEEE 3rd international conference on cognitive Infocommunications (CogInfoCom) (IEEE), 395–400.
Granata, C., Bidaud, P., Salini, J., and Ady, R. (2013). Human activity analysis: a personal robot integrating a framework for robust person detection and tracking and physical based motion analysis. Paladyn, J. Behav. Robotics 4, 131–146. doi:10.2478/pjbr-2013-0011
Grigorescu, S., Trasnea, B., Cocias, T., and Macesanu, G. (2020). A survey of deep learning techniques for autonomous driving. J. field robotics 37, 362–386. doi:10.1002/rob.21918
Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., and Huang, L. (2021). What makes multi-modal learning better than single (provably). Adv. Neural Inf. Process. Syst. 34, 10944–10956. doi:10.48550/arXiv.2106.04538
Ivaldi, S., Lyubova, N., Droniou, A., Padois, V., Filliat, D., Oudeyer, P.-Y., et al. (2013). Object learning through active exploration. IEEE Trans. Aut. Ment. Dev. 6, 56–72. doi:10.1109/TAMD.2013.2280614
Jahanmahin, R., Masoud, S., Rickli, J., and Djuric, A. (2022). Human-robot interactions in manufacturing: a survey of human behavior modeling. Robotics Computer-Integrated Manuf. 78, 102404. doi:10.1016/j.rcim.2022.102404
Ji, Z., Liu, Q., Xu, W., Liu, Z., Yao, B., Xiong, B., et al. (2020). “Towards shared autonomy framework for human-aware motion planning in industrial human-robot collaboration,” in 2020 IEEE 16th international conference on automation Science and engineering (CASE) (IEEE), 411–417.
Jia, N., Zheng, C., and Sun, W. (2022). A multimodal emotion recognition model integrating speech, video and mocap. Multimedia Tools Appl. 81, 32265–32286. doi:10.1007/s11042-022-13091-9
Khandelwal, P., Zhang, S., Sinapov, J., Leonetti, M., Thomason, J., Yang, F., et al. (2017). Bwibots: a platform for bridging the gap between ai and human–robot interaction research. Int. J. Robotics Res. 36, 635–659. doi:10.1177/0278364916688949
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., et al. (2024). Openvla: an open-source vision-language-action model. arXiv Prepr. arXiv:2406.09246. doi:10.48550/arXiv.2406.09246
Lai, Y., Yuan, S., Nassar, Y., Fan, M., Weber, T., and Rätsch, M. (2025). Nvp-hri: zero shot natural voice and posture-based human–robot interaction via large language model. Expert Syst. Appl. 268, 126360. doi:10.1016/j.eswa.2024.126360
Li, S., Zheng, P., Fan, J., and Wang, L. (2021). Toward proactive human–robot collaborative assembly: a multimodal transfer-learning-enabled action prediction approach. IEEE Trans. Industrial Electron. 69, 8579–8588. doi:10.1109/tie.2021.3105977
Liang, P. P., Zadeh, A., and Morency, L.-P. (2022). Foundations and trends in multimodal machine learning: Principles, challenges, and open questions. arXiv Prepr. arXiv:2209.03430. doi:10.48550/arXiv.2209.03430
Lu, Y., Fan, Y., Deng, B., Liu, F., Li, Y., and Wang, S. (2023). Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. IEEE/RSJ International Conference on Intelligent Robots and Systems IROS, 976–983.
Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., et al. (2024). Deepseek-vl: towards real-world vision-language understanding. arXiv Prepr. arXiv:2403.05525. doi:10.48550/arXiv.2403.05525
Ly, K. T., Lu, K., and Havoutis, I. (2024). Inteliplan: interactive lightweight llm-based planner for domestic robot autonomy. arXiv Prepr. arXiv:2409.14506. doi:10.48550/arXiv.2409.14506
Mathur, L., Qian, M., Liang, P. P., and Morency, L.-P. (2025). Social genome: grounded social reasoning abilities of multimodal models. arXiv Prepr. arXiv:2502.15109. doi:10.48550/arXiv.2502.15109
Mei, A., Zhu, G.-N., Zhang, H., and Gan, Z. (2024). Replanvlm: Replanning robotic tasks with visual language models. IEEE Robotics Automation Lett. 9, 10201–10208. doi:10.1109/lra.2024.3471457
Menezes, J. C. (2024). “Mumodar: multi-modal framework for human-robot collaboration in cyber-physical systems,” in Companion of the 2024 ACM/IEEE international conference on human-robot interaction, 755–759.
Moher, D., Liberati, A., Tetzlaff, J., and Altman, D. G. (2009). Preferred reporting items for systematic reviews and meta-analyses: the prisma statement. Ann. Intern. Med. 151, 264–269. doi:10.7326/0003-4819-151-4-200908180-00135
Nan, M., Ghiță, A. S., Gavril, A.-F., Trascau, M., Sorici, A., Cramariuc, B., et al. (2019). “Human action recognition for social robots,” in 2019 22nd international conference on Control systems and computer Science (CSCS) (IEEE), 675–681.
Pandey, A. K., and Gelin, R. (2018). A Mass-Produced sociable humanoid robot: Pepper: the first machine of its Kind. IEEE Robotics & Automation Mag. 25, 40–48. doi:10.1109/mra.2018.2833157
Panigrahi, B., Raj, A. H., Nazeri, M., and Xiao, X. (2023). A study on learning social robot navigation with multimodal perception. arXiv Prepr. arXiv:2309.12568. doi:10.48550/arXiv.2309.12568
Pequeño-Zurro, A., Ignasov, J., Ramírez, E. R., Haarslev, F., Juel, W. K., Bodenhagen, L., et al. (2022). Proactive control for online individual user adaptation in a welfare robot guidance scenario: toward supporting elderly people. IEEE Trans. Syst. man, Cybern. Syst. 53, 3364–3376. doi:10.1109/TSMC.2022.3224366
Qin, C., Song, A., Wei, L., and Zhao, Y. (2023). A multimodal domestic service robot interaction system for people with declined abilities to express themselves. Intell. Serv. Robot. 16, 373–392. doi:10.1007/s11370-023-00466-6
Reimann, M. M., Kunneman, F. A., Oertel, C., and Hindriks, K. V. (2024). A survey on dialogue management in human-robot interaction. ACM Trans. Human-Robot Interact. 13, 1–22. doi:10.1145/3648605
Robinson, N., Tidd, B., Campbell, D., Kulić, D., and Corke, P. (2023). Robotic vision for human-robot interaction and collaboration: a survey and systematic review. ACM Trans. Human-Robot Interact. 12, 1–66. doi:10.1145/3570731
Roh, J. (2022). Approaches for interactions in robotics applications. Seattle, WA: University of Washington.
Rossi, A., Rossi, S., Maro, M., and Origlia, A. (2024). Brillo: personalised hri with a bartender robot.
Schmidt-Rohr, S. R., Knoop, S., Lösch, M., and Dillmann, R. (2008a). “A probabilistic control architecture for robust autonomy of an anthropomorphic service robot,” in International conference on cognitive systems. Karlsruhe, Germany.
Schmidt-Rohr, S. R., Knoop, S., Lösch, M., and Dillmann, R. (2008b). “Reasoning for a multi-modal service robot considering uncertainty in human-robot interaction,” in Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, 249–254.
Scicluna, A., Le Gentil, C., Sutjipto, S., and Paul, G. (2024). “Towards robust perception for assistive robotics: an rgb-event-lidar dataset and multi-modal detection pipeline,” in 2024 IEEE 20th international conference on automation Science and engineering (CASE) (IEEE), 920–925.
Sha, Y. (2024). Multimodal perception system for real open environment. arXiv Prepr. arXiv:2410.07926. doi:10.48550/arXiv.2410.07926
Siqueira, H., Sutherland, A., Barros, P., Kerzel, M., Magg, S., and Wermter, S. (2018). “Disambiguating affective stimulus associations for robot perception and dialogue,” in 2018 IEEE-RAS 18th international conference on humanoid robots (Humanoids) (IEEE), 1–9.
Siva, S., and Zhang, H. (2022). Robot perceptual adaptation to environment changes for long-term human teammate following. Int. J. Robotics Res. 41, 706–720. doi:10.1177/0278364919896625
Song, D., Liang, J., Payandeh, A., Xiao, X., and Manocha, D. (2024a). Socially aware robot navigation through scoring using vision-language models. arXiv Prepr. arXiv:2404.00210. doi:10.1109/LRA.2024.3511409
Song, Y., Sun, P., Liu, H., Li, Z., Song, W., Xiao, Y., et al. (2024b). Scene-driven multimodal knowledge graph construction for embodied ai. IEEE Trans. Knowl. Data Eng. 36, 6962–6976. doi:10.1109/tkde.2024.3399746
Su, H., Qi, W., Chen, J., Yang, C., Sandoval, J., and Laribi, M. A. (2023). Recent advancements in multimodal human–robot interaction. Front. Neurorobotics 17, 1084000. doi:10.3389/fnbot.2023.1084000
Tang, D., Yusuf, B., Botzheim, J., Kubota, N., and Chan, C. S. (2015). A novel multimodal communication framework using robot partner for aging population. Expert Syst. Appl. 42, 4540–4555. doi:10.1016/j.eswa.2015.01.016
Vaufreydaz, D., Johal, W., and Combe, C. (2016). Starting engagement detection towards a companion robot using multimodal features. Robotics Aut. Syst. 75, 4–16. doi:10.1016/j.robot.2015.01.004
Wang, S. (2023). Res-flnet: human-robot interaction and collaboration for multi-modal sensing robot autonomous driving tasks based on learning control algorithm. Front. Neurorobot. 17, 1269105. doi:10.3389/fnbot.2023.1269105
Wang, Y., and Feng, Z. (2024). Multimodal massage localization algorithm for human acupoints. Int. J. Human–Computer Interact. 41, 6011–6028. doi:10.1080/10447318.2024.2372893
Wang, T., Zheng, P., Li, S., and Wang, L. (2024). Multimodal human–robot interaction for human-centric smart manufacturing: a survey. Adv. Intell. Syst. 6, 2300359. doi:10.1002/aisy.202300359
Wu, H., Xu, Z., Yan, W., Su, Q., Li, S., Cheng, T., et al. (2019). Incremental learning introspective movement primitives from multimodal unstructured demonstrations. IEEE Access 7, 159022–159036. doi:10.1109/access.2019.2947529
Xie, Z., and Dames, P. (2023). Drl-vo: learning to navigate through crowded dynamic scenes using velocity obstacles. IEEE Trans. Robotics 39, 2700–2719. doi:10.1109/tro.2023.3257549
Yasar, M. S., Islam, M. M., and Iqbal, T. (2024). Imprint: Interactional dynamics-aware motion prediction in teams using multimodal context. ACM Trans. Human-Robot Interact. 13, 1–29. doi:10.1145/3626954
Yi, J.-B., Kang, T., Song, D., and Yi, S.-J. (2020). Unified software platform for intelligent home service robots. Appl. Sci. 10, 5874. doi:10.3390/app10175874
Yu, C. (2021). Robot behavior generation and human behavior understanding in natural human-robot interaction. Ph.D. Thesis. Paris, France: Institut Polytechnique de Paris.
Yuan, F., Bray, R., Oliver, M., Duzan, J., Crane, M., and Zhao, X. (2024). A social robot-facilitated performance assessment of self-care skills for people with alzheimer’s: a preliminary study. Int. J. Soc. Robotics 16, 2065–2078. doi:10.1007/s12369-024-01174-6
Zhang, X., Sinapov, J., and Zhang, S. (2021). Planning multimodal exploratory actions for online robot attribute learning. arXiv Prepr. arXiv:2106.03029. doi:10.48550/arXiv.2106.03029
Zhang, X., Amiri, S., Sinapov, J., Thomason, J., Stone, P., and Zhang, S. (2023a). Multimodal embodied attribute learning by robots for object-centric action policies. Aut. Robots 47, 505–528. doi:10.1007/s10514-023-10098-5
Zhang, Z., Chai, W., and Wang, J. (2023b). Mani-gpt: a generative model for interactive robotic manipulation. Procedia Comput. Sci. 226, 149–156. doi:10.1016/j.procs.2023.10.649
Zhang, X., Altaweel, Z., Hayamizu, Y., Ding, Y., Amiri, S., Yang, H., et al. (2024a). Dkprompt: domain knowledge prompting vision-language models for open-world planning. arXiv Prepr. arXiv:2406.17659. doi:10.48550/arXiv.2406.17659
Zhang, Y., Liu, Y., Liu, S., Liang, W., Wang, C., and Wang, K. (2024b). Multimodal perception for indoor mobile robotics navigation and safe manipulation. IEEE Trans. Cognitive Dev. Syst., 1–13. doi:10.1109/tcds.2024.3481457
Zhao, Z., Chung, E., Chung, K.-M., and Park, C. H. (2024). Av-fos: a transformer-based audio-visual interaction style recognition for children with autism based on the family observation schedule (fos-ii). Authorea Prepr. doi:10.1109/JBHI.2025.3542066
Zhou, T., and Wachs, J. P. (2019). Spiking neural networks for early prediction in human–robot collaboration. Int. J. Robotics Res. 38, 1619–1643. doi:10.1177/0278364919872252
Zhou, Z., Zhu, Y., Zhu, M., Wen, J., Liu, N., Xu, Z., et al. (2025). Chatvla: unified multimodal understanding and robot control with vision-language-action model. arXiv Prepr. arXiv:2502.14420. doi:10.48550/arXiv.2502.14420
Keywords: multimodal perception, robot decision-making, human-robot interaction, multimodal fusion, robust autonomy
Citation: Zhao W, Gangaraju K and Yuan F (2025) Multimodal perception-driven decision-making for human-robot interaction: a survey. Front. Robot. AI 12:1604472. doi: 10.3389/frobt.2025.1604472
Received: 01 April 2025; Accepted: 05 August 2025;
Published: 22 August 2025.
Edited by:
Bruno Lara, Autonomous University of the State of Morelos, Mexico
Reviewed by:
Carmela Calabrese, Italian Institute of Technology (IIT), Italy
Ervin Jesus Alvarez Sanchez, Universidad Veracruzana, Mexico
Copyright © 2025 Zhao, Gangaraju and Yuan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Fengpei Yuan, fyuan3@wpi.edu