AUTHOR=Shou Zaiyong , Zhu Daoyu 

TITLE=Multi-modal action recognition via advanced image fusion techniques for cyber-physical systems

JOURNAL=Frontiers in Physics

VOLUME=Volume 13 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/physics/articles/10.3389/fphy.2025.1576591

DOI=10.3389/fphy.2025.1576591

ISSN=2296-424X

ABSTRACT=IntroductionThe increasing complexity of cyber-physical systems (CPS) demands robust and efficient action recognition frameworks capable of seamlessly integrating multi-modal data. Traditional methods often lack adaptability and perform poorly when integrating diverse information sources, such as spatial and temporal cues from diverse image sources.MethodsTo address these limitations, we propose a novel Multi-Scale Attention-Guided Fusion Network (MSAF-Net), which leverages advanced image fusion techniques to significantly enhance action recognition performance in CPS environments. Our approach capitalizes on multi-scale feature extraction and attention mechanisms to dynamically adjust the contributions from multiple modalities, ensuring optimal preservation of both structural and textural information. Unlike conventional spatial or transform-domain fusion methods, MSAF-Net integrates adaptive weighting schemes and perceptual consistency measures, effectively mitigating challenges such as over-smoothing, noise sensitivity, and poor generalization to unseen scenarios.ResultThe model is designed to handle the dynamic and evolving nature of CPS data, making it particularly suitable for applications such as surveillance, autonomous systems, and human-computer interaction. Extensive experimental evaluations demonstrate that our approach not only outperforms state-of-the-art benchmarks in terms of accuracy and robustness but also exhibits superior scalability across diverse CPS contexts.DiscussionThis work marks a significant advancement in multi-modal action recognition, paving the way for more intelligent, adaptable, and resilient CPS frameworks. MSAF-Net has strong potential for application in medical imaging, particularly in multi-modal diagnostic tasks such as combining MRI, CT, or PET scans to enhance lesion detection and image clarity, which is essential in clinical decision-making.