AUTHOR=Jin Lizhong, Fan Rulong, Han Xiaoling, Cui Xueying

TITLE=Convolutional spatio-temporal sequential inference model for human interaction behavior recognition

JOURNAL=Frontiers in Computer Science

VOLUME=7

YEAR=2025

URL=https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1576775

DOI=10.3389/fcomp.2025.1576775

ISSN=2624-9898

ABSTRACT=Introduction: Human action recognition is a critical task with broad applications and remains a challenging problem due to the complexity of modeling dynamic interactions between individuals. Existing methods, including skeleton sequence-based and RGB video-based models, have achieved impressive accuracy but often suffer from high computational costs and limited effectiveness in modeling human interaction behaviors. Methods: To address these limitations, we propose a lightweight Convolutional Spatiotemporal Sequence Inference Model (CSSIModel) for recognizing human interaction behaviors. The model extracts features from skeleton sequences using DINet and from RGB video frames using ResNet-18. These multi-modal features are fused and processed using a novel multiscale two-dimensional convolutional peak-valley inference module to classify interaction behaviors. Results: CSSIModel achieves competitive results across several benchmark datasets: 87.4% accuracy on NTU RGB+D 60 (XSub), 94.1% on NTU RGB+D 60 (XView), 80.5% on NTU RGB+D 120 (XSub), and 84.9% on NTU RGB+D 120 (XSet). These results are comparable to or exceed those of state-of-the-art methods. Discussion: The proposed method effectively balances accuracy and computational efficiency. By significantly reducing model complexity while maintaining high performance, CSSIModel is well-suited for real-time applications and provides a valuable reference for future research in multi-modal human behavior recognition.