ORIGINAL RESEARCH article

Front. Comput. Sci.

Sec. Computer Vision

Volume 7 - 2025 | doi: 10.3389/fcomp.2025.1576775

Convolutional Spatio-Temporal Sequential Inference Model for Human Interaction Behavior Recognition

Provisionally accepted
Lizhong Jin*, Rulong Fan, Xiaoling Han, Xueying Cui
  • Taiyuan University of Science and Technology, Taiyuan, China

The final, formatted version of the article will be published soon.

Research on human action recognition holds significant theoretical and practical importance. It spans multiple disciplines, including computer vision, pattern recognition, and machine learning, which makes accurately modeling and recognizing human actions a complex and dynamic challenge. Research on human interaction behavior recognition in particular faces several difficulties, such as correctly modeling human behaviors and distinguishing the order of interactions between the parties involved. Current techniques mainly comprise skeleton sequence-based and RGB video-based recognition models. While recent methods such as ActCLR and AutoGCN achieve high accuracy, they often require extensive computational resources. This paper focuses on improving RGB video-based models to enable effective fusion with the skeleton point-based modality. In fusing RGB video-based and skeleton point-based recognition models, we address the challenges of multi-modal data fusion and the excessive parameter counts of temporal sequence models. We propose a Convolutional Spatio-Temporal Sequential Inference Model (CSSIModel) for recognizing human interaction behaviors. CSSIModel uses DINet to extract features from skeleton data and ResNet-18 to extract features from RGB frames; it then fuses these features and applies a multi-scale two-dimensional convolutional peak-valley inference module to classify human behavior categories from the fused representation. Experimental results demonstrate that CSSIModel achieves 87.4% accuracy on NTU RGB+D 60 (XSub), 94.1% on NTU RGB+D 60 (XView), 80.5% on NTU RGB+D 120 (XSub), and 84.9% on NTU RGB+D 120 (XSet). These results are competitive with or superior to state-of-the-art models, highlighting the model's strong generalization, especially on large-scale datasets.
Our method balances accuracy and efficiency: it achieves competitive performance while being significantly more lightweight than existing approaches, making it suitable for real-time applications and providing a valuable reference for future research in human behavior recognition.
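The abstract describes a two-stream, late-fusion design: one stream extracts skeleton features, another extracts RGB features, the two vectors are fused, and a classifier predicts the action class. The following is a minimal illustrative sketch of that pipeline only; the stand-in extractors and the toy linear classifier are assumptions for demonstration and are not the paper's DINet, ResNet-18, or peak-valley inference module.

```python
def extract_skeleton_features(skeleton_seq):
    """Stand-in for the skeleton stream (DINet in the paper):
    temporal mean of each joint coordinate over the sequence."""
    n_frames, dim = len(skeleton_seq), len(skeleton_seq[0])
    return [sum(frame[i] for frame in skeleton_seq) / n_frames
            for i in range(dim)]

def extract_rgb_features(rgb_frames):
    """Stand-in for the RGB stream (ResNet-18 in the paper):
    mean intensity of each frame."""
    return [sum(frame) / len(frame) for frame in rgb_frames]

def fuse(skel_feat, rgb_feat):
    """Late fusion by simple concatenation of the two feature vectors."""
    return skel_feat + rgb_feat

def classify(fused, weights, biases):
    """Toy linear classifier: score each class, return the argmax index."""
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: 2 skeleton frames of 2 coordinates, 2 RGB frames of 2 pixels.
skel_feat = extract_skeleton_features([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
rgb_feat = extract_rgb_features([[0.0, 2.0], [4.0, 6.0]])        # [1.0, 5.0]
fused = fuse(skel_feat, rgb_feat)                                # length 4
label = classify(fused, weights=[[1, 0, 0, 0], [0, 0, 0, 1]], biases=[0, 0])
```

In the paper the fused representation is processed by a multi-scale 2D convolutional module rather than a linear layer; this sketch only conveys the stream-fuse-classify structure.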

Keywords: human behavior recognition, deep learning, multimodal learning, skeleton point sequence information, time series recognition, inference model

Received: 14 Feb 2025; Accepted: 16 Jun 2025.

Copyright: © 2025 Jin, Fan, Han and Cui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Lizhong Jin, Taiyuan University of Science and Technology, Taiyuan, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.