An initial prediction and fine-tuning model based on improving GCN for 3D human motion prediction

Human motion prediction is one of the fundamental problems of computer vision. Much deep-learning-based work has shown impressive performance on it in recent years. However, long-term prediction and human skeletal deformation remain challenging. For accurate prediction, this paper proposes a GCN-based two-stage prediction method. We train a prediction model in the first stage. Using multiple cascaded spatial attention graph convolution layers (SAGCL) to extract features, the prediction model generates an initial motion sequence of future actions based on the observed pose. Since the initial pose generated in the first stage often deviates from natural human body motion, for example a motion sequence in which the length of a bone changes, the task of the second stage is to fine-tune the predicted pose and make it closer to natural motion. We present a fine-tuning model consisting of multiple cascaded causal temporal-graph convolution layers (CT-GCL). We apply the spatial coordinate error of joints and the bone length error as loss functions to train the fine-tuning model. We validate our model on the Human3.6m and CMU-MoCap datasets. Extensive experiments show that the two-stage prediction method outperforms state-of-the-art methods. The limitations of the proposed method are discussed as well, hoping to inform future exploration.

1. Introduction

3D skeleton-based human motion prediction uses action postures observed in the past to predict action postures in the future. Motion prediction technology helps robots understand human behavior. This technology is of great value in areas such as intelligent security, autonomous driving (Ge et al., 2019; Djuric et al., 2020; Gao et al., 2020), object tracking, and human-robot collaboration (Liu and Wang, 2017; Oguz et al., 2017; Liu et al., 2019a, 2021; Li et al., 2020b; Liu and Liu, 2020; Ding et al., 2021; Mao et al., 2021).
Recurrent neural networks (RNN) are usually adopted to solve sequence-to-sequence prediction tasks, such as voice recognition and automatic translation (Tang R. et al., 2018; Iida et al., 2019; Yao et al., 2021). Due to the sequential nature of motion in the time dimension, many works use RNN to realize human motion prediction (Fragkiadaki et al., 2015; Chiu et al., 2019; Guo and Choi, 2019; Corona et al., 2020). However, RNN-based networks are usually difficult to train and suffer from error accumulation in long-term prediction. A few works adopt convolution networks (CN) to solve the problem of human behavior prediction (Butepage et al., 2017; Li et al., 2018; Shu et al., 2021). They process human motion sequences as images and use 2D convolution to generate the prediction sequence. Nevertheless, human motion sequences are not traditional image data, and traditional convolution neural networks are limited in processing such sequences. In recent years, much work has used graph convolution networks (GCN) to solve human motion prediction tasks and achieved excellent results (Aksan et al., 2019; Cui et al., 2020; Dang et al., 2021). GCN is similar to CNN except that it performs feature extraction on graphs. GCN usually defines an adjacency matrix in advance, representing the interconnection relationships between the nodes in the graph, and then generates new node information by aggregating information from related nodes. In addition, by aggregating action information efficiently, recent work (Mao et al., 2019) confirms that the discrete cosine transform (DCT) has great advantages in motion prediction. Taking advantage of GCN, many works have achieved good performance, but they also expose some shortcomings of GCN. For example, many GCN-based methods convert joint information to the frequency domain for prediction and then recover the time-domain information, causing the generated joint positions to be unsmooth in the time domain.
In addition, many methods change the bone lengths of the human body, causing skeletal deformation.
Inspired by the concept of two-stage prediction (Shan et al., 2022), we present a two-stage framework to solve the above problems, including a prediction stage and a fine-tuning stage, to achieve precise prediction of human motion sequences. The task of the prediction stage is to use DCT to encode the motion information and then use the attention mechanism to calculate attention scores that strengthen the interaction of each node. We then use IDCT (inverse discrete cosine transform) to decode the aggregated features back into the original 3D pose, generating the initial prediction of the first stage.
We observe that the initial prediction always has a certain deviation from the ground truth. To solve this problem, we construct a fine-tuning model to correct the initial predictions of the first stage. Observing that the actors for different actions in the datasets are the same, each frame in an action sequence should contain the same body structure information, such as the length of each bone. We therefore add a bone length constraint term to the loss function of the fine-tuning model. Since the motion sequences generated in the frequency domain are not coherent in the time domain, the traditional T-GCN method, which uses a global adjacency matrix to aggregate sequence information, often makes predicted actions deviate from reality. In response to this problem, we propose a CMM (causal mask matrix) to improve the T-GCN and fine-tune the initial prediction, making each generated future frame depend only on its previous information, which eliminates the effect of inaccurate future information when constructing the current frame.
We use MPJPE as the metric to evaluate our network on Human3.6m and CMU-MoCap and conduct ablation experiments to analyze our key modules. Extensive comparative experiments show that our method achieves more accurate predictions than existing approaches.
In summary, the main contributions of this paper can be concluded as follows:
• We propose a two-stage training method, including prediction and fine-tuning stages. The fine-tuning stage corrects the human motion sequences generated by the prediction stage.
• To further utilize the interactive information in the temporal structure of human motion, we present a CMM that improves the T-GCN in the fine-tuning stage to reconstruct the sequence in a causal temporal order.
• To improve the power of GCN to extract the spatial interaction information of the human body, we introduce a SAB (spatial attention block) to aggregate node information along the spatial dimension. Moreover, we incorporate the constraint of length invariance of human bones to guide the framework toward generating more realistic human motion sequences.
2. Related work

2.1. RNN-based method

RNN-based methods are widely used for sequence-to-sequence tasks (Jain et al., 2016b; Martinez et al., 2017; Liu et al., 2019b; Sang et al., 2020). According to the characteristics of human motion sequences, many works use RNN as the basic structure of the network. By embedding encoder and decoder networks before and after recurrent layers, Fragkiadaki et al. (2015) propose an Encoder-Recurrent-Decoder (ERD) model for predicting human motion. Jain et al. (2016a) combine RNNs with the spatiotemporal structure of the human body, proposing the Structural-RNN. Liu et al. (2019b) develop a hierarchical recurrent network structure to simultaneously encode the local context of a single frame and the global context of a sequence.
However, RNN combines the hidden state of the previous unit to output the prediction of the next unit, which causes errors to accumulate. These methods cannot avoid the error accumulation problem. Error accumulation causes discontinuities in generated frames, resulting in unrealistic human motion sequences. Gui et al. (2018) propose a novel sequence-to-sequence model that adds a residual connection between the input and output of each RNN module, which alleviates the discontinuity problem of the RNN model. Guo and Choi (2019) modify the seq2seq framework to encode temporal correlations at different time scales. Shu et al. (2021) design a new bone-joint attention mechanism to dynamically learn a bone-joint attention feature map, making the generated action sequences closer to reality. Although these methods effectively improve the accuracy of prediction, their performance in long-term prediction is still insufficient.

2.2. GCN-based method
Compared with traditional CNN-based methods, GCN-based methods have significant advantages in the face of irregular data structures, such as social networks (Tabassum et al., 2018; Li et al., 2020a) and human body posture and behavior (Fan et al., 2019; Chen et al., 2020). In recent years, graph neural networks have been widely used for 3D human motion prediction and have achieved outstanding results. Lebailly et al. (2020) use a GCN as the encoder and another GCN to decode the aggregated features. The works of Mao et al. (2019), Cui et al. (2020), and Dang et al. (2021) build their models entirely on GCN. Li et al. (2020b) encode the human body at multiple scales and perform information fusion, proposing the DMGNN. Ma et al. (2022) use a spatiotemporal GCN to obtain more accurate long-term predictions by predicting the median value of human motion. Mao et al. (2019) use the discrete cosine transform to encode human motion sequences and design a GCN-based model that automatically learns node relationships. Although these GCN-based methods have achieved good results, they still have not solved the following problems. Methods that predict in the frequency domain often fail to capture the time dependence of the original information. The future human motion sequences generated by these methods do not follow the bone constraints; in other words, the bone lengths of the generated skeleton change. In order to address the time dependence problem of frequency-domain prediction methods, we propose a two-stage network architecture. The first stage makes an initial prediction in the frequency domain, and the second stage fine-tunes the initial result in the time domain. In the second stage, we use CMM to make the adjacency matrix causal.
In order to make the prediction results follow the bone constraints, we propose the SAB to enhance the ability to capture the spatial interaction relationships of the joints, and we add the length of every bone as a constraint to train the model. We introduce the details of our framework architecture in the following sections.

3. Problem formulation
Suppose that $X_{-T_p:0} = [X_{-T_p}, \ldots, X_0]$ denotes the historical human motion sequence of length $T_p + 1$ and $X_{1:T_f} = [X_1, \ldots, X_{T_f}]$ denotes the future sequence of length $T_f$, where $X_i \in \mathbb{R}^{N \times D}$ with $N$ joints and $D = 3$ feature dimensions depicts the 3D human pose at time $i$. The task of 3D human motion prediction is to generate the future sequence $X_{1:T_f}$ given the historical one.
For predicting complex human motion more accurately, we use a two-stage prediction method based on GCN. We use cascaded SAGCLs to generate the first-stage prediction. Then, the initial prediction is fine-tuned using the spatiotemporal constraints of the human body to obtain a second-stage prediction that is closer to real human movement.

4. Methodology

4.1. Prediction and fine-tuning framework
In order to predict future motion sequences precisely, we adopt a two-stage training method, as shown in Figure 1. Given the human motion of $T$ frames observed in the past, $X_- = [X_{-T}, X_{-T+1}, \ldots, X_{-1}]$, we first apply DCT along the time dimension to convert the temporal dynamics of the motion into the frequency domain.
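The DCT/IDCT transform along the time dimension can be sketched as follows (a minimal numpy illustration, not the authors' code; the orthonormal DCT-II basis and the sequence sizes are assumptions):

```python
import numpy as np

def dct_matrix(L):
    # Orthonormal DCT-II basis: row k holds the k-th cosine basis function,
    # so D @ D.T = I and the inverse transform is simply D.T.
    n = np.arange(L)
    D = np.sqrt(2.0 / L) * np.cos(np.pi / L * (n[None, :] + 0.5) * n[:, None])
    D[0] /= np.sqrt(2.0)
    return D

# Hypothetical sizes: T = 10 observed frames, N = 22 joints with 3D coords.
T, N = 10, 22
X = np.random.randn(T, N * 3)      # motion sequence, joints flattened
D = dct_matrix(T)
C = D @ X                          # DCT along the time dimension
X_rec = D.T @ C                    # IDCT recovers the original sequence
assert np.allclose(X, X_rec)
```

Because the basis is orthonormal, encoding and decoding are exact inverses; the network operates on the coefficients `C` rather than on raw joint positions.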
The encoder in the prediction model aggregates the node information in the frequency domain using SAGCL and SAGCB (spatial attention graph convolution block), and the same structured decoder then generates the predicted information in the frequency domain. The prediction model then uses IDCT to restore the predicted joint information to the time domain. As shown in Figure 1, the red skeleton in the middle denotes the initial prediction of the first stage. We use the joint position errors as constraints to train the prediction model. The skeleton is marked red to highlight the problems of discontinuity and skeletal deformation.
We reduce the impact of these problems in the second stage. CT-GCL predicts depending only on the past sequence, and we consider both the joint position and the bone length constraints to train the fine-tuning model. The second stage corrects the bone lengths and keeps the temporal dependence, which makes the prediction closer to natural human motion.

4.2. Prediction model (frequency domain)
Based on S-GCN (spatial graph convolution layer) and T-GCN (temporal graph convolution layer), we build an encoder-decoder human motion prediction model. Both the encoder and decoder contain a SAGCL and a SAGCB. Each SAGCB includes six SAGCLs. The structure of SAGCL is shown in Figure 2. When the motion information flows through SAGCL, SAB first extracts the interaction information between human joints. In SAB, we use an average pooling layer along the temporal dimension to aggregate the interactive information of each human body node. SAB calculates a gating weight in the interval $(0, 1)$ for each node using the Sigmoid function and finally aggregates the information between dependent nodes according to the dynamic joint weights. The weight matrix of SAB can be calculated by the following formula:

$$A_{att} = \mathrm{Sigmoid}\left(W \cdot \mathrm{AvgPool}(H) + b\right)$$

where $H$ represents the hidden feature, $W$ and $b$ are the parameter matrix and bias vector of the FC layer, respectively, AvgPool denotes average pooling along the temporal dimension, and the Sigmoid function computes the joint-wise $(0, 1)$ gating weights.
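One possible reading of the SAB gating, sketched in numpy (the FC-layer shape and the exact placement of the pooling are assumptions drawn from the description above, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sab_gate(H, W, b):
    """Spatial attention gating: average-pool the hidden features over
    time, pass them through an FC layer with a Sigmoid, and reweight
    each joint's features by the resulting 0-1 gate.
    H: (L, M, F) features; W: (F, F) FC weights; b: (F,) FC bias."""
    pooled = H.mean(axis=0)          # AvgPool along the temporal dim -> (M, F)
    gate = sigmoid(pooled @ W + b)   # joint-wise gating weights in (0, 1)
    return H * gate[None, :, :]      # broadcast the gates over all frames

L, M, F = 10, 22, 64                 # hypothetical sequence/joint/feature sizes
H = np.random.randn(L, M, F)
out = sab_gate(H, 0.1 * np.random.randn(F, F), np.zeros(F))
assert out.shape == H.shape
```

Since the gates lie in $(0, 1)$, the output features are always a damped copy of the input, letting the network softly select which joints' information to pass on.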
Then S-GCN aggregates interaction information along the spatial dimension. Let $X \in \mathbb{R}^{L \times M \times F}$ be a pose sequence, where $L$ is the length of the sequence, $M$ is the number of joints of a pose, and $F$ is the number of features of a joint. Defining a learnable adjacency matrix $A_s \in \mathbb{R}^{M \times M}$, the elements of which measure relationships between pairs of joints of a pose, S-GCN works as:

$$H^{(l+1)} = \sigma\left(A_s^{(l)} H^{(l)} W^{(l)}\right)$$

where $l$ denotes the parameters in the $l$-th layer and $\sigma$ represents the Leaky ReLU. T-GCN aggregates interaction information along the temporal dimension. Defining a learnable adjacency matrix $A_t \in \mathbb{R}^{L \times L}$ measuring weights between pairs of frames of a trajectory, T-GCN computes:

$$H^{(l+1)} = \sigma\left(A_t^{(l)} H^{(l)} W^{(l)}\right)$$

Frontiers in Computational Neuroscience frontiersin.org He et al.
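The S-GCN and T-GCN updates above can be sketched as follows (an illustrative numpy version with random weights; the layer sizes are assumptions):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def s_gcn(X, A_s, W):
    """Spatial GCN layer: aggregate features over joints with a
    learnable M x M adjacency A_s. X: (L, M, F_in), W: (F_in, F_out)."""
    return leaky_relu(np.einsum('mn,lnf,fg->lmg', A_s, X, W))

def t_gcn(X, A_t, W):
    """Temporal GCN layer: aggregate features over frames with a
    learnable L x L adjacency A_t. X: (L, M, F_in), W: (F_in, F_out)."""
    return leaky_relu(np.einsum('kl,lmf,fg->kmg', A_t, X, W))

L, M, F = 10, 22, 16                 # hypothetical sizes
X = np.random.randn(L, M, F)
out_s = s_gcn(X, 0.1 * np.random.randn(M, M), 0.1 * np.random.randn(F, F))
out_t = t_gcn(X, 0.1 * np.random.randn(L, L), 0.1 * np.random.randn(F, F))
```

The two layers are structurally identical; they differ only in which axis the learnable adjacency mixes, joints for S-GCN and frames for T-GCN.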

FIGURE
Overview of our prediction and fine-tuning human motion prediction framework containing a prediction model and a fine-tuning model. SAGCL denotes the spatial attention graph convolution layer, and the SAGCB consists of cascaded SAGCLs. The black skeleton represents the ground truth. The red skeleton represents the prediction of the first stage, and the green skeleton represents the result of fine-tuning.

FIGURE
The structure of SAGCL. SAB denotes the spatial attention block. S-GCN and T-GCN indicate the spatial and temporal GCN, respectively. BN indicates the batch normalization operation.

4.3. Fine-tuning model (time domain)
Previous work based on graph convolution often attends to the global historical information through the temporal adjacency matrix when generating future action sequences. However, the results predicted in the frequency domain are not always smooth in the time series. Using unsmooth global historical motion information tends to corrupt the future sequence generated by the network. Therefore, we build a fine-tuning model based on cascaded CT-GCLs to reconstruct the whole future sequence in the time domain. The input of the fine-tuning model is the complete output of the prediction model, and its output is a new sequence adjusted by the constraints. As shown in Figure 3, each CT-GCL mainly consists of the S-GCN, CMM, and T-GCN, creating a new mapping between the temporally independent motion sequence and the temporally causal sequence.
CMM adjusts the node positions predicted in the first stage along the temporal dimension so that the node position at each moment is related only to previous moments. As shown in Figure 4, CMM is initialized as an upper triangular matrix and takes a Hadamard product with the adjacency matrix in the temporal dimension. As a result, only the information at the current moment and before is aggregated when CT-GCL reconstructs the future motion sequence.
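The masking step can be sketched as follows (a numpy sketch; whether the mask is upper or lower triangular depends on the row/column convention of the adjacency matrix, so the triangle used here is a convention choice, not the authors' exact code):

```python
import numpy as np

L = 5                                # hypothetical future-sequence length
A_t = np.random.randn(L, L)          # learnable temporal adjacency matrix

# Causal mask matrix: with rows indexing the output frame, keeping only
# columns j <= i means frame i aggregates information from frames <= i.
CMM = np.tril(np.ones((L, L)))
A_causal = A_t * CMM                 # Hadamard product, as in the paper

# The first frame can now only attend to itself:
assert np.count_nonzero(A_causal[0, 1:]) == 0
```

Because the mask is applied elementwise, the remaining entries of `A_t` stay learnable; only the acausal connections are zeroed out.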

4.4. Loss function
For training the prediction network, we consider the L2 loss function for 3D joint positions. Suppose that the prediction sample is $\hat{\chi}$ and the corresponding ground truth is $\chi$. For $T$ frames and $K$ joints, the loss function is:

$$\mathcal{L}_{p} = \frac{1}{T \times K} \sum_{t=1}^{T} \sum_{k=1}^{K} \left\| \hat{p}_t^k - p_t^k \right\|_2$$
where $p_t^k$ denotes the ground truth position of the $k$-th joint in frame $t$ and $\hat{p}_t^k$ denotes the predicted one. We adopt the L2 loss function for the 3D joint positions and the length of each bone in the human body to train the fine-tuning network. The loss function is:

$$\mathcal{L}_{ft} = \frac{1}{T \times K} \sum_{t=1}^{T} \sum_{k=1}^{K} \left\| \hat{p}_t^k - p_t^k \right\|_2 + \frac{1}{T \times M} \sum_{t=1}^{T} \sum_{m=1}^{M} \left\| \hat{b}_t^m - b_t^m \right\|_2$$

where $b_t^m$ denotes the ground truth length of the $m$-th bone in frame $t$ and $\hat{b}_t^m$ denotes the predicted one; $p_t^k$ and $\hat{p}_t^k$ are the same as in formula (4). By correcting the length of the bones in each frame of the predicted sequence, the fine-tuning network makes the reconstructed sequence closer to the actual value.
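The two loss terms can be sketched as follows (a numpy illustration; the bone list is a hypothetical skeleton fragment, since the paper does not enumerate the bone pairs):

```python
import numpy as np

def joint_loss(pred, gt):
    """Mean L2 distance over all frames and joints.
    pred, gt: (T, K, 3) arrays of 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def bone_loss(pred, gt, bones):
    """Mean bone-length error over all frames and bones.
    bones: list of (parent, child) joint-index pairs."""
    errs = []
    for i, j in bones:
        lp = np.linalg.norm(pred[:, i] - pred[:, j], axis=-1)  # predicted lengths
        lg = np.linalg.norm(gt[:, i] - gt[:, j], axis=-1)      # ground-truth lengths
        errs.append(np.abs(lp - lg).mean())
    return float(np.mean(errs))

T, K = 25, 22                                # hypothetical sizes
gt = np.random.randn(T, K, 3)
pred = gt + 0.01 * np.random.randn(T, K, 3)
bones = [(0, 1), (1, 2), (2, 3)]             # hypothetical bone pairs
total = joint_loss(pred, gt) + bone_loss(pred, gt, bones)
```

The bone term depends only on pairwise joint distances, so it penalizes skeletal deformation even when each joint individually is close to its target.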

5. Experiments
We use the Human3.6m (Ionescu et al., 2013) and CMU-MoCap datasets to validate our framework. The joint data of both datasets are represented by exponential maps; in this work, we convert them to a 3D coordinate representation. Furthermore, we show quantitative results for both short-term and long-term human motion prediction of joint positions using the mean per-joint position error (MPJPE).

5.1. Datasets

5.1.1. CMU-MoCap
CMU-MoCap has 5 main categories of motion data. In line with previous work (Mao et al., 2019; Dang et al., 2021), we select eight actions to validate our framework: "basketball", "basketball signal", "directing traffic", "jumping", "running", "soccer", "walking", and "washing window". Each motion sequence contains 38 joints (including repeated joints), and we preserve 25 valuable joints. The division of the training set and test set also remains the same as in Mao et al. (2019).

5.1.2. Human3.6m
Human3.6m has 15 different classes of motion performed by 7 actors. Each motion contains 32 joints, and we preserve 22 joints. To be consistent with Li et al. (2020b), we train the model on 6 subjects and test it on the 5th subject. To be consistent with previous work (Dang et al., 2021), we use S1, S6, S7, S8, and S9 for training and use S5 and S11 for testing and validation, respectively.

5.2. Metrics
In this paper, we train and test on the 3D coordinate representation of the human pose and report results in 3D coordinates. Define the prediction sample as $\hat{X}$ and the corresponding ground truth as $X$. We use the mean per-joint position error (MPJPE) as the evaluation metric for 3D error:

$$\mathrm{MPJPE} = \frac{1}{T \times N} \sum_{t=1}^{T} \sum_{n=1}^{N} \left\| \hat{p}_t^n - p_t^n \right\|_2$$

where $p_t^n$ represents the $n$-th ground truth joint position in the $t$-th frame and $\hat{p}_t^n$ denotes the predicted one.
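MPJPE reduces to an average Euclidean distance, which can be sketched as (a numpy illustration with a contrived 3-4-5 offset):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: the Euclidean distance between
    predicted and ground-truth joints, averaged over frames and joints.
    pred, gt: (T, N, 3) arrays in the same units (typically millimetres)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((25, 22, 3))
pred = np.zeros((25, 22, 3))
pred[..., 0] = 3.0                 # every joint offset by (3, 4, 0),
pred[..., 1] = 4.0                 # i.e. a distance of 5 per joint
assert mpjpe(pred, gt) == 5.0
```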

5.3. Model configuration
Different cross-validation methods, such as k-fold cross-validation and the jack-knife test, have been generally used to train models (Arif et al., 2021; Ge et al., 2021, 2022a; Sikander et al., 2022). We train our proposed model using 10-fold cross-validation. Our network predicts the human pose of 25 future frames by observing the joint positions of the past 10 frames. Each SAGCB in the prediction model contains 6 SAGCLs. After testing, we cascade 6 CT-GCLs in the fine-tuning model. We utilize Adam as the optimizer. The learning rate is initialized to 0.005 with a 0.96 decay every epoch. Both the prediction model and the fine-tuning model are trained for 50 epochs, and the batch size is set to 32. We implement our network on a GeForce RTX 2080 Ti GPU using PyTorch.
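The stated schedule (initial learning rate 0.005 with a 0.96 decay per epoch) corresponds to a simple exponential decay, sketched below:

```python
# Exponential learning-rate decay as described above: lr_e = 0.005 * 0.96**e.
# (In PyTorch this corresponds to torch.optim.lr_scheduler.ExponentialLR
# with gamma=0.96; shown here as plain Python for clarity.)
def lr_at(epoch, base=0.005, decay=0.96):
    return base * decay ** epoch

rates = [lr_at(e) for e in range(50)]   # the 50 training epochs
assert rates[0] == 0.005
```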

5.4. Comparison to state-of-the-art methods
We validate our model on the Human3.6m and CMU-MoCap datasets and present detailed results. We compare our method with DMGNN (Li et al., 2020b), LTD (Mao et al., 2019), MSR (Dang et al., 2021), and ST-DGCN. DMGNN uses GCN to extract features at multiple scales of the human body and a graph-based GRU for decoding. Applying DCT, LTD uses GCN for prediction in the frequency domain. MSR improves on LTD by taking the multi-scale structure of the human body into account.
5.4.1. Human3.6m

As seen in Table 1, we compare the methods mentioned above on short-term prediction (within 400 ms) on Human3.6m. The results show that our method outperforms previous methods in short-term prediction. For example, our method has a significant advantage on the actions "greeting", "posing", and "sit down". The error of our method is greatly reduced in the short-term prediction of many actions such as "walkingdog", "greeting", and "discussion", where its performance is more than 10% better than the best compared method at 80 ms. Table 2 shows the comparisons for long-term prediction (between 400 and 1,000 ms). In most cases, our results are better than those of the compared methods. For example, the performance of our method on the motions "discussion" and "posing" is about 3% better than the best compared method at 1,000 ms, and on "walkingtogether" it is more than 5% better, which shows that our method also performs well at the longest prediction horizon. According to the average errors for short-term and long-term prediction, our method outperforms the compared methods by a large margin. In Tables 1-6, the best results are highlighted in bold.

5.4.2. CMU-MoCap

Table 3 shows the comparison of average values on CMU-MoCap. Our method significantly outperforms the comparison methods in both short-term and long-term prediction. The error of our method is reduced by nearly 10% compared with ST-DGCN for 1,000 ms prediction.

5.5. Ablation study
To further analyze our model, we perform the following ablation studies on Human3.6m to evaluate the impact of each module.
As shown in Table 4, we test the performance of the prediction model alone against the ground truth to evaluate the effect of the fine-tuning model. With the prediction model alone, both long-term and short-term predictions degrade: the average prediction error rises from 53.71 to 55.08. The experiments show that the fine-tuning module adjusts the initial prediction through temporal dependence and the bone constraint, making the predicted motion sequence closer to the actual value. We also test the impact of several key modules in the framework, namely SAB, CMM, and the bone length loss function. The average error after ablating these modules rises to 54.31, 54.13, and 54.53, respectively.
Our results show that the bone length loss function has the most significant impact on the model, which verifies the problem of GCN-based methods in predicting deformed human bodies. Our method uses SAB and the bone constraints to strengthen the extraction of skeletal information by the GCN layers, making our results better than those of current methods. The CMM module also plays a positive role in the fine-tuning model by preventing inaccurate future information from corrupting the aggregation of temporal information in T-GCN. As shown in Table 5, keeping the output of the prediction model constant, we conduct ablations on the number of cascaded layers m, tested over the range 4-7, in the fine-tuning model. The results show that the fine-tuning model performs best when m = 6. Table 6 shows the comparison of different numbers of SAGCLs in a SAGCB. We regard the prediction model and the fine-tuning model as two independent modules and consider only the results of the prediction model in this experiment. The results show that five SAGCLs achieve the most accurate results at 80 ms. However, in the long-term comparison, six SAGCLs have more of an advantage. Considering the average error, we use six SAGCLs to obtain the most accurate initial prediction.

6. Discussion
In the previous section, we compared our method with the state of the art. Using SAB and the bone length constraint, our method has a strong ability to capture spatial interaction relationships. Thus, it has a significant advantage on motions with large movements, such as the results for the actions "walkingdog" and "sit down" in Tables 1, 2. As shown in Table 3, by using the fine-tuning model to adjust the initial prediction in the time domain, our model captures time dependence strongly, and its performance exceeds the latest method by more than 10% at 1,000 ms. As shown in Table 4, CMM also enhances this performance.
On the other hand, our model also has some shortcomings. It is based on a two-stage training method: we need to pre-train a prediction model and then train a fine-tuning model, which undoubtedly increases training time and complexity. What is more, our fine-tuning model is largely limited by the prediction model, which means that its correction capacity is limited. Our long-term prediction is still imperfect; as shown in Table 2, there is considerable room for improvement in long-term predictions.

7. Conclusion
We propose a two-stage forecasting framework, including a prediction model and a fine-tuning model. In the prediction model, we first transform the observed pose data into the frequency domain using DCT. Before the transformed pose data flow through the GCN, the interaction information between joints is enhanced by the spatial attention mechanism. Then we use IDCT to restore the generated future poses to the time domain. In the second stage, we add the bone length error to the loss function to train the fine-tuning model better, which makes the corrected pose sequence closer to natural human motion. What is more, we use CMM to improve the T-GCN in the fine-tuning model, making the regenerated motion sequences more coherent on the timeline. Extensive experiments show that the fine-tuning model plays a positive role in improving the results of the prediction model. Our work outperforms previous work on commonly used datasets.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: http://mocap.cs.cmu.edu/.

Author contributions
LZ proposed the two-stage training method for human motion prediction and wrote the manuscript. ZH conducted the literature survey and provided method guidance. HW analyzed the experimental data and revised the manuscript.

Funding
This work was supported by the National Natural Science Foundation of China under grant 61971290 and the Shenzhen Stability Support General Project (Category A) 20200826104014001.