Full-body pose reconstruction and correction in virtual reality for rehabilitation training

Existing statistical data indicate that a growing number of people require rehabilitation to restore compromised physical mobility. During rehabilitation, physical therapists evaluate and guide patients' movements, helping them recover more effectively and preventing secondary injuries. However, limited mobility and the high cost of rehabilitation training prevent some patients from accessing rehabilitation in a timely manner. Virtual reality-based rehabilitation training could alleviate these issues. However, prevalent pose reconstruction algorithms in rehabilitation rely primarily on images, which limits their applicability to virtual reality. Furthermore, existing pose evaluation and correction methods in rehabilitation focus on providing clinical metrics for doctors and fail to offer patients efficient movement guidance. In this paper, a virtual reality-based rehabilitation training method is proposed. Sparse motion signals from virtual reality devices, specifically head-mounted displays and hand controllers, are used to reconstruct full-body poses. Subsequently, the reconstructed poses and the standard poses are fed into a natural language processing model, which contrasts the two poses and provides pose correction guidance in the form of natural language. Quantitative and qualitative results indicate that the proposed method can accurately reconstruct full-body poses from sparse motion signals in real time. By referencing standard poses, the model generates professional motion correction guidance text. This approach facilitates virtual reality-based rehabilitation training, reducing its cost and enhancing the efficiency of self-rehabilitation training.


Introduction
Existing statistical data indicate that an increasing number of people experience mobility impairments due to accidents, illness, or aging, and therefore require rehabilitation (Postolache et al., 2020). Rehabilitation training encompasses a series of intervention exercises aimed at aiding the recovery of compromised motor functions. A pivotal aspect of this process involves tailored movement exercises conducted by a doctor or physical therapist. Early and intensive rehabilitation training proves more efficacious in facilitating the recovery of patients' motor abilities (Postolache et al., 2020). However, the need for patients to attend hospitals or rehabilitation centers for training presents additional challenges for those already grappling with mobility difficulties. Furthermore, the high cost of rehabilitation training is a financial impediment for some patients. In this context, virtual reality-based rehabilitation methods have emerged. These methods offer patients a more convenient and economical rehabilitation option through personalized virtual reality devices. By offering real-time user pose reconstruction and immersive interaction, virtual reality technology can provide patients with increased sensory stimulation and a more immersive environment during rehabilitation training (Adamovich et al., 2009). Existing research has shown that, compared to conventional physical therapy, virtual reality-based rehabilitation training is more effective in promoting gait recovery in patients with Parkinson's disease (Feng et al., 2019).
However, common virtual reality devices can only accurately reconstruct the poses of the user's head and hands through head-mounted displays and handheld controllers. This is insufficient for full-body rehabilitation training. Therefore, virtual reality-based rehabilitation methods often require additional wearable body measurement sensors to capture patients' movements (Huang et al., 2018; Jiang Y. et al., 2022), such as motion sensors for the legs and waist, gait detection devices, and more. Unfortunately, for patients, this not only represents an additional expense, but wearing extra sensors may also lead to physical discomfort. Moreover, patients' unprofessional handling of these sensors can result in tracking inaccuracies and reduce the effectiveness of the rehabilitation training. Hence, studies that leverage the most prevalent virtual reality devices, using the sparse motion signals from the head and hands to reconstruct full-body poses, offer an effective solution for virtual reality-based rehabilitation training.
In addition, regardless of the form of rehabilitation, the quality assessment and precise guidance of patients' recovery movements are crucial (Qiu et al., 2022). They directly influence the effectiveness of the patient's recovery. When patients participate in rehabilitation training at hospitals or rehabilitation facilities, doctors can assist by correcting their inaccurate movements, ensuring that their movements fall within the normal range to achieve the desired rehabilitation effects. This correction helps prevent secondary injuries resulting from incorrect movements. However, in virtual reality-based rehabilitation, there is currently no universally recognized solution for reasonably evaluating the quality of patients' rehabilitation movements (Qiu et al., 2022). Furthermore, there is no method that authentically simulates a doctor's supervision to help patients correct their rehabilitation movements. Therefore, proposing effective methods for correcting rehabilitation movements and providing appropriate movement guidance is crucial to advancing research in virtual reality-based rehabilitation.
To address the aforementioned issues, we propose a virtual reality-based rehabilitation method. As shown in Figure 1, this method uses commonly available virtual reality hardware to reconstruct the full-body poses of patients. Then, a pose correction module based on a natural language model assesses patients' movements and generates specific movement correction guidance text by referencing standard movements. Specifically, to meet the requirements of accuracy, real-time performance, and smoothness in full-body pose reconstruction, a deep learning-based model is introduced, comprising a multi-scale temporal feature switch module and stacked MLP blocks. The multi-scale temporal feature switch module expands the model's temporal receptive field, improving the accuracy and smoothness of full-body pose reconstruction while keeping the model lightweight and real-time capable. Subsequently, the reconstructed poses and the reference standard poses are input into the pose correction module, which assesses the reconstructed poses and outputs movement correction guidance text by means of a natural language model. Comprehensive experimental results demonstrate that the proposed method can provide more accurate full-body pose reconstruction and more intelligent movement guidance for virtual reality-based rehabilitation training.

Related works

Full-body pose reconstruction from sparse motion signals
In recent years, reconstructing full-body poses from the sparse motion signals of virtual reality devices, specifically head-mounted displays and handheld controllers, has become a focal point of research on virtual reality and the metaverse. Ahuja et al. introduced a convolutional neural network to extract features from sparse motion signals and used a K-nearest neighbors (KNN)-based method with interpolation to reconstruct full-body poses from a limited motion database (Ahuja et al., 2021). However, this method relies heavily on the motion database and generalizes poorly. In subsequent studies, deep learning models such as variational autoencoders (Pavlakos et al., 2019), long short-term memory networks (Yu et al., 2019), and transformers (Jiang J. et al., 2022; Luo et al., 2022; Zhang X. et al., 2023) have been applied to extract motion features from sparse motion signals, significantly improving the accuracy of full-body pose reconstruction. More recently, a diffusion model built on a multilayer perceptron (MLP) has been employed to further optimize the reconstructed motion sequences, effectively alleviating joint jitter (Du et al., 2023). However, the diffusion model substantially increases the computational demands and inference time.

Pose evaluation for healthcare application
With the development of electronic information and computing technology, health applications, particularly the evaluation of human body poses during rehabilitation training, have recently been explored. Martínez et al. utilized depth cameras to capture ground-truth human rehabilitation postures and quantitatively evaluated the accuracy of commonly used pose reconstruction algorithms (Martínez, 2019) in reconstructing rehabilitation postures (Hernández et al., 2021). Kidziński et al. (2020) (Bruce et al., 2021). However, the quantitative evaluation metrics of these methods are intended for proficient medical professionals performing clinical evaluations, which limits their usefulness for patients who want to understand the status of their rehabilitation training and make corrections.
To address the aforementioned issues, Qiu et al. (2022) devised a pose matching network that aligns and corrects the trainers' poses against the standard poses, providing trainers with visualized movement guidance through Class Activation Maps (CAM). Although this line of research has a certain foundation, the majority of these methods rely heavily on computer vision and are impractical for virtual reality-based rehabilitation. Moreover, there remains a dearth of intuitive and effective guidance for patients in evaluating their poses, such as the guidance provided by medical professionals.

3D human poses and natural language models
In recent years, Transformer-based natural language processing models have achieved remarkable success in various fields. This section introduces datasets that combine human poses with natural language and the applications built on them. The AMASS dataset (Mahmood et al., 2019) collects motion data for numerous 3D human poses in the form of SMPL (Loper et al., 2023). Building on AMASS, BABEL (Punnakkal et al., 2021) and HumanML3D (Guo et al., 2022) provide free-form textual descriptions for its sequence data. These datasets focus on describing entire action sequences rather than the semantic information of each single-frame pose. Consequently, they are more suitable for generating action sequences (Zhang J. et al., 2023) or describing motions from videos. To address the gap in semantic descriptions of independent human poses, PoseScript (Delmas et al., 2022) provides descriptions for single-frame human poses from some subsets of the AMASS dataset. In further research, FixMyPose (Kim et al., 2021) and PoseFix (Delmas et al., 2023) connect two different poses and generate textual information for pose correction. Unlike FixMyPose, which generates textual annotations from rendered 2D images, PoseFix generates text explanations directly from the 3D human pose data. This makes it a more suitable and more capable solution for virtual reality-based rehabilitation training.

Methods

Overview
Reconstructing full-body movements from sparse motion inputs is quite challenging. Sparse motion signals from the upper body cannot effectively constrain the movements of the lower body. As a result, the reconstruction of lower-body poses may inevitably exhibit anomalies such as joint jitter and floor penetration, significantly affecting the user's experience in virtual reality. In a previous study (Du et al., 2023), diffusion models based on temporally aware one-dimensional convolution were employed to reduce joint jitter, noticeably enhancing the quality and fluency of full-body posture reconstruction. However, the diffusion model requires multiple inference steps, leading to longer inference times that do not meet the real-time requirements of virtual reality applications. Moreover, when only its MLP backbone network is used, joint jitter remains unresolved.
Therefore, as shown in Figure 2, a full-body pose reconstruction network based on a multi-scale temporal switch module is proposed. The sparse motion signals are input to an MLP

Full-body pose reconstruct module

Data preparation
To reconstruct full-body poses, sparse motion signals are acquired from the Inertial Measurement Unit (IMU) devices on the virtual reality headset and handheld controllers. Each signal includes global positional information $p \in \mathbb{R}^{1\times 3}$ and rotational information $\theta \in \mathbb{R}^{1\times 3}$ about the three axes. For a more refined reconstruction, the pose at time $t$ is predicted from the signals observed over a time interval $T$ preceding $t$, which are aggregated and jointly fed into the network. The full-body joint pose $U^{\text{joints}}_{\text{full}}$ is therefore obtained by applying a mapping function to the set of sparse inputs $\{p_i, \theta_i\}_{1:T}$, as shown in Equation 1, where $n$ represents the number of sparse inputs, $h$ is the number of full-body joints, and $T$ is the number of consecutive motion frames observed from the past.
To enable the model to learn features from the sparse motion signals comprehensively, the following preprocessing steps are applied. First, the linear velocity $v \in \mathbb{R}^{1\times 3}$ is computed with the backward finite difference method, as shown in Equation 2. Next, the angular velocity, a $1\times 6$ vector at each time $t$, is defined from the orientation matrices $R$ of the sparse inputs (Jiang J. et al., 2022), as shown in Equation 3. These matrices are derived from the $\theta \in \mathbb{R}^{1\times 3}$ representation, which is converted to a rotation matrix $R \in \mathbb{R}^{3\times 3}$ following previous studies (Zhou et al., 2019; Jiang J. et al., 2022). The last row of $R$ is then discarded to yield the 6D rotation representation $w_t \in \mathbb{R}^{1\times 6}$. Consequently, each input at time frame $t_i$ comprises four vectors: the position $p_i$, the linear velocity $v_i$, the angular velocity, and the 6D rotation $w_i$. This input feature is structured as in Equation 4. All independent signals $x_t$ within the time interval $T$ are then concatenated along the temporal dimension to form the input signal $X \in \mathbb{R}^{B\times T\times F}$, as shown in Equation 5, where $B$ represents the batch size, $T$ is the length of the temporal sequence, and $F$ denotes the feature dimension. With three tracked devices (the headset and two controllers), each contributing 3 + 3 + 6 + 6 = 18 values, the feature dimension $F$ of the input tensor $X$ amounts to 54.
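The preprocessing described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, the angular velocity is assumed to take the form $R_{t-1}^{-1} R_t$ (the form used in earlier sparse-tracking work), and the first frame is dropped because it has no predecessor for the backward difference.

```python
import numpy as np

def axis_angle_to_matrix(theta):
    """Rodrigues' formula: axis-angle vectors (..., 3) -> rotation matrices (..., 3, 3)."""
    theta = np.asarray(theta, dtype=float)
    angle = np.linalg.norm(theta, axis=-1, keepdims=True)
    axis = theta / np.maximum(angle, 1e-8)
    x, y, z = axis[..., 0], axis[..., 1], axis[..., 2]
    zero = np.zeros_like(x)
    # Skew-symmetric cross-product matrix of the rotation axis.
    K = np.stack([zero, -z, y, z, zero, -x, -y, x, zero], axis=-1)
    K = K.reshape(theta.shape[:-1] + (3, 3))
    s = np.sin(angle)[..., None]
    c = np.cos(angle)[..., None]
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def rot6d(R):
    """Discard the last row of R and flatten the remaining two rows to 6 values."""
    return R[..., :2, :].reshape(R.shape[:-2] + (6,))

def build_features(p, theta):
    """p, theta: (T, n, 3) global positions and axis-angle rotations of n devices.

    Returns an array of shape (T - 1, n * 18); with n = 3 the width is 54.
    """
    R = axis_angle_to_matrix(theta)                # (T, n, 3, 3)
    v = p[1:] - p[:-1]                             # backward finite difference
    # Assumed angular velocity: R_{t-1}^{-1} R_t (inverse of a rotation = transpose).
    R_rel = np.swapaxes(R[:-1], -1, -2) @ R[1:]
    omega = rot6d(R_rel)                           # (T - 1, n, 6)
    w = rot6d(R[1:])                               # (T - 1, n, 6)
    feats = np.concatenate([p[1:], v, omega, w], axis=-1)   # (T - 1, n, 18)
    return feats.reshape(feats.shape[0], -1)
```

Stacking such windows over a batch gives the tensor $X$ of shape $B \times T \times 54$ described above.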

Multi-scale temporal switch module
In previous research (Du et al., 2023), networks equipped with one-dimensional temporal convolutions were employed to enhance the model's temporal awareness and thereby improve the reconstruction of full-body poses. Additionally, the generative ability of the diffusion model was used to further optimize the reconstructed pose sequences, significantly reducing joint jitter. However, even with the Denoising Diffusion Implicit Model (DDIM) technique (Ho et al., 2020), the diffusion model still requires five repeated inference steps to obtain the final predictions, which fails to meet the real-time requirements of virtual reality-based rehabilitation training.
To address these issues, a multi-scale temporal switch module based on two-dimensional time sequences is devised.This module comprises multiple branches at different temporal scales, aiding the model in capturing subtle temporal features within the sparse motion signals.
Initially, the preprocessed sparse motion signals $X$ are fed into a linear layer for preliminary feature embedding, as shown in Equation 6, where LinearEmbedding is a linear layer with an input dimension of 54 and an output dimension of 256. As shown in Figure 2, the module comprises $K$ branches representing different temporal switch scales. For each time slice $T = t$ along the temporal dimension, the feature $F_t$ is partitioned into three segments along the feature dimension, $f_t^1$, $f_t^2$, and $f_t^3$. As shown in Equation 7, $f_t^1$ and $f_t^3$ are the first $N$ features and the last $N$ features along the feature dimension, respectively. Inspired by a previous study (Zheng et al., 2022), we apply $K$ forward feature exchange modules along the temporal direction to each feature slice $F_t$. In the branch where $K = k$, we exchange the features $f_t^1$ and $f_t^3$ in the feature slice $F_t$ with the corresponding features $f_{t+k}^1$ and $f_{t+k}^3$ in the feature slice $F_{t+k}$, where $F_{t+k}$ is the feature slice at a temporal distance of $k$ frames, as shown in Equation 8. After the feature slices have been exchanged for all time frames $T = t$, we concatenate all the time slices along the temporal dimension to obtain the output $F_k$ of the feature exchange module $K = k$, as in Equation 9. Finally, the features from the different branches are input into an MLP layer for feature fusion, as shown in Equation 10, where $\odot$ represents the concatenation operation, $F_{\text{switched}} \in \mathbb{R}^{B\times T\times F}$, and OutPutLinear is a linear layer with an input dimension of $K \times F$ and an output dimension of $F$, followed by the SiLU activation function.
Unlike the previous study, our approach employs only a forward switch along the temporal direction and refrains from a bidirectional switch. The rationale is that a bidirectional switch requires more feature switch operations for only a limited improvement. Given the constrained computational capacity of virtual reality devices and the stringent demands of real-time processing, we opt for unidirectional propagation.
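One possible reading of the forward switch and the multi-branch fusion is sketched below in NumPy. Several points are assumptions made only for illustration: the "exchange" is treated as a one-way copy in which frame $t$ takes its first $N$ and last $N$ channels from frame $t + k$, the last $k$ frames of the window are left unchanged, and the fusion weights `W` and `b` are placeholders for the OutPutLinear layer. The function names are hypothetical.

```python
import numpy as np

def temporal_switch(F, k, n):
    """Forward switch at scale k >= 1 on features F of shape (B, T, C).

    Frame t receives the first n and last n channels of frame t + k;
    the middle channels, and the last k frames, are kept as they are.
    """
    out = F.copy()
    if 1 <= k < F.shape[1]:
        out[:, :-k, :n] = F[:, k:, :n]
        out[:, :-k, -n:] = F[:, k:, -n:]
    return out

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def multi_scale_switch(F, scales, n, W, b):
    """Run one switch branch per scale, concatenate, and fuse back to C channels.

    W has shape (len(scales) * C, C) and b has shape (C,).
    """
    branches = [temporal_switch(F, k, n) for k in scales]
    fused = np.concatenate(branches, axis=-1)      # (B, T, K * C)
    return silu(fused @ W + b)                     # (B, T, C)
```

Each branch is a pure indexing operation with no multiplications, which matches the motivation of enlarging the temporal receptive field without a large compute budget; only the final fusion layer carries learnable weights.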

MLP-based blocks
Recent research (Du et al., 2023; Guo et al., 2023) has demonstrated the potential of MLP-based networks in full-body pose reconstruction tasks. MLP-based networks can effectively learn complex non-linear mappings of input features, facilitating efficient feature learning and data representation (Guo et al., 2023). Additionally, MLP networks are lightweight, meeting the real-time requirements of our task. Considering the demands for both real-time processing and accuracy of the reconstructed poses, our model employs only a few commonly used and effective deep learning modules: fully connected layers, the SiLU activation function, one-dimensional convolution along the temporal dimension with a kernel size of 1, and layer normalization. Specifically, the one-dimensional convolution layer aggregates the temporal features of the entire pose sequence, while the other modules operate on the feature dimension to help the network alleviate vanishing gradients and overfitting. The structure of the MLP-based blocks is shown in Figure 3. To better extract features from sparse motion signals, the MLP-based blocks are stacked in M layers, as in the study by Du et al. (2023).
Because the proposed temporal switch module alters the spatial information of the original motion signals, the original feature $F$ is preserved and fed into the aforementioned MLP-based blocks for feature extraction, as in Equation 11. Finally, the output feature $F_0$ and the temporal switched feature $F_{\text{switched}}$ are aggregated and input into the output MLP layer to reconstruct the poses of 22 joints (excluding the joints of the palms) in the SMPL human pose model, achieving the reconstruction from sparse motion signals to full-body poses, as shown in Equation 12, where $\oplus$ represents the tensor addition operation.
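To make the data flow concrete, the following NumPy sketch stacks blocks built from the listed components and then adds the switched feature before the output layer. It is an illustration under assumptions, not the exact layout of Figure 3: the ordering of the layers inside the block, the residual connections, and the output width are guesses, and the class and function names are hypothetical. The kernel-size-1 convolution is written as a linear map that mixes the $T$ frames independently for each feature channel.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    """Normalize over the feature dimension (no learnable scale or shift here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class MLPBlock:
    """Fully connected layer over features, then a temporal mix over frames."""

    def __init__(self, num_frames, dim, rng):
        self.w_fc = rng.normal(0.0, 0.02, (dim, dim))
        self.b_fc = np.zeros(dim)
        # A 1D convolution with kernel size 1 whose channels are the T frames
        # is equivalent to a (T, T) matrix applied along the temporal axis.
        self.w_time = rng.normal(0.0, 0.02, (num_frames, num_frames))
        self.b_time = np.zeros(num_frames)

    def __call__(self, x):                          # x: (B, T, C)
        x = x + layer_norm(silu(x @ self.w_fc + self.b_fc))
        mixed = np.einsum("st,btc->bsc", self.w_time, x)
        mixed = mixed + self.b_time[None, :, None]
        return x + layer_norm(silu(mixed))

def reconstruct(feat, feat_switched, blocks, w_out, b_out):
    """Pass the original feature through the stacked blocks, add the switched
    feature (tensor addition), and project to the pose parameters."""
    f0 = feat
    for block in blocks:
        f0 = block(f0)
    return (f0 + feat_switched) @ w_out + b_out
```

With 22 joints and a 6D rotation per joint, `w_out` would have 132 output columns; that width is an assumption of this sketch, since the output parameterization is not stated in this section.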

NLP-based pose correction module
In this section, the state-of-the-art pose evaluation method PoseFix (Delmas et al., 2023) is employed to compare the reconstructed full-body poses with the standard poses and to generate professional motion correction guidance text. We briefly describe how the reconstructed pose of the patient, Pose A, is matched to the target pose, Pose B, and modeled as correction guidance text. As shown in Figure 4, the rotation angles of the root joint of Pose A are aligned with the corresponding rotation angles of Pose B. Subsequently, a Transformer-based autoencoder (Kingma and Welling, 2013) is used to extract independent 32-dimensional embedded features from Pose A and Pose B. Notably, Pose A and Pose B share the weights of the autoencoder. Next, the TIRG network (Vo et al., 2019), a widely applied module for compositional learning, is used to merge latent features from the embedded features of Pose A and Pose B.

FIGURE 3
The components of the MLP-based blocks.

FIGURE 4
The structure of the pose correction module, which outputs pose correction guidance text for the users by contrasting the reconstructed pose with the standard reference pose.

The TIRG network (Vo et al., 2019) comprises a gate network consisting of two MLP layers and two learnable weights, and is designed to retain the primary motion features while introducing additional improvements through residual connections, as shown in Equation 13, where $FC_f$ and $FC_g$ are MLP layers whose contributions are balanced by the learnable parameters $w_f$ and $w_g$. Finally, the fused features $F_{\text{prompt}}$ are fed into a Transformer-based auto-regressive model, serving as a prompt that guides the natural language processing model in generating motion correction guidance text. During decoding, the input feature $F_{\text{prompt}}$ is concatenated with a vector $F_{\text{caption}}$, composed entirely of ones, which serves as additional positional encoding. This combined input is then fed into the Transformer model. Leveraging the attention mechanism of the Transformer, the prompt is decoded into a probability distribution over text embeddings, and the first text result $T_1$ is obtained through the softmax function. Subsequently, $T_1$ is integrated into $F_{\text{caption}}$, concatenated again with $F_{\text{prompt}}$, and fed into the Transformer model to obtain the second text result $T_2$ with the highest probability. This iterative greedy decoding continues until the entire sequence is decoded.
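The gated residual fusion and the greedy decoding loop can be sketched as follows. This is a simplified illustration, not PoseFix's code: `tirg_fuse` follows the general gate-plus-residual form of TIRG with two-layer MLPs, the weight values are placeholders, and `step_fn` is a hypothetical stand-in for the Transformer decoder that returns next-token logits given the prompt and the tokens decoded so far.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(z, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU in between."""
    return np.maximum(z @ w1 + b1, 0.0) @ w2 + b2

def tirg_fuse(emb_a, emb_b, fc_g, fc_f, w_g, w_f):
    """Fuse two pose embeddings with a gate branch and a residual branch.

    The gate keeps (a scaled copy of) the features of Pose A, while the
    residual branch adds a correction computed from both embeddings.
    fc_g and fc_f are (w1, b1, w2, b2) tuples for the two MLPs.
    """
    z = np.concatenate([emb_a, emb_b], axis=-1)
    gated = sigmoid(mlp(z, *fc_g)) * emb_a
    residual = mlp(z, *fc_f)
    return w_g * gated + w_f * residual

def greedy_decode(step_fn, prompt, eos_id, max_len=32):
    """Iterative greedy decoding: keep the most probable token at each step."""
    tokens = []
    for _ in range(max_len):
        logits = step_fn(prompt, tokens)
        probs = np.exp(logits - logits.max())       # softmax over the vocabulary
        probs /= probs.sum()
        next_token = int(np.argmax(probs))
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```

Greedy decoding takes a single pass per generated token, which keeps latency predictable; beam search or sampling would be alternatives, but they are not what the section describes.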

Experiments

Training details
To train the full-body pose reconstruction model on sparse motion signals, three subsets of the AMASS dataset, namely CMU (Carnegie Mellon University), MPI-HDM05 (Max Planck Institute Human Motion Database 2005; Müller et al., 2007), and BioMotionLab-NTroje (Troje, 2002), are employed for model training and testing. Specifically, we obtain 2,074, 215, and 3,061 motion sequences from these three subsets, covering actions commonly used in virtual reality such as walking, running, jumping, dancing, kicking, tool manipulation, and social behaviors and interpersonal interactions. Out of the 5,350 motion sequences, 536 are randomly selected for model validation, with the remaining 4,814 used for model training. These motion sequences are stored in the format of SMPL model parameters, encompassing 156-dimensional joint motion parameters.
To emulate the hardware configuration of virtual reality devices, we extract the motion parameters of the head joint and the wrists of both hands, input them into the model, and reconstruct the motion parameters of 22 body joints (excluding the joints of the palms). To ensure a fair comparison with previous methods, consistent experimental parameters are employed: the number of stacked MLP blocks M is set to 12, and the feature dimension F is set to 512. Both training and testing were conducted on an NVIDIA 4090 GPU using the PyTorch framework (Paszke et al., 2019).
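As a small illustration of this selection step, the sketch below splits a sequence of 156-dimensional pose parameters into the 22 body joints and the three tracked joints. It assumes the common AMASS layout in which the parameters are axis-angle triples for 52 joints with the body joints first, and the standard SMPL joint order in which the head and the two wrists have indices 15, 20, and 21. The function name is hypothetical, and the global positions and rotations actually fed to the network would additionally require forward kinematics through the body model, which is not shown.

```python
import numpy as np

# Indices of the tracked joints in the standard SMPL body joint order.
HEAD, LEFT_WRIST, RIGHT_WRIST = 15, 20, 21

def split_sparse_and_target(poses):
    """poses: (T, 156) axis-angle parameters, 52 joints x 3 values.

    Returns the local rotations of the three tracked joints, shape (T, 3, 3),
    and the 22 body joints used as the reconstruction target, shape (T, 22, 3).
    """
    body = poses[:, :66].reshape(-1, 22, 3)         # drop the 30 hand joints
    sparse = body[:, [HEAD, LEFT_WRIST, RIGHT_WRIST]]
    return sparse, body
```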
For the pose correction module based on the natural language processing model, we neither modify nor retrain the PoseFix model. Instead, we directly use the publicly available model weights, as experimental results have already demonstrated that this method accurately evaluates the differences between two poses and generates precise correction guidance text. In PoseFix, a pipeline based on PoseScript is employed to compare the distance variations between multiple 3D keypoints for 135 k pairs of different actions. The resulting data are organized in structural order, forming a dataset of 135 k action correction guidance texts, which is used to train the pose correction model. Additionally, the frozen DistilBERT (Sanh et al., 2019) is employed for word embedding.

Evaluation metrics
To validate the effectiveness of the proposed method, the following evaluation metrics are employed to assess the model's performance and compare it with the previous state-of-the-art method (Du et al., 2023). Mean Per Joint Rotation Error (MPJRE; degrees) and Mean Per Joint Position Error (MPJPE; cm) measure the average relative rotation error and position error of each joint, indicating the absolute errors of the model predictions. Mean Per Joint Velocity Error (MPJVE; cm/s) measures the average velocity error of the joint positions, while Jitter (Yi et al., 2022) evaluates the mean jerk (change in acceleration over time; Du et al., 2023) of the joints in global space. These two metrics measure the smoothness of the reconstructed poses, which directly relates to the user's overall experience. Specifically, jitter describes the rate of change of acceleration in joint positions and indicates the degree to which abrupt changes occur in joint positioning. It is therefore valuable for characterizing the dynamic aspects of motion and for evaluating the smoothness of reconstructed poses (Flash and Hogan, 1985). The jitter is calculated as in Equation 14, where $p$ represents the joint position and $t$ denotes the time. Because jerk is the rate of change of acceleration, it corresponds to the third derivative of each joint position with respect to time.
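The position, velocity, and jitter metrics can be approximated with finite differences, as in the NumPy sketch below. It is an illustrative approximation rather than the evaluation code behind Table 1: the function names and the `fps` argument are assumptions, units follow whatever unit the input positions use, and MPJRE is omitted because it requires a rotation distance that is not specified in this section.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; pred and gt have shape (T, J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjve(pred, gt, fps):
    """Mean error of the finite-difference joint velocities."""
    v_pred = (pred[1:] - pred[:-1]) * fps
    v_gt = (gt[1:] - gt[:-1]) * fps
    return np.linalg.norm(v_pred - v_gt, axis=-1).mean()

def jitter(pos, fps):
    """Mean jerk magnitude, using a third-order finite difference of position."""
    jerk = (pos[3:] - 3.0 * pos[2:-1] + 3.0 * pos[1:-2] - pos[:-3]) * fps ** 3
    return np.linalg.norm(jerk, axis=-1).mean()
```

For a joint moving along $x(t) = t^3$ sampled once per time unit, the third difference is constant at 6, so `jitter` returns 6; a quadratic trajectory (constant acceleration) gives 0, which is the behavior expected of a smoothness measure.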

Evaluation results
In Table 1, we present the quantitative comparison between our proposed model and the state-of-the-art method AGRoL (Du et al., 2023). As shown in Table 1, our approach improves on AGRoL's MLP-based backbone across various metrics. Moreover, the phenomenon of joint jitter has been noticeably mitigated. In comparison to AGRoL's diffusion model method, we maintain a comparable prediction accuracy, albeit with less pronounced joint jitter. However, our method requires only 60.4% of the parameters of AGRoL's diffusion model, enhancing its practical applicability.
In Figure 5, we showcase comprehensive applications of virtual reality-based rehabilitation and partial action guidance.As shown in the figure, the reconstructed poses and target poses can be accurately evaluated by the PoseFix network, yielding intuitive, detailed, and precise action guidance.

Conclusion
Current rehabilitation training requires patients, who already face mobility challenges, to visit rehabilitation centers for treatment by physical therapists. This proves to be difficult and costly for patients. To enable patients to undergo precise, efficient, and cost-effective rehabilitation training in the comfort of their homes using their virtual reality devices, this study introduces a novel approach that utilizes sparse motion signals from VR devices, specifically head-mounted displays and hand controllers, to reconstruct full-body poses. Unlike existing methods that focus on clinical metrics for doctors, our method employs a natural language processing model to contrast reconstructed poses with standard poses. This process provides efficient pose correction guidance in the form of natural language, offering a more accessible and personalized approach to movement guidance for patients.
The quantitative and qualitative results demonstrate the effectiveness of the proposed method in real-time reconstruction of accurate full-body poses.By referencing standard poses, the model generates professional motion correction guidance text, facilitating virtual reality-based rehabilitation training.This approach not only reduces the cost of rehabilitation training but also enhances the efficiency of self-rehabilitation training, addressing the challenges faced by patients seeking timely and accessible rehabilitation.

FIGURE 1
The pipeline of the proposed virtual reality-based rehabilitation training method.

FIGURE 2
The detailed structure of the proposed full-body pose reconstruct module and temporal switch module.

FIGURE 5
Partial results of the virtual reality-based rehabilitation training method, indicating that users can correct their poses with guidance from the pose correction text.
introduced a neural network to quantitatively evaluate clinically relevant motion parameters from

TABLE 1
Comparison of our approach with state-of-the-art methods on the subsets of AMASS.