Abstract
Introduction:
With an enormous number of hand images generated over time, leveraging unlabeled images for pose estimation is an emerging yet challenging topic. While some semi-supervised and self-supervised methods have emerged, they are constrained by their reliance on high-quality keypoint detection models or complicated network architectures.
Methods:
We propose a novel self-supervised pretraining strategy for 3D hand mesh regression. Our approach integrates a multi-granularity strategy with pseudo-keypoint alignment in a teacher–student framework, employing self-distillation and masked image modeling for comprehensive representation learning. We pair this with a robust pose estimation baseline, combining a standard vision transformer backbone with a pyramidal mesh alignment feedback head.
Results:
Extensive experiments demonstrate HandMIM’s competitive performance across diverse datasets, notably achieving an 8.00 mm Procrustes alignment vertex-point-error on the challenging HO3Dv2 test set, which features severe hand occlusions, surpassing many specially optimized architectures.
1 Introduction
Image-based 3D hand reconstruction has widespread applications in the smart film industry, such as motion capture, special effects synthesis, virtual production, post-production animation modification, and interactive film production. Meanwhile, 3D hand mesh estimation from monocular RGB images has drawn great attention in computer vision research, driven by its potential in applications such as action recognition, digital human modeling, simultaneous localization and mapping (SLAM), and AR/VR. However, training a high-quality hand estimation model is challenging due to complex backgrounds and severe self-occlusion. Furthermore, collecting high-quality training pairs is laborious and costly, especially in the format of 3D meshes. Only a limited amount of image-mesh training data is available, making it difficult to train effective and generalizable models. Weakly supervised methods that obtain 2D keypoints, noisy depth maps, or kinematic priors from off-the-shelf models have been proposed to improve the accuracy of supervised models. However, these methods rely heavily on fine-grained keypoint detectors, such as MediaPipe, which struggle with the wide variety of in-the-wild images encountered in practice and may produce many noisy labels.
Self-supervised learning is a promising technique for addressing this problem by exploiting the large quantity of unlabeled image data generated over time. Masked image modeling (MIM) pretraining has emerged as a new paradigm in self-supervised learning based on the vision transformer architecture, which divides images into individual patches. In MIM pretraining, we randomly mask a specified ratio of image patches and set the self-supervised learning target to reconstruct the masked patches. Previous works have demonstrated that MIM-based methods learn better local and global representations than conventional self-supervised methods based on contrastive learning, which focus on high-level feature representations suitable for image classification. This is especially critical for low-level, fine-grained regression tasks such as 3D hand estimation, where capturing the equivalence of geometric transformations is essential. Reconstructing masked patches pushes the model to understand the spatial relationships within an image at a finer granularity, making it more adept at handling detailed structures like the human hand.
However, most existing self-supervised work focuses on recognition tasks and aims to learn features appropriate for high-level image classification tasks. In low-level regression tasks, mainstream methods cannot capture the equivalence of geometric transformation, a critical characteristic of human/hand pose or mesh regression. Therefore, most state-of-the-art MIM self-supervised pretraining approaches must be adapted for regression tasks such as 3D hand estimation. Figure 1 exhibits the difference between our MIM approach and the previous ones. MIM’s extension to regression tasks like 3D hand mesh estimation offers significant advantages. It leverages the strengths of MIM—such as detailed feature capture and understanding of spatial relationships—while introducing mechanisms specifically tailored for the challenges of regression tasks. We confirmed the abovementioned findings through experiments in Section 4.3.
FIGURE 1
In this paper, we make the first attempt to apply the masked image modeling (MIM) self-supervised technique to 3D hand estimation tasks. We propose HandMIM, a unified, multi-granularity self-supervised pretraining strategy optimized for pose regression tasks. During pretraining, we use a teacher-student self-distillation approach, where input hand images are augmented into two views that vary in size, rotation, color, and other factors. The student network is then tasked with reconstructing masked tokens under the guidance of the teacher network. To ensure that the class tokens carry pose-aware semantics, we introduce the pseudo-keypoint alignment operation in the latent feature space. This operation undoes the geometric transformation in the format of 2D pseudo-keypoints, enabling the network to learn pose equivalence between cross-view tokens. To facilitate high-level and low-level recognition, we adopt token-level recovery between parallel-view masked tokens and pixel-level reconstruction between masked input images and recovered images, respectively. Note that the token recovery is conducted in the same latent space as the pose-aware alignment. We sketch our method in Figure 2 and compare it with related self-supervised works [
FIGURE 2

Comparison with other self-supervised frameworks for hand shape/pose estimation. Left: TempCLR [
In summary, the main contributions of our work are fourfold:
1. We adopt a new self-distillation method for 3D hand mesh estimation, which markedly improves the efficiency of learning from potentially unlimited unlabeled hand image data.
2. We design a pose-aware keypoint alignment mechanism for the MIM paradigm, which lets HandMIM exploit pose knowledge that is originally coupled with task-irrelevant information in images (such as color and affine transformation).
3. The integration of token-level self-distillation and pixel-level reconstruction in our framework allows effective learning of both high- and low-level features. These features are crucial for fine-grained regression tasks, including hand mesh estimation.
4. To our knowledge, HandMIM is the first model pre-trained with masked image modeling specifically for hand mesh estimation.
2 Related work
2.1 Hand pose estimation
Hand pose estimation aims to predict hand information from a monocular RGB/depth image, and existing approaches can be broadly classified into parametric and non-parametric methods. Parametric methods [
2.2 Vision transformer (ViT)
ViT [
Most prior works have designed complex structures on top of the transformer or attention blocks. Accordingly, standard transformers cannot easily achieve competitive performance. Our approach attempts to leverage large quantities of unlabeled hand images and surpass existing methods solely based on the standard ViT backbone without any delicate domain-related architecture, demonstrating the effectiveness of our self-supervised regression learning algorithm.
2.3 Self-supervised learning
Self-supervised learning is an approach to learning effective feature representation from abundant unlabeled images. Contrastive learning techniques [
3 Methods
In this section, we will discuss the detailed architecture of HandMIM. The pipeline of HandMIM can be found in Figure 3. We start with preliminaries, including basic knowledge of vision transformers, masked image modeling, and self-distillation techniques in Section 3.1. Then, we introduce the detailed design of HandMIM, including pose-aware keypoint alignment in Section 3.2, token-level self-distillation in Section 3.3, and pixel-level reconstruction in Section 3.4. Finally, we illustrate how to apply pre-trained features after self-supervised learning for 3D hand mesh estimation tasks in Section 3.5. The PyTorch-like pseudocode of HandMIM is listed in Algorithms 1–3.
FIGURE 3

Overall framework of HandMIM. During the self-supervised pretraining phase, we design multi-granularity tasks to acquire pose-aware knowledge, high-level token recovery, and low-level pixel reconstruction. We propose a simple baseline based on the standard vision transformer architecture and the PyMAF [
3.1 Preliminaries
3.1.1 Vision transformers
Given input images , a vision transformer [
Algorithm 1
Input:
batch size , constant softmax temperature , student and teacher network , logit center .
Pseudo code:
for sampled minibatch images do
for all do
draw two random augmentation functions
⊳ random data augmentation for image, is the image crop parameters (left, top, height, and width), and is the 2 × 2 rotation matrix.
⊳ randomly mask the image
⊳ tokens encoded by student network
⊳ tokens encoded by teacher network
⊳ reconstruct the image from tokens
= PAA ⊳ Pose-aware alignment
= PAA ⊳ Pose-aware alignment
= S-D + S-D ⊳ [CLS] token loss
= S-D.mean () + S-D.mean () ⊳ [Patch] token loss
= .mean ()+.mean () ⊳ image reconstruction loss
end for
Update network to minimize
Update network using exponentially moving average (EMA)
Update logit center by moving average
end for
Return student network .
Algorithm 2
Input: .
Pseudo code:
= softmax
= softmax
return.sum (dim = −1)
The output of the self-attention module is then passed through an inverted bottleneck multi-layer perceptron (MLP), also known as the feed-forward network. In practice, vision transformers are assembled by stacking a series of transformer blocks. We can obtain models of varying sizes by varying the channel width and layer depth of vision transformers.
Algorithm 3
Input: .
Pseudo code:
= MLP
= .reshape (−1, 2)
= ( * )/img_size
= .reshape (−1)
return
3.1.2 Masked image modeling
Masked image modeling (MIM) is a self-supervised learning technique that many recent works [33] have shown to be a general method for image recognition tasks. Given input tokens , randomly create a binary mask . When , the original image tokens are passed through the neural network backbone, and when , the input tokens are replaced with a special mask token . By doing so, we obtain both the original tokens and the masked tokens , which are calculated as . The goal of the masked image modeling task is to train the backbone function to recognize and recover the original tokens from the masked tokens by minimizing the loss function in Equation 2:
MIM encourages the model to learn robust local and global image representations, which is especially important for tasks requiring fine-grained understanding, such as 3D hand mesh estimation.
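The masking step described above can be illustrated with a minimal sketch in plain Python. The helper names (`random_mask`, `apply_mask`, `masked_recovery_loss`), the toy token values, the 50% mask ratio, and the squared-error form of the recovery loss are all assumptions made for illustration; they are not taken from the HandMIM implementation.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return a binary mask (1 = masked) with round(num_tokens * mask_ratio) ones."""
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [1 if i in masked else 0 for i in range(num_tokens)]


def apply_mask(tokens, mask, mask_token):
    """Replace each masked token by the shared mask token; keep the others."""
    return [list(mask_token) if m else list(tok) for tok, m in zip(tokens, mask)]


def masked_recovery_loss(predicted, original, mask):
    """Mean squared error over masked positions only (assumed loss form)."""
    errors = [
        sum((p - o) ** 2 for p, o in zip(pred_tok, orig_tok))
        for pred_tok, orig_tok, m in zip(predicted, original, mask)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
tokens = [[float(i), float(i) + 0.5] for i in range(8)]  # 8 toy tokens of dimension 2
mask = random_mask(len(tokens), mask_ratio=0.5, rng=rng)
corrupted = apply_mask(tokens, mask, mask_token=[-1.0, -1.0])
```

In a real pipeline the mask token would be a learned embedding and the recovery would be produced by the backbone; here the loss is simply evaluated on toy inputs to show which positions contribute.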
3.1.3 Self-distillation
Self-distillation is a common technique adopted in recent self-supervised learning frameworks [34, 35]. Given an input image , we apply two random data augmentations to the image, denoted as and , respectively. During training, we treat the backbone function as the student network. The teacher network shares the same architecture as the student network, but its weights are updated using the exponential moving average of the student weights rather than through gradient updates. The goal of self-distillation is to minimize the following consistency loss function Equation 3, which enforces consistency between the output features from the student and teacher networks using and , respectively, where is the distance metric, such as Kullback–Leibler divergence or L1/L2 loss functions:
3.2 Pose-aware keypoint alignment
We observe that the 2D pose of hands in input images remains equivalent after some spatial data augmentation, such as random rotation and resizing operations, while the positional information is altered. As justified in our experiments, existing mainstream self-supervised learning methods fail to capture the knowledge of “poses.” In this work, we propose the idea of pose-aware keypoint alignment to extract the pose-relevant knowledge. This is critical for 3D hand mesh estimation tasks, where understanding and preserving the geometric relationships between keypoints (or joints) is essential for accurate reconstruction. Moreover, we choose this method because it efficiently and effectively captures and utilizes pose-relevant knowledge, integrates seamlessly with the self-distillation and multi-granularity learning paradigms, and enhances the overall performance and robustness of 3D hand mesh estimation.
Consider a point in input image . After the augmentation process, the point is transformed according to Equation 4, where denotes the 2D rotation matrix, denotes the scale factor, and denotes the upper-left coordinate of the resized image. After the last transformer layer, we obtain the output class token in latent space, which we regard as a set of pseudo points and reshape into the format of point . We can then invert the linear transformation to recover the original latent feature before any spatial augmentation, as in Equations 5 and 6:
Then, we apply the softmax function to and obtain class token features to compute the cross-entropy self-distillation losses as depicted in the following subsection. After the pose alignment in latent space, image features after different augmentations exhibit a unified “hand pose,” facilitating the extraction of pose-sensitive knowledge by vision backbones. In the following subsection, we will elaborate on how to learn the pose-aware task.
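The alignment idea can be sketched as follows. This is a minimal illustration that assumes the augmentation acts on a point as "rotate, scale, then translate"; the exact form of the paper's Equations 4 to 6 is not reproduced here, and the function names, view parameters, and toy keypoints are hypothetical.

```python
import math


def rotation(theta):
    """2 x 2 rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def augment_point(p, R, scale, offset):
    """Assumed forward augmentation: q = scale * R p + offset."""
    x = R[0][0] * p[0] + R[0][1] * p[1]
    y = R[1][0] * p[0] + R[1][1] * p[1]
    return (scale * x + offset[0], scale * y + offset[1])


def align_point(q, R, scale, offset):
    """Inverse of augment_point: p = R^T (q - offset) / scale."""
    dx = (q[0] - offset[0]) / scale
    dy = (q[1] - offset[1]) / scale
    return (R[0][0] * dx + R[1][0] * dy, R[0][1] * dx + R[1][1] * dy)


def align_class_token(token, R, scale, offset):
    """Read a flat token as (x, y) pseudo-keypoints, undo the augmentation, flatten."""
    aligned = []
    for i in range(0, len(token), 2):
        aligned.extend(align_point((token[i], token[i + 1]), R, scale, offset))
    return aligned


keypoints = [(0.2, 0.4), (0.6, 0.1), (0.5, 0.5)]
view_a = dict(R=rotation(math.radians(30)), scale=1.5, offset=(10.0, -4.0))
view_b = dict(R=rotation(math.radians(-75)), scale=0.8, offset=(-3.0, 7.0))

# Toy "class tokens": the same keypoints as seen in two differently augmented views.
token_a = [c for p in keypoints for c in augment_point(p, **view_a)]
token_b = [c for p in keypoints for c in augment_point(p, **view_b)]

aligned_a = align_class_token(token_a, **view_a)
aligned_b = align_class_token(token_b, **view_b)
```

After alignment, both views yield the same pseudo-keypoints up to floating-point error, which is the property that lets a cross-view loss compare them directly.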
3.3 Token-level self-distillation
The knowledge of masked image modeling can be acquired through a self-distillation approach proposed by DINO [36]. We treat self-supervised learning as a discriminative task involving two backbones with identical architecture, which play the roles of a teacher network and a student network . Specifically, we train the student network to comprehend corrupted input tokens under the guidance of the teacher network, which receives complete input tokens .
To fully recognize the images, we use two random image augmentations, denoted as and ; thus, we get augmented tokens for the teacher network . We then apply a randomly generated mask to the augmented tokens after the patch embedding layer, resulting in corrupted tokens and for the student network . The process in the student and teacher networks can be formulated as the following Equation 7:
Note that the softmax function is applied to the channel dimension. We use uppercase letters, that is, and , where is the last latent dimension, to denote the output probability distribution [
We design specific tasks of self-supervised learning for the class tokens, considering their semantic meanings. For the class tokens, we aim to extract the pose of the original images, which is equivalent after the inverse operation of spatial data augmentations, implemented as pose-aware keypoint alignment in Section 3.2. Because we expect images under different augmentations to have the same pose expression, we adopt a cross-entropy loss between the cross-view images and apply the self-distillation approach in Section 3.1 to measure the discrepancy between teacher and student distribution. Specifically, we obtain the loss, which can be formulated as the following function Equation 8:
During the backward pass, only the student network requires gradient backpropagation, as we treat the output of the teacher network as ground truth. Subsequently, we update the teacher network through an exponential moving average (EMA) of the student network.
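A minimal sketch of the two ingredients described in this subsection, a cross-entropy self-distillation loss and the EMA teacher update, is given below in plain Python. It follows the general DINO-style recipe (teacher logits are centered and sharpened before the softmax); the separate student and teacher temperatures, their default values, and the momentum value are assumptions for illustration rather than the settings used in HandMIM.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distill_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered teacher distribution and the student one."""
    p_teacher = softmax([t - c for t, c in zip(teacher_logits, center)], tau_t)
    p_student = softmax(student_logits, tau_s)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


teacher_logits = [2.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
loss_match = self_distill_loss([2.0, 0.0, 0.0], teacher_logits, center)
loss_mismatch = self_distill_loss([0.0, 2.0, 0.0], teacher_logits, center)
new_teacher = ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

The loss is small when the student agrees with the teacher and large otherwise; only the student would receive gradients, while `ema_update` moves the teacher slowly toward the student.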
Given the patch output of the transformer backbone, which represents the spatial knowledge of input images, we can define the patch loss . This loss measures the discrepancy between the parallel-view tokens, which share the same spatial position after the augmentations. Specifically, we aim to train our module to recover the corrupted patch tokens. We learn the knowledge using a similar self-distillation approach as in Equation 8 using Equation 9:
3.4 Pixel-level reconstruction
Hand pose estimation is a low-level task that involves directly analyzing image pixels, in contrast to image classification. Although token-level self-distillation may be effective for higher-level knowledge, it may lack the necessary low-level understanding. To address this, we propose a pixel-level reconstruction module. Because transformer tokens are applied in a patch-based manner, we integrate a pyramid fusion layer following certain intermediate transformer layers and gradually up-sample using transposed convolution (T-Conv). The convolution stride is set to 2. The resulting pyramid fusion output feature maps are concatenated with each transformer block output and fused using a linear layer. Mathematically, this can be represented as the following Equation 10:
In common practice, vision transformers use a patch size of 16; therefore, four iterations of transposed convolution are adopted to recover the original shape of input image . We adopt an L1 loss between input images and reconstruction results as in Equation 11, where denotes the token mask, and denotes the Hadamard product. Note that only the student network requires a gradient; therefore, we only apply this loss at the student network with masked input.
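The arithmetic behind "four iterations" and the masked L1 loss can be sketched as follows. This is an illustration under stated assumptions: the loss is averaged over pixels that fall inside masked patches (the exact normalization of Equation 11 is not reproduced), and the function names, the 14 × 14 token grid, and the toy 4 × 4 image are hypothetical.

```python
def upsampled_size(token_grid, num_stages, stride=2):
    """Spatial size after num_stages stride-`stride` transposed convolutions."""
    return token_grid * stride ** num_stages


def masked_l1_loss(image, recon, patch_mask, patch_size):
    """Mean absolute error over pixels that lie inside masked patches (1 = masked)."""
    total, count = 0.0, 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if patch_mask[y // patch_size][x // patch_size]:
                total += abs(value - recon[y][x])
                count += 1
    return total / count if count else 0.0


# A 14 x 14 token grid upsampled four times by a factor of 2 gives 14 * 16 = 224 pixels.
full_resolution = upsampled_size(14, num_stages=4)

# Toy 4 x 4 "image" split into 2 x 2 patches; the top-left and bottom-right patches are masked.
image = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 3, 4, 4],
         [3, 3, 4, 4]]
recon = [[0, 0, 0, 0] for _ in range(4)]
patch_mask = [[1, 0],
              [0, 1]]
loss = masked_l1_loss(image, recon, patch_mask, patch_size=2)
```

Because each stride-2 stage doubles the resolution, four stages undo a patch size of 16 = 2^4, which is why the text pairs these two numbers.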
The final loss function Equation 12 is the sum of the losses mentioned above:
The above loss function indicates that HandMIM can capture both local detail features and global geometric context via a vision transformer backbone. The transformer architecture naturally handles multi-scale information, but HandMIM goes further by introducing a mechanism that specifically targets different levels of granularity. More specifically, the [Patch] tokens represent local regions of the image and are used to capture fine-grained geometric features essential for mesh estimation and refinement. Pseudo keypoints are aligned in the latent space using the [CLS] token, which acts as a global representation of the entire image. By aligning these keypoints, the model can better understand the pose equivalence between different views of the hand after applying spatial augmentations. Finally, the combination of pixel-level reconstruction and multi-granularity feature learning allows HandMIM to learn how to recover pixels from occlusions and handle complex hand–object interactions more effectively, which is particularly beneficial on datasets like HO3Dv2, which feature severe hand occlusions.
3.5 3D hand mesh estimation via ViT
To evaluate the effectiveness and benefits of HandMIM self-supervised pretraining, we fine-tune the pre-trained vision transformer backbone on a supervised 3D hand mesh estimation task. Specifically, we incorporate a keypoint feedback loop after the backbone, similar to the approach used in PyMAF [
The MANO parameter loss is calculated as the L2 distance between the predicted MANO parameters and the ground truth. Given MANO parameters , , , the 3D mesh vertices can be obtained using the MANO model , which can be used to calculate the vertex loss as a more direct form of supervision. Furthermore, the 3D keypoints can be generated by mapping the 3D mesh using a pre-trained linear regressor. By projecting the 3D keypoints onto the image coordinate system, we can obtain 2D keypoints, which can be used to supervise the training process with 2D keypoint ground truth . Overall, the keypoint loss is composed of the 3D keypoint loss and the projected 2D keypoint loss as in Equation 14, where indicates the ground-truth camera intrinsic matrix, following common practice. Together, the pre-trained ViT backbone and the pyramidal mesh alignment feedback head contribute significantly to the performance of HandMIM. The ViT backbone’s capacity to learn detailed and hierarchical features and the PyMAF head’s ability to refine the mesh through iterative alignment and direct parameter supervision result in competitive performance across various datasets, especially in challenging scenarios involving severe occlusions.
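The keypoint supervision described above can be sketched in plain Python as a 3D term plus a projected 2D term. This is a minimal illustration assuming a standard pinhole camera model; the intrinsic matrix values, the toy keypoints, the `weight_2d` balancing factor, and the function names are hypothetical and do not come from the paper.

```python
import math


def project_to_image(point, K):
    """Pinhole projection of a camera-space 3D point with a 3 x 3 intrinsic matrix K."""
    u, v, w = (sum(K[r][c] * point[c] for c in range(3)) for r in range(3))
    return (u / w, v / w)


def mean_l2(pred, target):
    """Average Euclidean distance between corresponding points."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def keypoint_loss(pred_3d, gt_3d, gt_2d, K, weight_2d=1.0):
    """3D keypoint error plus the error of the keypoints projected into the image."""
    pred_2d = [project_to_image(p, K) for p in pred_3d]
    return mean_l2(pred_3d, gt_3d) + weight_2d * mean_l2(pred_2d, gt_2d)


# Hypothetical intrinsics: focal length 500 px, principal point (112, 112).
K = [[500.0, 0.0, 112.0],
     [0.0, 500.0, 112.0],
     [0.0, 0.0, 1.0]]
gt_3d = [(0.0, 0.0, 0.5), (0.05, -0.02, 0.6)]
gt_2d = [project_to_image(p, K) for p in gt_3d]

perfect = keypoint_loss(gt_3d, gt_3d, gt_2d, K)
shifted = keypoint_loss([(x + 0.01, y, z) for x, y, z in gt_3d], gt_3d, gt_2d, K)
```

Since the two terms live in different units (metric space and pixels), a weighting factor such as the hypothetical `weight_2d` is typically needed to balance them.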
4 Results
In this section, we conduct extensive experiments to evaluate the proposed self-supervised pretraining framework HandMIM. We first introduce our settings for HandMIM pretraining in Section 4.1. Then, we show the results of our pre-trained model on 3D hand mesh estimation tasks in Section 4.2. Finally, we present in-depth analysis and ablation studies in Section 4.3.
4.1 HandMIM pretraining
4.1.1 Pretraining settings
We employ vision transformers [
4.1.2 Pretraining datasets
As there are currently no standardized datasets for hand pose self-supervised learning, we collect hand images across a variety of datasets for sufficient hand pose and background distributions, including the FreiHAND [
4.2 3D hand mesh estimation
We evaluated the performance of HandMIM models against several competitive methods in 3D hand mesh estimation. Our experiments demonstrate that pretraining HandMIM models significantly enhances the accuracy and quality of visualizations in 3D hand mesh estimation tasks and achieves competitive performance in multiple datasets and metrics.
4.2.1 Setups
For evaluation, we use two challenging publicly available hand pose estimation datasets, FreiHAND [
During training, we set the batch size to 128 and then crop and resize the hand image to . Random scale, translation, rotation, and color jitter are applied for data augmentation. We fine-tune our model using the Adam optimizer for 100 epochs, with a learning rate of . Our ViT-S model achieves a real-time inference speed of 40 frames per second on a single NVIDIA V100 GPU. The detailed architecture, pretraining, and inference time are listed in Table 1.
TABLE 1
| Model | Layer depth | Embed dim | MLP size | Number of heads | Params (M) | Pretraining time (hours) | Inference FPS |
|---|---|---|---|---|---|---|---|
| ViT-Small | 12 | 384 | 1,536 | 6 | 22 | 12 | 40 |
| ViT-Base | 12 | 768 | 3,072 | 12 | 86 | 23 | 15 |
| ViT-Large | 24 | 1,024 | 4,096 | 16 | 307 | 53 | 4 |
Details of the vision transformer architecture, as well as the pretraining and inference time in HandMIM.
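As a sanity check on the parameter counts in Table 1, the size of a standard vision transformer can be roughly estimated from its depth, embedding dimension, and MLP size. The sketch below counts only the attention and MLP weight matrices of each block; biases, layer norms, and the patch and position embeddings are ignored, so the estimate is slightly below the reported figures. The function name and this simplification are assumptions for illustration.

```python
def approx_vit_params(depth, embed_dim, mlp_dim):
    """Rough weight count: per block, 4 d^2 for attention and 2 d * mlp_dim for the MLP."""
    attention = 4 * embed_dim * embed_dim   # query, key, value, and output projections
    mlp = 2 * embed_dim * mlp_dim           # expansion and contraction layers
    return depth * (attention + mlp)


# (depth, embed dim, MLP size, reported params in millions) copied from Table 1.
table_1 = {
    "ViT-Small": (12, 384, 1536, 22),
    "ViT-Base": (12, 768, 3072, 86),
    "ViT-Large": (24, 1024, 4096, 307),
}

estimates = {
    name: approx_vit_params(depth, dim, mlp) / 1e6
    for name, (depth, dim, mlp, _) in table_1.items()
}
for name, value in estimates.items():
    print(f"{name}: about {value:.1f} M parameters")
```

The estimates land within a few percent of the table, which shows that model size is driven almost entirely by channel width and layer depth.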
4.2.2 Evaluation metrics
We incorporate multiple evaluation metrics for comprehensive analysis and comparison. We use joint-point-error (JPE) and vertex-point-error (VPE) to denote the average L2 distance between the ground truth and predicted keypoints and mesh vertices, respectively. We prefix the metrics with PA and MP to denote Procrustes alignment and scale-and-translation alignment, respectively. The F-score is the harmonic mean of recall and precision between two meshes given a distance threshold. Following common practice, we also report the area under the curve (AUC), that is, the area under the percentage-of-correct-keypoints (PCK) curve for threshold values between 0 mm and 50 mm in 100 equally spaced increments. We report our evaluation results in mm by default.
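The metrics above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the official evaluation code: the alignment step (Procrustes or scale-and-translation) is omitted, the PCK curve is integrated with the trapezoidal rule over 100 evenly spaced thresholds, and the function names are hypothetical.

```python
import math


def mean_point_error(pred, gt):
    """JPE/VPE: average Euclidean distance between corresponding points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def pck_auc(errors, max_threshold=50.0, steps=100):
    """Normalized area under the PCK curve for thresholds in [0, max_threshold]."""
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]
    pck = [sum(e <= t for e in errors) / len(errors) for t in thresholds]
    area = sum(
        (pck[i] + pck[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
        for i in range(steps - 1)
    )
    return area / max_threshold


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall between two point sets."""
    precision = sum(min(math.dist(p, g) for g in gt) <= threshold for p in pred) / len(pred)
    recall = sum(min(math.dist(g, p) for p in pred) <= threshold for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]   # first point is off by 5 mm
error = mean_point_error(pred, gt)
auc = pck_auc([math.dist(p, g) for p, g in zip(pred, gt)])
score = f_score(pred, gt, threshold=5.0)
```

In this sketch, a PA-prefixed metric would apply the same functions after rigidly aligning the prediction to the ground truth.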
4.2.3 Results on FreiHAND
We compare our approach with existing methods [
TABLE 2
| Method | Params (M) | PAVPE | PAJPE | F@5 | F@15 |
|---|---|---|---|---|---|
| Kulon et al.ᶜ | - | 8.6 | 8.4 | 0.614 | 0.966 |
| HaMeR/ViT-Base | 86 | - | 10.72 | - | - |
| I2L-MeshNet | 135 | 7.6 | 7.4 | 0.681 | 0.973 |
| I2UV-HandNet [39] | - | 7.4 | 7.2 | 0.707 | 0.977 |
| HIU-DMTLᵇ [40] | - | 7.3 | 7.1 | 0.699 | 0.974 |
| Tang et al. [41] | 149 | 7.1 | 7.1 | 0.706 | 0.977 |
| PeCLR-Res50ᵇ | 26 | - | 7.1 | - | - |
| TempCLR-Res50ᵇ | 26 | 10.2 | - | 0.541 | 0.941 |
| Mesh Graphormerᵃ | 204 | 6.8 | 6.6 | 0.732 | 0.982 |
| MobReconᵃ | 22 | 7.2 | 6.9 | 0.694 | 0.979 |
| FastViT | - | 6.6 | 6.7 | 0.722 | 0.981 |
| ViT-Small-ImageNet + PyMAF | 22 | 7.1 | 7.2 | 0.697 | 0.978 |
| ViT-Large-ImageNet + PyMAF | 307 | 6.6 | 6.6 | 0.727 | 0.983 |
| HandMIM-Smallᵇ | 22 | 6.57 | 6.57 | 0.725 | 0.984 |
| HandMIM-Baseᵇ | 86 | | | 0.731 | 0.985 |
| HandMIM-Largeᵇ | 307 | | | 0.744 | 0.986 |
Results on the FreiHAND [
ᵃ denotes non-ensemble evaluation results for a fair comparison.
ᵇ denotes self-supervised training approaches.
ᶜ denotes weakly supervised training approaches.
For a fair comparison, we re-trained HaMeR [
FIGURE 4

Pose and mesh AUC comparison with some competitive methods on the FreiHAND and HO3D datasets. * indicates the method is supervised and trained with extra 2D/3D labeled data. The plot shows that, with ViT-B as the backbone, our method achieves the best mesh and pose AUC values on both datasets. (A) FreiHAND Mesh AUC, (B) FreiHAND Pose AUC, (C) HO3D v2 Mesh AUC, and (D) HO3D v2 Pose AUC.
4.2.4 Results on HO3Dv2
For HO3Dv2, existing methods [
TABLE 3
| Method | Params (M) | PAVPE | PAJPE | MPJPE | F@5 | F@15 |
|---|---|---|---|---|---|---|
| Liu et al.ᵃ [43] | 34 | 9.5 | 9.9 | 31.7 | 0.528 | 0.956 |
| HandOccNet | 38 | 8.8 | 9.1 | 24.0 | 0.564 | 0.963 |
| AMVUR | 195 | 8.2 | 8.3 | - | 0.608 | 0.965 |
| Keypoint Trans | 48 | - | 10.8 | - | - | - |
| ViT-Small-ImageNet + PyMAF | 22 | 8.78 | 9.18 | 26.37 | 0.567 | 0.963 |
| ViT-Large-ImageNet + PyMAF | 307 | 8.43 | 8.73 | 23.57 | 0.588 | 0.970 |
| HandMIM-Small | 22 | | | 24.00 | 0.597 | 0.970 |
| HandMIM-Base | 86 | | | 22.01 | 0.610 | 0.971 |
| HandMIM-Large | 307 | | | 21.94 | 0.617 | 0.972 |
Results on the HO3D v2 [
ᵃ denotes the self-supervised training approach. Note that our HandMIM-Base already achieves competitive performance without any complicated designs for hand occlusion issues, such as AMVUR [
Bold values indicate the best result in each column.
4.2.5 Visualizations
We visualized and compared the hand mesh predictions of our proposed method with some competitive methods on the test sets of FreiHAND [
FIGURE 5

Visualizations on the FreiHAND [
FIGURE 6

Visualization of the HO3D v2 [
FIGURE 7

Qualitative comparison with several methods on the HO3D v2 test set. From left to right, it shows the input images, the overlaid results by HandOccNet [
4.3 Ablation study
In this subsection, we present a series of analysis experiments and ablations to evaluate the effectiveness of HandMIM. We demonstrate the superiority of our method over existing self-supervised methods through comprehensive comparisons. To assess the generalizability of our method, we perform linear probe, cross-dataset, and partial fine-tuning analyses, along with visualizations of HandMIM.
4.3.1 Linear probe for keypoint regression
Because we enforce pose-sensitive knowledge in our latent features, we can adopt the linear probe strategy to validate their effectiveness. Linear probing is an intuitive way to assess the quality of representations learned by a self-supervised model: the pre-trained backbone is frozen, and a simple MLP predicts the output. We use the 2.5D joint representation to regress 2D and 3D keypoints jointly. Concretely, we learn two 3-layer multilayer perceptrons (MLPs) to predict 2D keypoints and 1D relative depth, respectively. The resulting 3D keypoints are calculated according to the camera’s intrinsic parameters. We trained our MLP layer on FreiHAND [
TABLE 4
| Method | 2D-error (px) | 3D joint error (cm) |
|---|---|---|
| Random | 21.69 | 80.57 |
| iBOT | 9.74 | 23.67 |
| PeCLR | 8.94 | 19.41 |
| Ours/ViT-S | 5.75 | 10.83 |
| Ours/ViT-B | 5.19 | 10.59 |
Results of linear probe regression. We compared our method with the mainstream self-supervised learning method, and our HandMIM outperforms existing methods by a large margin.
4.3.2 Comparisons with alternative self-supervised learning methods
As shown in Table 5, we compared the performance of our proposed pose-aware method for 3D hand mesh estimation with two representative self-supervised learning methods, the mainstream masked image modeling method iBOT [
TABLE 5
| Method | Dataset | PAVPE | PAJPE | F@5 | F@15 |
|---|---|---|---|---|---|
| ViT-S | FH | 7.10 | 7.21 | 0.697 | 0.978 |
| ViT-S + iBOT | FH | 6.98 | 6.98 | 0.704 | 0.979 |
| ViT-S + PeCLR | FH | 8.51 | 8.76 | 0.629 | 0.961 |
| ViT-S + HandMIM | FH | 6.57 | 6.57 | 0.725 | 0.984 |
| ViT-S | HO3Dv2 | 8.71 | 9.05 | 0.571 | 0.965 |
| ViT-S + iBOT | HO3Dv2 | 8.56 | 8.84 | 0.581 | 0.966 |
| ViT-S + PeCLR | HO3Dv2 | 8.81 | 9.14 | 0.565 | 0.963 |
| ViT-S + HandMIM | HO3Dv2 | 8.22 | 8.57 | 0.597 | 0.970 |
Comparisons with self-supervised methods. We train HandMIM with baselines under the same backbone and pretraining data. Results are evaluated on the FreiHAND [
Bold values indicate the best result in each column.
4.3.3 Cross-dataset validation
To evaluate the generalizability of our proposed method, we conducted a cross-dataset validation on 3D hand mesh estimation tasks. Specifically, we fine-tuned our model on the training set of FreiHAND and evaluated its performance on the test set of HO3D v2, and vice versa. Our results, presented in Table 6, demonstrate significant improvements compared to existing self-supervised methods such as PeCLR [
TABLE 6
| Method | Train FH/Test HO3D PAJPE | Train FH/Test HO3D MPJPE | Train HO3D/Test FH PAJPE | Train HO3D/Test FH MPJPE |
|---|---|---|---|---|
| Hasson et al.ᵃ [44] | 11.0 | 31.8 | - | - |
| Hampali et al.ᵃ | 10.7 | 30.4 | - | - |
| PeCLR | 13.6 | - | 17.8 | - |
| TempCLR | 13.6 | - | 17.0 | - |
| HandMIM/ViT-S | 9.9 | 30.4 | 14.1 | 29.74 |
Cross-dataset analysis on HO3D and FreiHAND. Methods are trained on FreiHAND and tested on HO3D and vice versa.
ᵃ indicates that the methods are trained and tested on the same dataset. Performances of [43], [
Bold values indicate the best result in each column.
4.3.4 Visualization of HandMIM pretraining
We visualize the hand pose estimation results before and after self-supervised pretraining in Figure 8. The results show that HandMIM pretraining improves the robustness of 3D hand mesh estimation. More visualization examples are shown in the supplementary document.
FIGURE 8

Visualizations of HandMIM pretraining. Images in the left column are from the FreiHAND [
4.3.5 Partial fine-tuning
To further explore the efficacy of the learned features, we employ a partial fine-tuning method based on the protocol proposed in [
FIGURE 9

Partial fine-tuning performance comparison between pre-trained weight from mainstream masked image modeling methods and our HandMIM with ViT-Small as the backbone. We use the FreiHAND [
4.3.6 Ablations on self-supervised loss designs
The pose-aware , token-level , and pixel-level losses in our HandMIM framework collaborate to capture distinct levels of representations from input images in a self-supervised manner. To verify the effectiveness of our design, we conduct experiments by removing one of the losses from our framework, as shown in Table 7. The results demonstrate that the removal of any one of the losses results in a decrease in overall precision, justifying the importance of our multi-level loss design. Therefore, our approach can effectively leverage various levels of information to enhance the robustness and accuracy of hand mesh estimation tasks.
TABLE 7
| Pose-aware loss | Token-level loss | Pixel-level loss | PAVPE | PAJPE | F@5 | F@15 |
|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 6.57 | 6.57 | 0.725 | 0.984 |
| ✗ | ✓ | ✓ | 6.89 | 6.87 | 0.707 | 0.981 |
| ✓ | ✗ | ✓ | 6.90 | 6.85 | 0.708 | 0.981 |
| ✓ | ✓ | ✗ | 6.76 | 6.74 | 0.715 | 0.982 |
Ablation studies. We perform ablations on the loss design of HandMIM. Specifically, we remove each of the three critical losses in turn. We conduct experiments based on the ViT-Small backbone and FreiHAND [
Bold values indicate the best result in each column.
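For readers unfamiliar with the F@5 and F@15 columns, the following sketch shows how an F-score between a predicted and a ground-truth point set is commonly computed at a distance threshold. It assumes the usual definition (harmonic mean of precision and recall under a nearest-neighbor distance test) with thresholds of 5 mm and 15 mm; this section does not define the metric, so treat the details as assumptions. The function names are hypothetical, and the brute-force nearest-neighbor search is chosen for clarity rather than speed.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _fraction_within(src: Sequence[Point], dst: Sequence[Point], threshold: float) -> float:
    """Fraction of points in `src` whose nearest neighbor in `dst` is closer than `threshold`."""
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < threshold)
    return hits / len(src)


def f_score(pred: Sequence[Point], gt: Sequence[Point], threshold: float) -> float:
    """Harmonic mean of precision (pred -> gt) and recall (gt -> pred)."""
    precision = _fraction_within(pred, gt, threshold)
    recall = _fraction_within(gt, pred, threshold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy point sets in millimeters, not real mesh vertices.
    predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 3.0), (30.0, 0.0, 0.0)]
    print(f_score(predicted, ground_truth, 5.0))
    print(f_score(predicted, ground_truth, 15.0))
```

A looser threshold can only keep or raise the score, which is consistent with the F@15 values in Table 7 being much closer to 1 than the F@5 values.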
4.3.7 Scalability of HandMIM pretraining
To assess how HandMIM scales with unlabeled images, we run pretraining experiments on different proportions of the full dataset. Table 8 shows the evaluation results. Performance improves with more unlabeled data (6.78 mm PAVPE with 50% of the pretraining data and 6.57 mm with the full dataset), indicating that HandMIM has the potential to improve further with abundant unlabeled hand images. We then assess the accuracy of the model’s estimations while varying the proportion of labeled data used for fine-tuning, at ratios of 10%, 20%, 40%, and 80%. As indicated by the red bars in Figure 10, performance improves as the amount of labeled data increases. The estimation error decreases approximately exponentially, in line with the scaling behavior described by Tan and Le [45]. We also evaluate a model whose backbone was pre-trained in a supervised manner on labeled data. These results, shown as blue bars in Figure 10, indicate that our self-supervised approach outperforms the supervised baseline, reducing the error by approximately 40%–50%. This finding underscores the advantage of our approach in regression tasks related to hand pose estimation and highlights its reduced reliance on labeled training data.
TABLE 8
| Dataset proportion | 25% | 50% | 100% |
|---|---|---|---|
| PAVPE (mm) | 7.1 | 6.78 | **6.57** |
Evaluation results on the scalability of HandMIM on unlabeled images.
The bold value indicates the best result.
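The trend in Table 8 is easier to judge in relative terms. The short sketch below recomputes the percentage drop in PAVPE between the reported data proportions; the values are copied from Table 8, while the helper name `relative_reduction` is only illustrative.

```python
# PAVPE (mm) from Table 8, keyed by the percentage of unlabeled pretraining data.
PAVPE_MM = {25: 7.1, 50: 6.78, 100: 6.57}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which the error drops when going from `before` to `after`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    for small, large in [(25, 50), (50, 100), (25, 100)]:
        drop = relative_reduction(PAVPE_MM[small], PAVPE_MM[large])
        print(f"{small}% -> {large}%: {drop:.2f}% lower PAVPE")
```

Using the full dataset instead of a quarter of it lowers PAVPE by roughly 7.5% in relative terms. Each doubling of the data yields a smaller absolute gain (0.32 mm, then 0.21 mm), so the curve flattens but has not saturated within the tested range.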
FIGURE 10

Scalability of HandMIM pretraining. We pre-trained the backbone ViT-S with two strategies: (i) ViT-S + ImageNet: training ViT-S in a supervised approach with labeled data on ImageNet. (ii) ViT-S + HandMIM: training ViT-S in the self-supervised approach described above with unlabeled data. Both models are connected with PyMAF [
Furthermore, we evaluated HandMIM at different model sizes, using the vision transformer small (ViT-S), base (ViT-B), and large (ViT-L) configurations. The results in Figure 11 show that (i) HandMIM improves as the parameter count increases, and (ii) HandMIM consistently outperforms other methods at a matched parameter scale.
FIGURE 11

Performance–parameter trade-off of mainstream 3D hand mesh estimation methods on the FreiHAND [
5 Limitations
HandMIM has demonstrated competitive performance across various datasets. Nevertheless, there are still situations where the model might struggle, as shown in Figure 12.
(1) Complex hand–object interactions. When hands are engaged in complex interactions with objects, the model must infer the occluded parts of the hand based on limited visual cues. Although HandMIM shows promise in these scenarios, there is room for improvement, especially when the interaction involves intricate movements or unusual poses that the model has not encountered during training.
(2) Extreme occlusions. Despite advancements in handling occlusions, extremely occluded hands—where large portions of the hand are hidden or covered by other fingers—remain challenging. In these cases, the model may lack sufficient visible information to accurately reconstruct the hand mesh, leading to increased prediction errors.
(3) Dataset variability. The effectiveness of HandMIM depends on the diversity and quality of the pretraining datasets. If the datasets used for pretraining do not adequately cover certain types of hand poses or backgrounds, the model’s ability to generalize to unseen data may be compromised.
FIGURE 12

The figure demonstrates some common failure cases. (A) Complex hand–object interactions. (B) Extreme occlusion between fingers.
Accordingly, while HandMIM excels in many aspects of 3D hand mesh estimation, it faces challenges related to the quality of pseudo-keypoint generation and potential failures in extreme occlusion scenarios. Addressing these limitations will be essential for further enhancing the robustness and applicability of our model.
6 Conclusion
In this study, we introduced HandMIM, a self-supervised pretraining strategy designed for 3D hand mesh regression from monocular RGB images. Our approach combines masked image modeling with a multi-granularity strategy and pseudo-keypoint alignment within a teacher–student framework, using self-distillation to learn comprehensive representations. In our fine-tuning experiments with varying amounts of labeled data, HandMIM reduced the error by approximately 40%–50% relative to a backbone pre-trained with supervision, which underscores its reduced reliance on labeled training data. Experiments across various datasets highlight HandMIM’s robustness and adaptability, particularly under challenging conditions such as severe occlusions. Notably, it achieved an 8.00 mm PAVPE on the HO3Dv2 test set, outperforming many specialized architectures. Scalability tests on unlabeled images showed that increasing the dataset proportion from 25% to 100% progressively decreased the PAVPE from 7.1 mm to 6.57 mm. Evaluating HandMIM at different parameter scales further showed that performance improves with larger models and that HandMIM consistently outperforms other methods at a matched parameter scale. For future work, we propose several directions:
• Exploring the integration of temporal information. Current research focuses on single-image-based estimation. Expanding HandMIM to incorporate sequential video frames could enhance pose estimation accuracy and stability.
• Addressing dual-hand interactions. The current scope is limited to single-hand poses. Future efforts should consider extending the model to handle scenarios involving two interacting hands.
• Generalizing to related tasks. Investigating how the principles behind HandMIM can be applied to other human-centric regression and estimation tasks could broaden its impact.
Overall, HandMIM represents a significant advancement in self-supervised learning for 3D hand pose estimation, setting a new benchmark and opening avenues for further exploration.
Statements
Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.
Author contributions
YL: Writing–original draft, Funding acquisition, Formal analysis, Investigation. CW: Writing–original draft, Conceptualization, Data curation, Methodology. HW: Project administration, Supervision, Writing–review and editing.
Funding
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by Science and Technology Research Project of Jiangxi Provincial Department of Education (No. GJJ2203419).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Lin K, Wang L, Liu Z. Mesh graphormer. In: IEEE international conference on computer vision (ICCV) (2021). p. 12939–48.
2. Hampali S, Sarkar SD, Rad M, Lepetit V. Keypoint transformer: solving joint identification in challenging hands and object interactions for accurate 3d pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 11090–100.
3. Cai G, Zheng X, Guo J, Gao W. Real-time identification of borehole rescue environment situation in underground disaster areas based on multi-source heterogeneous data fusion. Saf Sci (2025) 181:106690. doi: 10.1016/j.ssci.2024.106690
4. Jin W, Tian X, Shi B, Zhao B, Duan H, Wu H. Enhanced uav pursuit-evasion using boids modelling: a synergistic integration of bird swarm intelligence and drl. Comput Mater & Continua (2024) 80:3523–53. doi: 10.32604/cmc.2024.055125
5. Hu Z, Qi W, Ding K, Liu G, Zhao Y. An adaptive lighting indoor vslam with limited on-device resources. IEEE Internet Things J (2024) 11:28863–75. doi: 10.1109/JIOT.2024.3406816
6. Chen J, Li T, Zhang Y, You T, Lu Y, Tiwari P, et al. Global-and-local attention-based reinforcement learning for cooperative behaviour control of multiple uavs. IEEE Trans Vehicular Technology (2024) 73:4194–206. doi: 10.1109/TVT.2023.3327571
7. Chen J, Du C, Zhang Y, Han P, Wei W. A clustering-based coverage path planning method for autonomous heterogeneous uavs. IEEE Trans Intell Transportation Syst (2021) 23:25546–56. doi: 10.1109/tits.2021.3066240
8. Zhu P, Pan Z, Liu Y, Tian J, Tang K, Wang Z. A general black-box adversarial attack on graph-based fake news detectors. In: International joint conference on artificial intelligence (IJCAI 2024) (2024).
9. Zhu P, Fan Z, Guo S, Tang K, Li X. Improving adversarial transferability through hybrid augmentation. Comput & Security (2024) 139:103674. doi: 10.1016/j.cose.2023.103674
10. Guo S, Li X, Zhu P, Mu Z. Ads-detector: an attention-based dual stream adversarial example detection method. Knowledge-Based Syst (2023) 265:110388. doi: 10.1016/j.knosys.2023.110388
11. Kulon D, Guler RA, Kokkinos I, Bronstein MM, Zafeiriou S. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). p. 4990–5000.
12. Spurr A, Iqbal U, Molchanov P, Hilliges O, Kautz J. Weakly supervised 3d hand pose estimation via biomechanical constraints. In: Proceedings of the European conference on computer vision. Springer (2020). p. 211–28.
13. Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, et al. Mediapipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019).
14. Kolesnikov A, Dosovitskiy A, Weissenborn D, Heigold G, Uszkoreit J, Beyer L, et al. An image is worth 16x16 words: transformers for image recognition at scale (2021).
15. Zhou J, Wei C, Wang H, Shen W, Xie C, Yuille A, et al. ibot: image bert pre-training with online tokenizer. In: International conference on learning representations (ICLR) (2022).
16. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 16000–9.
17. Oord Avd, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018). doi: 10.48550/arXiv.1807.03748
18. Spurr A, Dahiya A, Wang X, Zhang X, Hilliges O. Self-supervised 3d hand pose estimation from monocular rgb via contrastive learning. In: IEEE international conference on computer vision (ICCV) (2021). p. 11230–9.
19. Ziani A, Fan Z, Kocabas M, Christen S, Hilliges O. Tempclr: reconstructing hands via time-coherent contrastive learning. In: International conference on 3D vision (3DV) (2022).
20. Zhang H, Tian Y, Zhou X, Ouyang W, Liu Y, Wang L, et al. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: 2021 IEEE/CVF international conference on computer vision (ICCV) (2021). p. 11426–36. doi: 10.1109/ICCV48922.2021.01125
21. Romero J, Tzionas D, Black MJ. Embodied hands: modeling and capturing hands and bodies together. ACM Trans Graph (2017) 36:1–17. doi: 10.1145/3130800.3130883
22. Zimmermann C, Ceylan D, Yang J, Russell B, Argus M, Brox T. Freihand: a dataset for markerless capture of hand pose and shape from single rgb images. In: IEEE international conference on computer vision (ICCV) (2019). p. 813–22.
23. Hampali S, Rad M, Oberweger M, Lepetit V. Honnotate: a method for 3d annotation of hand and object poses. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). p. 3196–206.
24. Moon G, Lee KM. I2l-meshnet: image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In: Proceedings of the European conference on computer vision (2020).
25. Xie P, Xu W, Tang T, Yu Z, Lu C. Ms-mano: enabling hand pose tracking with biomechanical constraints. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2024). p. 2382–92.
26. Jiang Z, Rahmani H, Black S, Williams BM. A probabilistic attention model with occlusion-aware texture regression for 3d hand reconstruction from a single rgb image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2023). p. 758–67.
27. Vasu PKA, Gabriel J, Zhu J, Tuzel O, Ranjan A. Fastvit: a fast hybrid vision transformer using structural reparameterization. In: Proceedings of the IEEE/CVF international conference on computer vision (2023). p. 5785–95.
28. Wang C, Zhu F, Wen S. Memahand: exploiting mesh-mano interaction for single image two-hand reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2023). p. 564–73.
29. Park J, Oh Y, Moon G, Choi H, Lee KM. Handoccnet: occlusion-robust 3d hand mesh estimation network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 14682–92.
30. Chen X, Liu Y, Dong Y, Zhang X, Ma C, Xiong Y, et al. Mobrecon: mobile-friendly hand mesh reconstruction from monocular image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 20544–54.
31. Pavlakos G, Shan D, Radosavovic I, Kanazawa A, Fouhey D, Malik J. Reconstructing hands in 3D with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2024).
32. Zimmermann C, Argus M, Brox T. Contrastive representation learning for hand shape estimation. In: DAGM German conference on pattern recognition. Springer (2021). p. 250–64.
33. Bao H, Dong L, Piao S, Wei F. BEiT: BERT pre-training of image transformers. In: International conference on learning representations (2022).
34. He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). p. 9729–38.
35. Grill J-B, Strub F, Altché F, Tallec C, Richemond P, Buchatskaya E, et al. Bootstrap your own latent - a new approach to self-supervised learning. In: 34th conference on neural information processing systems (NeurIPS 2020) (2020) 33:21271–84. doi: 10.5555/3495724.3496201
36. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, et al. Emerging properties in self-supervised vision transformers. In: IEEE international conference on computer vision (ICCV) (2021). p. 9650–60.
37. Jin S, Xu L, Xu J, Wang C, Liu W, Qian C, et al. Whole-body human pose estimation in the wild. In: Proceedings of the European conference on computer vision. Springer (2020). p. 196–214.
38. Xiang Y, Schmidt T, Narayanan V, Fox D. Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes (2018).
39. Chen P, Chen Y, Yang D, Wu F, Li Q, Xia Q, et al. I2uv-handnet: image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling. In: IEEE international conference on computer vision (ICCV) (2021). p. 12929–38.
40. Zhang X, Huang H, Tan J, Xu H, Yang C, Peng G, et al. Hand image understanding via deep multi-task learning. In: IEEE international conference on computer vision (ICCV) (2021). p. 11281–92.
41. Tang X, Wang T, Fu C-W. Towards accurate alignment in real-time 3d hand-mesh reconstruction. In: IEEE international conference on computer vision (ICCV) (2021). p. 11698–707.
42. Lim GM, Jatesiktat P, Ang WT. Mobilehand: real-time 3d hand shape and pose estimation from color image. In: International conference on neural information processing. Springer (2020). p. 450–9.
43. Liu S, Jiang H, Xu J, Liu S, Wang X. Semi-supervised 3d hand-object poses estimation with interactions in time. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021).
44. Hasson Y, Varol G, Tzionas D, Kalevatykh I, Black MJ, Laptev I, et al. Learning joint reconstruction of hands and manipulated objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (United States: Computer Vision Foundation (CVF) and IEEE) (2019). p. 11807–16.
45. Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th international conference on machine learning. PMLR, vol. 97 of Proceedings of Machine Learning Research (2019). p. 6105–14.
Summary
Keywords
3D hand mesh estimation, multi-granularity representation, self-supervised learning, masked image modeling, vision transformer
Citation
Li Y, Wang C and Wang H (2025) Toward accurate hand mesh estimation via masked image modeling. Front. Phys. 12:1515842. doi: 10.3389/fphy.2024.1515842
Received
23 October 2024
Accepted
18 December 2024
Published
29 January 2025
Volume
12 - 2024
Edited by
Peican Zhu, Northwestern Polytechnical University, China
Reviewed by
Anas Bilal, Hainan Normal University, China
Abdelkarim Ben Sada, University College Cork, Ireland
Copyright
© 2025 Li, Wang and Wang.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Huan Wang, wanghuan6@email.szu.edu.cn