ORIGINAL RESEARCH article

Front. Robot. AI, 14 January 2026

Sec. Robot Vision and Artificial Perception

Volume 12 - 2025 | https://doi.org/10.3389/frobt.2025.1683931

Multi-view object pose distribution tracking for pre-grasp planning on mobile robots

  • 1. SDU Robotics, Mærsk Mc-Kinney Møller Institute (MMMI), Faculty of Engineering, University of Southern Denmark, Odense, Denmark

  • 2. Danish Institute for Advanced Studies (DIAS), Odense, Denmark


Abstract

The ability to track the 6D pose distribution of an object while a mobile manipulator is still approaching it can enable the robot to pre-plan grasps, thereby improving both the time efficiency and robustness of mobile manipulation. However, tracking a 6D object pose distribution on approach can be challenging due to the limited view of the robot camera. In this study, we present a particle filter-based multi-view 6D pose distribution tracking framework that compensates for the limited view of the moving robot camera while it approaches the object by fusing observations from external stationary cameras in the environment. We extend the single-view pose distribution tracking framework (PoseRBPF) to fuse observations from external cameras. We model the object pose posterior as a multi-modal distribution and introduce techniques for fusion, re-sampling, and pose estimation from the tracked distribution to effectively handle noisy and conflicting observations from different cameras. To evaluate our framework, we also contribute a real-world benchmark dataset. Our experiments demonstrate that the proposed framework yields a more accurate quantification of object pose and associated uncertainty than previous research. Finally, we apply our framework for pre-grasp planning on mobile robots, demonstrating its practical utility.

1 Introduction

Inspired by the success of mobile logistic robots in warehouses, mobile manipulation (MM) is also making strides toward solving various day-to-day tasks, like assisting cafeteria staff in cleaning cafeteria tables (Garcia-Haro et al., 2020) and handling and transporting supermarket products (Zia Ur Rahman et al., 2022). The success of such robots largely depends on two factors: time efficiency and robustness (Hvilshøj et al., 2012; Thakar et al., 2023; Sereinig et al., 2020). For example, in the case of a cafeteria cleaning robot, tables must be cleaned quickly so that they can be made available for new customers. Failure to grasp can have serious consequences, such as spilling leftover coffee from a mug, damaging the object, or even disturbing other objects in the vicinity, thereby making it difficult for the robot to complete the task.

Pre-grasp planning can improve both the time efficiency and robustness of the mobile manipulation (Naik, 2024). For example, i) by determining the optimal base poses for grasping objects (Naik et al., 2024b; Naik et al., 2024a; Jauhri et al., 2022a; Makhal and Goins, 2018; Vahrenkamp et al., 2013), grasping time can be optimized, or ii) by evaluating whether it is safe to attempt a grasp, grasping failures can be reduced (Naik et al., 2025; Shi et al., 2021). Like grasp planning for rigid objects, pre-grasp planning often requires the 6D pose estimate of the objects. Pose estimates are usually obtained using the robot’s onboard camera; if the robot can track the object poses while navigating to the manipulation scene, it can already start pre-grasp planning. However, it is not always possible to have a good view of the objects while approaching, which can make it difficult to track the poses.

Modern indoor environments are often equipped with cameras for various purposes, such as occupancy tracking, surveillance, smart office solutions, and more. These cameras often have an unoccluded overview of the objects. However, due to their greater distance from the objects, they also capture limited pixel information, making it difficult to rely solely on them for accurate 6D pose estimation. Furthermore, for pre-grasp and grasp planning, poses are required in the robot base frame. Transforming poses estimated in the external camera frames to the robot base frame introduces additional uncertainty due to the inherent uncertainties associated with mobile base localization. Nevertheless, these external cameras have the potential to enable better 6D object pose tracking when used in combination with the robot camera.

Relying on a single best guess of the object pose is always risky, as the accuracy of estimated object poses is influenced by factors such as occlusions, object symmetry, self-occlusions, and lighting conditions (Xiang et al., 2017; Tremblay et al., 2018; Shi et al., 2021). This risk can be reduced by also maintaining and exploiting the uncertainty in the estimate, which is usually expressed as a 6D object pose distribution (Manhardt et al., 2019; Deng et al., 2021). For example, uncertainty can be used to assess the probability of the availability of inverse kinematics (IK) solutions at the estimated base pose for grasping by projecting the pose distribution onto the base pose space (Figures 1a-i and ii). Similarly, uncertainty can be utilized to evaluate the probability of a successful grasp by mapping the pose distribution onto the grasp pose space (Figure 1b-iii and iv).

FIGURE 1


Multi-camera setup for pre-grasp planning. (a) Determining whether the robot should navigate to the selected base pose for grasping. (b) Determining whether the robot should attempt grasp using the selected grasp pose. a-i) Object pose uncertainty from distance. a-ii) Object pose uncertainty projected onto the base pose space. b-iii) Object pose uncertainty after navigating to the base pose for grasping. b-iv) Object pose uncertainty projected onto the grasp pose space.

We propose a multi-view 6D object pose distribution tracking framework that, in addition to the robot camera views, also fuses views from the external cameras in the environment for pose distribution tracking. Specifically, we extend the PoseRBPF (Rao–Blackwellized Particle Filter) (Deng et al., 2021) to utilize information from the external cameras.

Ideally, fusing observations from external cameras should improve the accuracy of the estimated pose distributions. However, multi-camera fusion also introduces several challenges. For example, how do we ensure that, during fusion, good observations from some cameras are not corrupted by poor observations from others? And how do we determine which observations to trust when cameras have contradictory observations? Moreover, a single object pose is required for robotics tasks, and deriving a single pose from multimodal pose distributions is not trivial. Finally, there are no benchmarks for evaluation that reflect a multi-camera setup for pre-grasp planning (Figure 1a), where the external cameras are positioned far away from the objects while the robot camera views are expected to improve as the robot moves closer to the objects.

We extend the multi-view object pose distribution tracking framework for pre-grasp planning initially introduced in Naik et al. (2022) to address these challenges. In Naik et al. (2022), multi-view fusion was performed by taking the product of observation likelihoods from all camera views without considering potentially unreliable observations from certain cameras. In addition, the standard re-sampling and pose estimation techniques proposed in Deng et al. (2021) were adopted, which did not account for the additional complexities introduced by the multi-camera setup for the pre-grasp approach in mobile manipulation. Finally, the framework was evaluated only on a simulated dataset without any real-world evaluation. In this study, we address these limitations through the following contributions.

  • To effectively deal with poor observations and conflicting evidence, we propose:

    • multi-view exclusion—a failure model to determine which camera views should be considered during fusion;

    • adaptive re-sampling—a re-sampling strategy to avoid premature convergence to a particular hypothesis before sufficient evidence is obtained from all camera observations;

    • pose estimation from tracked distribution that maintains multiple pose hypotheses when sufficient temporal and spatial evidence is not available to confidently converge on a single pose estimate;

  • Real-world benchmark dataset and evaluation. We contribute a novel “multi-view YCB object pose tracking dataset for mobile manipulation (MY-MM)” and conduct an extensive evaluation on this real-world dataset.

  • Application. We exemplify the importance of maintaining object pose uncertainties during pre-grasp planning on mobile robots for the tasks shown in Figure 1, evaluating:

    • the availability of valid IK solutions for grasping the object using the estimate of the selected base pose;

    • grasp success using the estimate of the selected grasp pose.

1.1 Related research

In this section, we summarize related research in 6D object pose estimation, pose distribution estimation, and multi-view pose estimation, and then we discuss our contribution.

1.1.1 6D object pose estimation

Before the advances in deep learning (Krizhevsky et al., 2012), 6D object pose estimation was often addressed using template-based (Choi and Christensen, 2012b; Lim et al., 2014; Hinterstoisser et al., 2011) or feature-based methods (Drost et al., 2010; Choi and Christensen, 2012a; Collet et al., 2011). In the past decade, research focus has shifted to data-driven, learning-based approaches. Initial work focused on end-to-end regression from image to 6D pose (Kendall et al., 2015). Later studies have tried to better deal with object symmetry and occlusions using novel architectures and loss functions (Xiang et al., 2017), solving orientation as classification (Kehl et al., 2017; Corona et al., 2018), or using keypoints (Pavlakos et al., 2017; Rad and Lepetit, 2017; Peng et al., 2019; Haugaard and Buch, 2022). Furthermore, since robotics tasks such as grasping require very high accuracy, iterative methods have also been developed (Li Y. et al., 2018; Manhardt et al., 2018; Wang et al., 2019; Zakharov et al., 2019) for refining the 6D object poses. However, all these methods only provide best-guess pose estimates without any information regarding the associated uncertainty.

1.1.2 Object pose distribution estimation

Pose uncertainties are traditionally modeled using unimodal distributions, such as Gaussians. However, given that pose distributions are often multimodal, it is preferable to employ multimodal distributions, such as a mixture of Gaussians or histograms.

Most existing research has focused solely on estimating the rotational distribution (Okorn et al., 2020; Murphy et al., 2021; Iversen et al., 2022), assuming that good translation estimates are available. Research involving the estimation of a full 6D object pose distribution typically either allows multiple pose predictions from the network (Manhardt et al., 2019) or separately models the translation and rotational distributions (Glover et al., 2012; Deng et al., 2019; Deng et al., 2021). Translational uncertainty is usually modeled by a simple Gaussian, while rotational uncertainty is modeled using a Bingham mixture model (Glover et al., 2012; Okorn et al., 2020) or a multi-modal histogram distribution (Corona et al., 2018; Deng et al., 2019).

Moreover, most such studies do not directly predict the object pose. They infer the pose either by comparing the object image against the rendered images of the object (Corona et al., 2018), using a learned comparison function against image dictionaries (Okorn et al., 2020), or by codebook matching and training a de-noising auto-encoder for generating codes (Sundermeyer et al., 2018; Deng et al., 2021). Others have represented the distributions implicitly, using a neural network that estimates the probability of a candidate pose given the input image (Murphy et al., 2021; Haugaard et al., 2023).

In this study, we separately model translational and rotational distributions and use codebook matching to determine the likelihood of different rotations.

1.1.3 Multi-view object pose estimation

Multi-view pose estimation involves estimating the 6D pose of an object by considering multiple views of it (Collet and Srinivasa, 2010; Susanto et al., 2012; Kanezaki et al., 2018; Li C. et al., 2018; Xie et al., 2019; Labbé et al., 2020). Sequentially integrating additional views of an object over time for pose estimation is referred to as “object pose tracking.” This includes object-based SLAM methods (Salas-Moreno et al., 2013; Mendes et al., 2016) or active vision methods (Zeng et al., 2017; Kaba et al., 2017; Deng et al., 2020; An et al., 2024) that capture multiple views of an object and improve the pose estimate over time. Some recent research has also concentrated on 6D object pose tracking in a manipulation context (Deng et al., 2019; Wang et al., 2020; Wen et al., 2020).

Other approaches concentrate on simultaneously using multiple views of an object from various camera viewpoints. Some of these methods jointly estimate the 3D pose from multiple object views (Helmer et al., 2010; Su et al., 2015; Kanezaki et al., 2018), others first generate pose estimates in each view and then employ various techniques to fuse multiple pose estimates into one (Collet and Srinivasa, 2010; Susanto et al., 2012; Li C. et al., 2018; Xie et al., 2019; Labbé et al., 2020), while recent work (Haugaard and Iversen, 2023) has used learned 2D–3D correspondence distributions for multi-view pose estimation.

However, not much prior research exists in multi-view full-pose distribution estimation. Erkent et al. (2016) have introduced an approach for integrating the 6D pose estimates from different views by assuming Gaussian uncertainty in each view. As discussed above, a unimodal distribution is not sufficient for modeling symmetric or occluded objects. They also assume only stationary cameras and do not incorporate any motion model for tracking the estimated pose distribution, which is essential for manipulation tasks. To our knowledge, there is no existing research that explores multiple simultaneous views along with sequential temporal integration to estimate multimodal object pose distributions.

2 Materials and methods

2.1 Problem definition

We define the problem as 6D object pose distribution tracking using the robot's onboard camera and external cameras in the environment for pre-grasp planning on mobile robots. Let $C_t$, $E_{k,t}$, $O_t$, and $W$ be the robot camera frame, the frame of external camera $k$, the object frame, and the world frame at time $t$, respectively. ${}^{A}T_{B}$ refers to the transformation from frame $A$ to frame $B$. Then, given:

  • the transformation of each camera w.r.t. the world frame (${}^{W}T_{C_t}$ and ${}^{W}T_{E_{k,t}}$),

  • the motion of the robot camera ${}^{C_{t-1}}T_{C_t}$,

  • a textured CAD model of the object,

  • an initial (at $t = 0$) object detection (2D bounding box),

  • observations (RGB-D images) from the external cameras and the robot camera at time $t$,

our objective is to estimate and track the object pose distribution $P(X_t \mid Z_{1:t})$, where the state $X_t$ consists of the object rotation $R_t$ and translation $T_t$ expressed in the robot camera frame, by fusing observations from the external cameras in addition to the robot camera, and subsequently to infer a pose estimate $\hat{X}_t$ from the tracked distribution.

Our approach is inspired by the single-camera 6D pose distribution tracking framework PoseRBPF (Deng et al., 2021). We extend PoseRBPF to the pre-grasp planning multi-camera setup (Figure 1a). In the subsequent sections, we briefly introduce PoseRBPF, including relevant mathematical details for multi-view extensions (Section 2.2), and then present the proposed multi-view extensions (Section 2.3).

2.2 PoseRBPF

The posterior distribution of the 6D object pose given the observations up to time $t$ can be expressed as $P(X_t \mid Z_{1:t})$, where $X_t$ is a 6D state variable and $Z_{1:t}$ refers to the observations from time 1 to $t$. The 6D state variable consists of a rotational component $R_t$ and a translational component $T_t$. Since dense sampling in the 6D space is computationally intractable, PoseRBPF models the rotational component as conditioned on the translational component, as described in Equation 1:

$P(X_t \mid Z_{1:t}) = P(T_t \mid Z_{1:t}) \, P(R_t \mid T_t, Z_{1:t})$. (1)

In addition, particles are sampled only over the translation space, marginalizing over the rotation (Rao–Blackwellization) (Murphy and Russell, 2001; Doucet et al., 2001). This approximation allows efficient sampling in the 6D space.

To effectively capture the object's symmetry and the visual ambiguity arising from occlusions, the rotation conditioned on the translation is modeled as a 3D histogram distribution. Each bin in the histogram represents a discretized rotational component $R^{jkl}$ of the object pose at time $t$, with $j$, $k$, and $l$ denoting the elevation, azimuth, and bank indices, respectively. Discretization is performed with a resolution of 5°. As a result, the histogram distribution consists of 37 × 72 × 72 (elevation, azimuth, bank) bins. The probability of $R^{jkl}$ is $P(R_t = R^{jkl} \mid T_t, Z_{1:t})$, where $j \in \{1, \dots, 37\}$, $k \in \{1, \dots, 72\}$, and $l \in \{1, \dots, 72\}$.
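To make the discretization concrete, the following minimal Python sketch maps Euler angles to bin indices on a 5° grid; it assumes elevation spans [−90°, 90°] and azimuth/bank span [0°, 360°), which yields the 37 × 72 × 72 histogram described above. The function and constant names are illustrative and not taken from the PoseRBPF implementation.

```python
import numpy as np

# 5-degree discretization of the rotation space as (elevation, azimuth, bank) bins,
# matching the 37 x 72 x 72 histogram described above.
ELEV_BINS, AZIM_BINS, BANK_BINS = 37, 72, 72
RES_DEG = 5.0

def rotation_to_bin(elev_deg, azim_deg, bank_deg):
    """Map Euler angles (degrees) to histogram bin indices.

    Assumes elevation in [-90, 90] and azimuth/bank in [0, 360).
    """
    j = int(round((elev_deg + 90.0) / RES_DEG))   # 0 .. 36
    k = int(azim_deg % 360.0 // RES_DEG)          # 0 .. 71
    l = int(bank_deg % 360.0 // RES_DEG)          # 0 .. 71
    return min(j, ELEV_BINS - 1), k, l

# A per-particle conditional rotation distribution P(R | T_i, Z) is then a
# normalized histogram over this grid, initialized uniformly.
rot_hist = np.full((ELEV_BINS, AZIM_BINS, BANK_BINS),
                   1.0 / (ELEV_BINS * AZIM_BINS * BANK_BINS))
print(rotation_to_bin(12.0, 250.0, 95.0), rot_hist.sum())
```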

Thus, the $i$th particle at time $t$ is represented by a translation $T_t^i$ and a discretized rotation distribution conditioned on the translation, $P(R_t \mid T_t^i, Z_{1:t})$:

$X_t^i = \big(T_t^i, \; P(R_t \mid T_t^i, Z_{1:t})\big)$. (2)

The posterior distribution at time $t$ is approximated by a set of $N$ particles expressed as

$\mathcal{X}_t = \{X_t^1, X_t^2, \dots, X_t^N\}$. (3)

During initialization, the translational components of the particles are sampled around the available translation estimate of the object, assuming Gaussian uncertainty. The discretized rotation distribution conditioned on the translation for each particle is initialized as a uniform distribution.

At each time-step $t$, the set of particles from the previous time-step $t-1$ is propagated to the current time-step using a motion model. The motion propagation involves sequentially applying the transformations $({}^{C_{t-1}}T_{C_t})^{-1}$ (inverse of the motion of the robot camera frame), ${}^{C_{t-1}}T_{O_{t-1}}$ (transformation from the previous robot camera frame to the previous object frame), and ${}^{O_{t-1}}T_{O_t}$ (motion of the object frame):

${}^{C_t}T_{O_t} = ({}^{C_{t-1}}T_{C_t})^{-1} \; {}^{C_{t-1}}T_{O_{t-1}} \; {}^{O_{t-1}}T_{O_t}$. (4)
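As an illustration of this composition, the sketch below chains 4 × 4 homogeneous transforms with NumPy; the argument names are ours and only mirror the verbal description above.

```python
import numpy as np

def propagate_object_pose(T_cam_prev_to_cam_cur,   # robot camera motion from t-1 to t
                          T_cam_prev_to_obj_prev,  # object pose in the previous camera frame
                          T_obj_prev_to_obj_cur):  # object motion from t-1 to t
    """Compose 4x4 homogeneous transforms to express the object pose in the
    current robot camera frame, following the motion model above."""
    return (np.linalg.inv(T_cam_prev_to_cam_cur)
            @ T_cam_prev_to_obj_prev
            @ T_obj_prev_to_obj_cur)

# For a static object, the object motion is simply the identity transform.
static_object_motion = np.eye(4)
```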

The observation likelihood $P(Z_t \mid X_t^i)$ is measured using the degree of agreement between the observation $Z_t$ and the propagated particles $\bar{\mathcal{X}}_t$. A codebook of vector embeddings is pre-computed for the discretized rotation space of the object using a de-noising auto-encoder (Sundermeyer et al., 2018). The pre-computed encodings from the codebook for different rotations of the object are then compared with the encodings obtained from the current observation $Z_t$.

As each particle consists of a translation and a discretized rotation distribution conditioned on the translation (Equation 2), the observation likelihood for each particle is a histogram, with each bin containing the likelihood of the discretized rotation.

The importance factor (weight) of an individual particle is directly proportional to its observation likelihood (Thrun, 2002).

The weight for each particle is calculated by marginalizing the observation likelihood over the rotation samples, as shown in Equation 5:

$w_t^i \propto \sum_{j,k,l} P\big(Z_t \mid T_t^i, R^{jkl}\big)$. (5)

Re-sampling involves increasing the number of relevant particles—that is, the particles with higher weights are more likely to be duplicated, and those with lower weights are more likely to be discarded (Thrun, 2002). After the re-sampling step, a multi-set of important particles is obtained.
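For reference, a minimal sketch of this standard (multinomial) importance re-sampling step is shown below; the adaptive variant proposed later in Section 2.3.5 replaces this step. The function name is illustrative.

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Multinomial re-sampling: particles with larger weights are more likely
    to be duplicated; low-weight particles tend to be discarded."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=probs)
    return [particles[i] for i in idx]
```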

The translational components of these particles represent the posterior translation distribution $P(T_t \mid Z_{1:t})$ at time $t$. To obtain the posterior rotation distribution $P(R_t \mid Z_{1:t})$, the individual observation likelihoods of each particle are marginalized over the translations. The posterior rotational distribution with temporal fusion up to time $t$ is obtained using Bayes' rule.

Finally, the 6D pose estimate of the object is computed from the distribution $P(X_t \mid Z_{1:t})$. The translation estimate $\hat{T}_t$ at time $t$ is obtained by taking the average of the posterior translation distribution. The rotation estimate $\hat{R}_t$ is obtained by taking the weighted average of the samples in the neighborhood of the rotation estimate at the previous time-step after applying the motion from time $t-1$ to $t$.

2.3 Proposed approach

In this section, we first provide a brief overview of the proposed multi-view pose distribution tracking framework using Figure 2 and then present our contributions in detail.

FIGURE 2


Rao–Blackwellized particle filter proposed by Deng et al. (2021) extended for multi-view 6D pose distribution tracking.

2.3.1 Overview

Figure 2 describes the architecture of the proposed multi-view 6D object pose distribution tracking framework for a three-camera setup comprising a robot camera and two external cameras. The posterior distribution is modeled as a multi-modal distribution (Section 2.3.2). The particles approximating the posterior distribution at time $t-1$ (Figure 2a) are propagated to time $t$ using the motion model (Figure 2b; Section 2.3.3). Observation likelihoods for these propagated particles are then computed for each camera (Figures 2c,d), as proposed in PoseRBPF. Next, the observations from different cameras are fused (Figure 2e; Section 2.3.4). Each particle is then weighed based on the fused observations, followed by adaptive re-sampling (Section 2.3.5). The posterior distributions at time $t$ are obtained using temporal fusion. The 6D pose is then estimated using the obtained pose distribution (Section 2.3.6), and the re-sampled particles are advanced to the next time-step.

It should be noted that the robot camera frame is used as a reference frame, and both object pose distribution and pose estimate are estimated in this frame.

2.3.2 Multi-modal posterior distribution

As each camera can have different translation and rotation beliefs, we model both the translation and rotation posterior distributions as multi-modal distributions. We assume that the uncertainty in the translation follows a Gaussian distribution in each camera. By combining the Gaussian distributions from all camera frames, the resultant posterior translation distribution is modeled as a mixture of Gaussians, as shown in Equation 6:

$P(T_t \mid Z_{1:t}) = \sum_{c=1}^{C} \phi_c \, \mathcal{N}(T_t; \mu_c, \Sigma_c)$, (6)

where $C$ is the number of camera frames, $\mathcal{N}(\cdot; \mu_c, \Sigma_c)$ is the normal distribution with parameters $\mu_c$ (mean) and $\Sigma_c$ (co-variance matrix), and $\phi_c$ is the weight for the particular camera.

During the initialization of the tracking (at $t = 0$), the translation components of the particles are sampled around the translation estimate in each camera obtained either from an object detector (Wu et al., 2019) or an alternate pose estimator (Xiang et al., 2017; Wang et al., 2019). In the case of an object detector, the bounding box center and the corresponding depth (in the camera depth image) are used as a translation estimate. The particles are then sampled around these estimates assuming Gaussian noise with standard deviation $\sigma_c$ in each camera $c$. If more than one camera provides an initial translational estimate $\mu_c$, then the resultant posterior translation distribution is a mixture of Gaussians. For the three-camera setup in Figure 2, the posterior translation distribution is represented using three Gaussians (Figure 2a). The discretized rotation distribution conditioned on the translation for each particle is initialized as a uniform distribution, similar to PoseRBPF.

As the camera from which a particle originated at $t = 0$ is required during the re-sampling step (Section 2.3.5), we represent the particles in $\mathcal{X}_t$ (Equation 3) originating from camera $c$ using the set $\mathcal{X}_t^c$. For $C$ cameras, there exist $C$ such particle sets $\{\mathcal{X}_t^1, \dots, \mathcal{X}_t^C\}$, where each $\mathcal{X}_t^c$ is a subset of $\mathcal{X}_t$ with size $N_c$ and $\sum_{c=1}^{C} N_c = N$. The total number of particles $N$ in the particle set $\mathcal{X}_t$ is kept constant, so that at any time $t$, $|\mathcal{X}_t| = N$. This is ensured during the re-sampling step.
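The sketch below illustrates the per-camera initialization described above under simplifying assumptions: the per-camera translation estimates are assumed to already be expressed in the common (robot camera) frame, and the Gaussian noise is applied directly in metric space, whereas the paper samples in image coordinates and depth (Section 3.1.4). Particle counts, noise levels, and names are illustrative.

```python
import numpy as np

def init_particles(cam_estimates, particles_per_cam=40,
                   sigma_xyz=(0.05, 0.05, 0.10), rng=np.random.default_rng()):
    """Sample particles around each camera's initial translation estimate.

    cam_estimates: one rough 3D object translation per camera, already
    expressed in the reference (robot camera) frame. The resulting cloud
    approximates a mixture of Gaussians with one mode per camera.
    """
    particles, origin_cam = [], []
    for c, t_est in enumerate(cam_estimates):
        noise = rng.normal(0.0, sigma_xyz, size=(particles_per_cam, 3))
        particles.append(np.asarray(t_est) + noise)
        origin_cam.extend([c] * particles_per_cam)  # remembered for re-sampling (Sec. 2.3.5)
    return np.concatenate(particles), np.array(origin_cam)

translations, origins = init_particles([[0.1, 0.0, 2.6], [0.0, -0.2, 2.9], [0.3, 0.1, 3.1]])
```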

2.3.3 Motion propagation

The particles originating from the robot camera frame are propagated as described in Equation 4 (same as PoseRBPF). The motion propagation for particles originating from the external camera frames (as described in Equation 7) involves sequentially applying the transformations ${}^{C_t}T_{E_{k,t}}$ (transformation from the current robot camera frame to the current external camera frame), $({}^{E_{k,t-1}}T_{E_{k,t}})^{-1}$ (inverse of the motion of the external camera frame), ${}^{E_{k,t-1}}T_{C_{t-1}}$ (transformation from the previous external camera frame to the previous robot camera frame), ${}^{C_{t-1}}T_{O_{t-1}}$ (transformation from the previous robot camera frame to the previous object frame), and ${}^{O_{t-1}}T_{O_t}$ (motion of the object frame):

${}^{C_t}T_{O_t} = {}^{C_t}T_{E_{k,t}} \; ({}^{E_{k,t-1}}T_{E_{k,t}})^{-1} \; {}^{E_{k,t-1}}T_{C_{t-1}} \; {}^{C_{t-1}}T_{O_{t-1}} \; {}^{O_{t-1}}T_{O_t}$. (7)

External cameras and objects are assumed stationary, and hence their motion from time $t-1$ to $t$ is zero (i.e., ${}^{E_{k,t-1}}T_{E_{k,t}}$ and ${}^{O_{t-1}}T_{O_t}$ are identity matrices). The uncertainty of the transformation from the external to the robot camera is modeled as Gaussian noise. The uncertainty of the relative motion of the robot camera is modeled as Gaussian noise with dispersion proportional to the relative motion (Thrun, 2002).

At the end of the motion propagation step, a new particle set $\bar{\mathcal{X}}_t$ is obtained, which contains the updated particles from the previous time-step after applying the motion. Each particle in $\mathcal{X}_{t-1}$, with its translation $T_{t-1}^i$ and rotation distribution conditioned on the translational component $P(R_{t-1} \mid T_{t-1}^i, Z_{1:t-1})$, is propagated from time $t-1$ to $t$, resulting in $\bar{T}_t^i$ and $P(R_t \mid \bar{T}_t^i, Z_{1:t-1})$.

NOTE: In the remaining sections, frames are represented by integers ranging from 1 to $C$ for simplification. Specifically, the robot camera frame is denoted by 1, while the external camera frames are represented by integers from 2 to $C$. Additionally, when a translation or rotation variable is not explicitly associated with a specific camera frame, it should be assumed to be in the robot camera frame, such that, for example, $T_t$ and $T_t^{(1)}$ both represent the translation estimate in the robot camera frame.

2.3.4 Multi-view fusion

As the observation likelihoods1 $P(Z_{c,t} \mid \bar{X}_t^i)$ obtained for each particle $\bar{X}_t^i$ in camera $c$ are assumed to be conditionally independent of each other, the joint observation likelihood can be obtained by fusing the individual observation likelihoods (Durrant-Whyte and Henderson, 2016), as described in Equation 8:

$P(Z_{1,t}, \dots, Z_{C,t} \mid \bar{X}_t^i) = \prod_{c=1}^{C} P(Z_{c,t} \mid \bar{X}_t^i)$. (8)

The observation likelihoods from different cameras are first transformed into a common reference frame. The likelihoods of corresponding bins across views are then fused to obtain the resultant likelihood. While this ensures consistency across views and mitigates potential scale bias, such a fusion is unable to handle erroneous observations (Khaleghi et al., 2013). For instance, if the object is not visible in one of the cameras (i.e., empty histogram), the fused observation likelihood will also be an empty histogram.

To address this issue, additive/Laplace smoothing (Schütze et al., 2008) can be performed on the individual observation likelihoods before the fusion. However, if camera $c$ has a poor view of the object (for example, due to occlusions), the observation likelihood obtained for particle $\bar{X}_t^i$ in camera frame $c$ will most likely be inaccurate. When combined with good observations from other cameras, such an observation can corrupt the overall fusion result.

We address this by defining a failure model to assess whether an observation from a particular camera should be included in the fusion process (Khaleghi et al., 2013). Let $\psi_{c,t}$ represent the maximum likelihood value for the observation $Z_{c,t}$ over all possible translations and rotations in the particle set, as described in Equation 9:

$\psi_{c,t} = \max_{i,\,j,\,k,\,l} P\big(Z_{c,t} \mid \bar{T}_t^i, R^{jkl}\big)$. (9)

It is shown in Deng et al. (2021) that $\psi_{c,t}$ is correlated with the degree of object occlusion. If $\psi_{c,t}$ falls below 0.6, which corresponds to approximately 65% of the object being occluded, then the object can no longer be reliably tracked. We use this insight to define the failure model that determines whether the observation likelihoods for particles in camera $c$ (Figure 2d) should be included in the fusion (Figure 2e). The observation likelihood for particle $\bar{X}_t^i$ in camera frame $c$ is included during fusion only if $\psi_{c,t}$ is greater than a pre-defined threshold value $\tau_{fail}$.

Observation likelihoods in different frames are fused together (Algorithm 1). Prior to fusion, observation likelihoods from different frames are transformed into the robot camera frame, and additive/Laplace smoothing is applied by adding a small minimum likelihood $\epsilon$ to all rotations in the discretized rotation space. The observation likelihood from frame $c$ is included in the fusion only if $\psi_{c,t}$ exceeds $\tau_{fail}$.

Algorithm 1

  • 1: Input: observation likelihoods $P(Z_{c,t} \mid \bar{X}_t^i)$ for cameras $c = 1, \dots, C$, threshold $\tau_{fail}$, smoothing constant $\epsilon$

  • 2: Output: joint observation likelihood $P(Z_{1,t}, \dots, Z_{C,t} \mid \bar{X}_t^i)$

  • 3:

  • 4: initialize the joint likelihood histogram with ones

  • 5: for $c$ in $\{1, \dots, C\}$ do

  • 6:  if $\psi_{c,t} > \tau_{fail}$ then

  • 7:   transform $P(Z_{c,t} \mid \bar{X}_t^i)$ to the robot camera frame

  • 8:   multiply the joint likelihood by $P(Z_{c,t} \mid \bar{X}_t^i) + \epsilon$

  • 9:  end if

  • 10: end for

  • 11: return the normalized joint likelihood

At the end of this step, the joint observation likelihood is obtained for each particle $\bar{X}_t^i$ in the particle set $\bar{\mathcal{X}}_t$ (Figure 2e).
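A minimal sketch of this fusion rule is given below. It assumes that each camera's likelihood histogram has already been transformed into the robot camera frame and that the per-camera failure score (the maximum likelihood over all particles and rotations) has been pre-computed; the threshold and smoothing constants follow Section 3.1.4, and the function name is illustrative.

```python
import numpy as np

def fuse_likelihoods(per_camera_likelihoods, per_camera_psi, tau_fail=0.6, eps=1e-5):
    """Fuse per-camera rotation-likelihood histograms for one particle.

    per_camera_likelihoods: arrays of shape (37, 72, 72), already expressed in
    the robot camera frame. per_camera_psi: the maximum likelihood observed for
    each camera over all particles and rotations (the failure-model score).
    Cameras whose score does not exceed tau_fail are excluded; Laplace smoothing
    with eps keeps the product well defined when a histogram is (nearly) empty.
    """
    fused = np.ones_like(per_camera_likelihoods[0])
    for lik, psi in zip(per_camera_likelihoods, per_camera_psi):
        if psi > tau_fail:                 # failure model: include this view
            fused *= (lik + eps)
    return fused / fused.sum()             # normalized joint likelihood
```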

Note that in the rest of this study, for brevity, the joint observation likelihoods $P(Z_{1,t}, \dots, Z_{C,t} \mid \bar{X}_t^i)$ are represented as $P(Z_t \mid \bar{X}_t^i)$.

2.3.5 Adaptive re-sampling

As described in Section 2.3.4, the decision to include the observation likelihood of particle $\bar{X}_t^i$ in camera $c$ during the fusion is based on $\psi_{c,t}$ (Equation 9). Moreover, as explained in Section 2.3.2, at $t = 0$, for each camera $c$ with a translation estimate $\mu_c$ of the object, $N_c$ particles are sampled around this translation estimate assuming Gaussian-distributed noise. These particles, denoted by the set $\mathcal{X}_0^c$, represent the observation of camera $c$ at $t = 0$. Prior to re-sampling at time $t$, these particles are denoted using the set $\bar{\mathcal{X}}_t^c$. Consequently, before the filter converges, the particles in the set $\bar{\mathcal{X}}_t^c$ are more likely to have the maximum likelihood for camera $c$'s observation compared to the other sets of particles that represent other camera observations.

Conventional re-sampling techniques do not distinguish between particles representing different camera observations. Thus, if camera $c$ has a poor observation at time $t$ (before convergence) due to motion blur, partial occlusion, and so on, the particles representing the observations of camera $c$ will be more affected as they will have smaller weights than other particles. As a result, there is a possibility that these particles may be eliminated during re-sampling.

Consequently, there is an increased possibility that observations from camera $c$ may not have a sufficiently high $\psi_{c,t}$ during subsequent time-steps to be included during the fusion, as all the particles representing its observation were eliminated. As a result, the particle filter may prematurely converge to an inaccurate hypothesis by solely considering observations from other cameras and disregarding observations from camera $c$.

Separately re-sampling the particles in each $\mathcal{X}_t^c$ (Equation 3) can ensure that particles from each of the $C$ sets are retained throughout the tracking process. However, this can lead to a situation where a set of particles with a very poor initial translation estimate is also retained forever. Such uninformative particles would only waste computational resources without contributing to improved tracking.

To address this issue, we dynamically adjust the size of the particle sets based on the average aggregate weight of the particles within these sets. We increase the size of the particle set with the highest average aggregate weight and reduce the size of the other particle sets to maintain a constant number of particles $N$ at time $t$. Particles are then individually re-sampled from each of these sets. Thus, we prevent the elimination of particles originating from certain cameras due to a few poor observations in those cameras; however, if a camera has consistently poor observations, then its particles are eventually eliminated.

In the rest of this section, we provide technical details of our approach. Readers can optionally proceed directly to Section 2.3.6 without these details.

We refer to our approach as "adaptive re-sampling" (Algorithm 2). It takes $\bar{\mathcal{X}}_t$ (the particle set obtained at the end of the motion propagation step, Section 2.3.3) and $\mathcal{W}_t$ (the set of weights associated with the particles in $\bar{\mathcal{X}}_t$) as inputs and returns the new particle set $\mathcal{X}_t$ containing $N$ particles (Lines 1-2). $\mathcal{I}_t$ contains the re-sampled indices of the particles in the set $\bar{\mathcal{X}}_t$ that are used for creating the new particle set $\mathcal{X}_t$.

We let one helper function return the indices of the particles in the particle set $\bar{\mathcal{X}}_t$ that originated from camera $c$'s observation at $t = 0$, and a second helper return the weights of these particles.

Furthermore, "bpsi" (best particle set index) is the index of the particle set with the highest average aggregate particle weight (Line 4), while "bpibps" (best particle index in the best particle set) is the index of the particle with the maximum weight in the best particle set (Line 5).

The size of each of the particle sets is dynamically adjusted based on the average aggregate weight of the particles in these sets (Lines 6-12). The sizes of the particle sets with lowly weighted particles are reduced, while the size of the particle set with the highest average aggregate weight is increased. The particle with the maximum weight from the best particle set is duplicated $C - 1$ times. The particles with the lowest weights from the other particle sets ("wpipsn", the worst particle index in particle set $n$) are eliminated to ensure that $N$ remains constant. Finally, the particles from each set are separately re-sampled (Lines 13-18). This incremental process of replicating the "best" particle followed by re-sampling at each time-step prevents sudden particle impoverishment and helps maintain multiple hypotheses, thereby reducing the risk of mode collapse.

Algorithm 2

  • 1: Input: propagated particle set $\bar{\mathcal{X}}_t$, particle weights $\mathcal{W}_t$

  • 2: Output: re-sampled particle set $\mathcal{X}_t$ with $N$ particles

  • 3: $\mathcal{I}_t \leftarrow \emptyset$       ⊳ Multi-set (can contain duplicates)

  • 4: bpsi = index of the particle set with the highest average aggregate weight

  • 5: bpibps = index of the particle with the maximum weight in set bpsi

  • 6: for each particle set $n \in \{1, \dots, C\}$ do

  • 7:  if $n$ not equal to bpsi then

  • 8:   wpipsn = index of the particle with the lowest weight in set $n$

  • 9:   remove particle wpipsn from set $n$

  • 10:   duplicate particle bpibps in set bpsi

  • 11:  end if

  • 12: end for         ⊳ Re-sampling

  • 13: for each particle set $n \in \{1, \dots, C\}$ do

  • 14:  for each slot in set $n$ do

  • 15:   draw an index $i$ from set $n$ with prob. proportional to its weight

  • 16:   add $i$ to $\mathcal{I}_t$

  • 17:  end for

  • 18: end for

  • 19: $\mathcal{X}_t \leftarrow \{\bar{\mathcal{X}}_t[i] : i \in \mathcal{I}_t\}$

  • 20: return $\mathcal{X}_t$

By adopting this strategy, we prevent premature convergence and allow the maintenance of a diverse set of particles for a specific duration so that sufficient temporal evidence is obtained from all cameras before convergence. This duration depends on the number of particles in each of the particle sets $\mathcal{X}_t^c$.
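The following compact sketch illustrates the adaptive re-sampling idea under the bookkeeping described above (each particle remembers the camera it was initialized from): the best-performing camera's set grows by one particle per other set, the other sets shrink by one, and each set is then re-sampled separately. All names are illustrative, and the per-step size adjustment is an assumption consistent with the incremental behaviour described above.

```python
import numpy as np

def adaptive_resample(particles, weights, origins, rng=np.random.default_rng()):
    """Adaptive re-sampling sketch: adjust per-camera set sizes, then re-sample
    each set separately so no camera's hypothesis vanishes after a single bad
    observation. `origins[i]` is the camera a particle was initialized from."""
    weights = np.asarray(weights, dtype=float)
    cams = np.unique(origins)
    # Camera set with the highest average weight, and its best particle.
    mean_w = {c: weights[origins == c].mean() for c in cams}
    best_cam = max(mean_w, key=mean_w.get)
    best_set = np.flatnonzero(origins == best_cam)
    best_idx = int(best_set[np.argmax(weights[best_set])])

    new_particles, new_origins = [], []
    for c in cams:
        idx = np.flatnonzero(origins == c)
        if c != best_cam:
            idx = np.delete(idx, np.argmin(weights[idx]))   # drop the worst particle
        else:                                               # duplicate the best particle
            idx = np.concatenate([idx, np.full(len(cams) - 1, best_idx, dtype=int)])
        probs = weights[idx] / weights[idx].sum()
        drawn = rng.choice(idx, size=len(idx), p=probs)     # re-sample within the set
        new_particles.extend(particles[i] for i in drawn)
        new_origins.extend([c] * len(idx))
    return new_particles, np.array(new_origins)
```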

2.3.6 Pose estimation from tracked distribution

For the multi-camera setup considered here, initially at $t = 0$, only uncertain pose estimates are available due to the large distance between the cameras and the objects. Hence, assumptions made in Deng et al. (2021), such as having a good initial object rotation estimate (at $t = 0$), are not always valid. To address this, we maintain multiple pose hypotheses when sufficient evidence is not available in the tracked distribution to converge to a single pose estimate. In the following sections, we describe our approach for estimating the translation and rotation components of $\hat{X}_t$ using the distribution $P(X_t \mid Z_{1:t})$.

2.3.6.1 Translation estimate

The posterior translation distribution is modeled as a mixture of Gaussians with $N$ (number of particles) samples. This distribution in general reduces to a unimodal Gaussian distribution as the filter starts to converge. The translation estimate $\hat{T}_t$ of the object pose at time $t$ is obtained by taking a weighted average of the particles of the dominant Gaussian (i.e., the Gaussian with the highest associated weight) in the posterior translation distribution.

2.3.6.2 Rotation estimate

The posterior rotation distribution is modeled as a potentially multimodal histogram distribution. While a weighted average can be employed to compute the rotation estimate when the histogram distribution is unimodal, it can yield an incorrect estimate when the distribution is multimodal.

We maintain multiple rotation estimates when sufficient evidence is not present in the posterior rotation distribution to converge to a single pose estimate. Thus, we maintain and track a set of rotation hypotheses, as described in Equation 10:

$\Phi_t = \{(r_{h,t}, c_{h,t})\}_{h=1}^{H}$, (10)

where $r_{h,t}$ is the $h$th rotation hypothesis and $c_{h,t}$ is the associated confidence.

In the subsequent sections, we present our approach. In Section 2.3.6.2.1, we introduce the notion of supporting samples, and in Section 2.3.6.2.2 we outline the algorithm for maintaining a set of rotation hypotheses.

2.3.6.2.1 Supporting samples

Supporting samples $S_{h,t}$ are the discretized rotational components in the posterior rotational distribution $P(R_t \mid Z_{1:t})$ that can be associated with the rotation hypothesis $r_{h,t-1}$. A discretized rotation $R'$ is considered to be part of the supporting samples if it lies in the neighborhood of $\bar{r}_{h,t}$ (the rotation hypothesis after applying the camera motion from $t-1$ to $t$) and has a probability greater than $p_{min}$, as described in Equation 11:

$S_{h,t} = \big\{\, R' \;:\; d(R', \bar{r}_{h,t}) \le \theta_{nb} \ \text{and} \ P(R_t = R' \mid Z_{1:t}) > p_{min} \,\big\}$, (11)

where $d(\cdot,\cdot)$ is the angular difference between two rotations, $\theta_{nb}$ is a threshold angular difference that determines whether the discretized rotation $R'$ is in the neighborhood of $\bar{r}_{h,t}$, and $p_{min}$ is the minimum probability that a discretized rotation must have to be part of $S_{h,t}$.

Figure 3 illustrates the computation of supporting samples for three different rotation hypotheses in a set $\Phi_{t-1}$ (Figure 3a). Initially, the posterior rotation distribution $P(R_t \mid Z_{1:t})$ is obtained at time $t$ (Figure 3b), and the rotations with probabilities greater than $p_{min}$ are extracted (Figure 3c). This is followed by applying the motion from time $t-1$ to $t$ to all the rotation hypotheses to obtain $\bar{r}_{h,t}$ (Figure 3d). Finally, the supporting samples $S_{h,t}$ are obtained for each rotation hypothesis, as in Equation 11 (Figure 3e).

FIGURE 3


Illustration for determining supporting samples. (a) Set of rotation hypotheses $\Phi_{t-1}$ at time $t-1$. (b) Posterior rotation distribution $P(R_t \mid Z_{1:t})$. (c) Rotation samples in $P(R_t \mid Z_{1:t})$ with probabilities greater than or equal to $p_{min}$. (d) Transformed rotation hypotheses $\bar{r}_{h,t}$ obtained from $\Phi_{t-1}$ after applying the motion from $t-1$ to $t$. (e) Supporting samples $S_{h,t}$ at time $t$ for each rotation hypothesis in $\Phi_{t-1}$.

It is assumed that the rotation hypothesis $\bar{r}_{h,t}$ can be associated with the posterior rotation distribution if more than $\eta$% of the samples with probabilities greater than or equal to $p_{min}$ in $P(R_t \mid Z_{1:t})$ are in $S_{h,t}$. For example, in Figure 3, most of the samples with probabilities greater than $p_{min}$ in the posterior rotation distribution are in the neighborhood of one particular hypothesis. Hence, if $\eta$ is set to 0.8 (80%), only that hypothesis can be associated with the samples in the posterior rotation distribution at time $t$.
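A sketch of the supporting-sample test in Equation 11 is given below, using the geodesic angle between rotations as the angular difference; SciPy's Rotation class is used for the distance computation, the thresholds follow Section 3.1.4, and the function signature is illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def supporting_samples(posterior_rots, posterior_probs, hypothesis,
                       theta_nb_deg=30.0, p_min=1e-5):
    """Return indices of discretized rotations that support a hypothesis.

    posterior_rots: list of scipy Rotation objects (bin centers of the histogram).
    posterior_probs: their probabilities under P(R_t | Z_1:t).
    hypothesis: the motion-compensated rotation hypothesis (a Rotation).
    A sample supports the hypothesis if it lies within theta_nb_deg of it
    and carries a probability greater than p_min.
    """
    support = []
    for idx, (rot, p) in enumerate(zip(posterior_rots, posterior_probs)):
        angle = np.degrees((hypothesis.inv() * rot).magnitude())  # geodesic distance
        if angle < theta_nb_deg and p > p_min:
            support.append(idx)
    return support
```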

2.3.6.2.2 Maintaining a set of rotation hypotheses

If the samples in the posterior rotation distribution can be associated with the rotation hypothesis $\bar{r}_{h,t}$, then $r_{h,t}$ is updated using the expectation of the supporting samples. However, if the samples in the posterior rotation distribution cannot be reliably associated with the rotation hypothesis $\bar{r}_{h,t}$, then the hypothesis is split into two at time $t$: one suggested by the data that could be associated with $\bar{r}_{h,t}$ and the other suggested by the data that could not be associated with it (Thrun, 2002). The algorithm for maintaining a set of rotation hypotheses is described in Algorithm 3.

Initially, the rotation hypothesis $r_{h,t-1}$ under consideration is updated using the camera motion from time $t-1$ to $t$, resulting in $\bar{r}_{h,t}$ (Line 6). Subsequently, the supporting samples $S_{h,t}$ are determined, and if the rotation hypothesis can be associated with the posterior rotation distribution $P(R_t \mid Z_{1:t})$, $r_{h,t}$ is updated by taking the expectation of the supporting samples (Lines 6-10).

If $\bar{r}_{h,t}$ cannot be associated with the posterior rotation distribution, a new rotation hypothesis is introduced (Line 12). The confidence in this hypothesis is computed relative to the confidence in the parent hypothesis $r_{h,t-1}$, similar to multi-hypothesis tracking in Thrun (2002). Finally, the confidences in all the rotation hypotheses are normalized (by dividing by the sum of all confidences). Only the rotation hypotheses with confidence greater than $\tau_{conf}$ are retained for the subsequent time-step (Line 18). The maximum number of hypotheses retained at any time-step is given by $1/\tau_{conf}$.

Algorithm 3

  • 1: Input: hypothesis set $\Phi_{t-1}$, posterior rotation distribution $P(R_t \mid Z_{1:t})$, camera motion from $t-1$ to $t$

  • 2: Output: updated hypothesis set $\Phi_t$

  • 3: $P_t \leftarrow P(R_t \mid Z_{1:t})$    ⊳ For simplification purpose

  • 4: $\Phi_t \leftarrow \emptyset$

  • 5: for each $(r_{h,t-1}, c_{h,t-1})$ in $\Phi_{t-1}$ do

  • 6:  $\bar{r}_{h,t} \leftarrow$ propagate $r_{h,t-1}$     ⊳ Using motion from $t-1$ to $t$

  • 7:  $S_{h,t} \leftarrow$ supporting samples of $\bar{r}_{h,t}$ in $P_t$     ⊳ Equation 11

  • 8:  check the association of $\bar{r}_{h,t}$ with $P_t$     ⊳ Using $\eta$

  • 9:  if $\bar{r}_{h,t}$ can be associated with $P_t$ then

  • 10:   $r_{h,t} \leftarrow$ expectation of $S_{h,t}$; update $c_{h,t}$; add $(r_{h,t}, c_{h,t})$ to $\Phi_t$

  • 11:  else

  • 12:   introduce a new hypothesis from the samples in $P_t$ not associated with $\bar{r}_{h,t}$

  • 13:   compute its confidence relative to the parent hypothesis $c_{h,t-1}$

  • 14:   add both the new hypothesis and $(\bar{r}_{h,t}, c_{h,t})$ to $\Phi_t$

  • 15:  end if

  • 16: end for

  • 17: normalize the confidences in $\Phi_t$

  • 18: retain only the hypotheses with confidence greater than $\tau_{conf}$

Algorithm 3 thus covers two possible failure cases. The association could not be established between the rotation hypothesis $\bar{r}_{h,t}$ and the posterior rotation distribution $P(R_t \mid Z_{1:t})$ because

  • $P(R_t \mid Z_{1:t})$ is a noisy distribution and the rotation hypothesis $\bar{r}_{h,t}$ (transformed after applying the camera motion) is a valid estimate;

  • $\bar{r}_{h,t}$ (transformed after applying the camera motion) is an invalid estimate but $P(R_t \mid Z_{1:t})$ is a valid distribution.

In the first case, the rotation hypothesis remains in the set $\Phi_t$, and the confidence in it increases, while the confidence in the rotation hypotheses added based on the noisy samples in $P(R_t \mid Z_{1:t})$ (those that could not be associated with $\bar{r}_{h,t}$) decreases as more valid distributions are acquired in subsequent time-steps.

In the second case, the confidence in the rotation hypotheses added based on the samples in the posterior rotation distribution that could not be associated with hypothesis $\bar{r}_{h,t}$ increases, and simultaneously, the confidence in hypothesis $\bar{r}_{h,t}$ decreases as more similar samples are obtained in subsequent time-steps.

At time $t$, the rotational hypothesis with the highest associated confidence is used as the rotational estimate $\hat{R}_t$.
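As a small illustration of the hypothesis bookkeeping above, the sketch below normalizes the confidences, prunes hypotheses below $\tau_{conf}$, and returns the highest-confidence hypothesis as the rotation estimate; the names and the guard for the degenerate all-pruned case are ours.

```python
import numpy as np

def prune_hypotheses(hypotheses, confidences, tau_conf=0.1):
    """Normalize hypothesis confidences, drop weak hypotheses, and return the
    surviving (hypothesis, confidence) pairs plus the current best estimate."""
    conf = np.asarray(confidences, dtype=float)
    conf = conf / conf.sum()                          # normalize to sum to 1
    kept = [(h, c) for h, c in zip(hypotheses, conf) if c > tau_conf]
    if not kept:                                      # guard: always keep the best one
        i = int(np.argmax(conf))
        kept = [(hypotheses[i], conf[i])]
    best = max(kept, key=lambda hc: hc[1])[0]         # rotation estimate at time t
    return kept, best
```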

3 Results

3.1 Experimental evaluation

3.1.1 Experiment objectives

The experiments are designed to verify that the proposed multi-view object pose distribution tracking using a robot and external cameras results in more accurate

  • object pose estimation than existing multi-view pose estimation methods;

  • uncertainty quantification than existing single-view uncertainty quantification methods;

  • object pose estimation and uncertainty quantification compared to using individual robot or external cameras alone.

Since our focus is on household objects, we use YCB benchmark (Calli et al., 2015) objects for evaluation. We consider the following baselines.

  • PoseRBPF (Deng et al., 2021): a single-camera object pose distribution tracking framework that provides both pose estimates and pose distributions at each time-step.

  • CosyPose (Labbé et al., 2020): a multi-camera pose estimation method. It first estimates the object poses in each camera and then matches individual pose hypotheses across different input images to jointly estimate camera viewpoints and 6D object poses in a single consistent scene.

We also conducted three ablation studies to validate our design choices for multi-view fusion, re-sampling, and pose estimation from a tracked distribution. Finally, we present an ablation study to compare the performance with different camera combinations for fusion.

3.1.2 Experiment setup

As the mobile robot approaches the object for grasping using its base and arm motion, the robot camera will initially have a very limited view of the object, which will improve as it moves closer to the object. On the other hand, static external cameras will always have the same view of the object. Additionally, there is a possibility that as the robot approaches the object, it may be occluded by other objects in the scene. In such cases, the robot is expected to utilize its motion to obtain a better view of the object.

Existing real-world object pose estimation benchmark datasets, such as the YCB Video dataset (Xiang et al., 2017), primarily consist of short single-camera video sequences without significant changes in the object’s view. The camera is positioned very close to the objects, simulating scenarios where the robot is already in close proximity to the object for grasping (i.e., assuming sequential navigation and manipulation). Furthermore, in multi-view datasets such as T-LESS (Hodaň et al., 2017), the object appears at an equal distance across all the views. Consequently, these datasets are not well-suited for evaluating this research. To address this gap, we contribute a “multi-view YCB object pose tracking dataset for mobile manipulation” (MY-MM).

3.1.2.1 MY-MM dataset

The dataset comprises 100 sequences featuring ten different YCB objects with distinct geometries and textures (Figure 4). For each video sequence (Figure 5), we provide the 6D pose of the object of interest in the robot camera frame, views of the scene from two external cameras, the transformations between all cameras for each time step, and intrinsic calibration for all the cameras. For each object, we provide ten sequences with ground truth poses. In each scene, along with the object with ground truth poses, several distractor objects are also present (Figure 5). Full robot camera videos along with the recorded ground truths for the sequences in Figure 5 can be viewed here2.

FIGURE 4


Subset of ten YCB objects used in the MY-MM dataset.

FIGURE 5


External and robot camera views at different time steps for four different sequences in the MY-MM dataset.

The external cameras are positioned approximately 3 m from the objects (Figure 1). The robot camera moves toward the object, starting from 2.5 m away and approaching to approximately 0.5 m from the object, recording approximately 250 frames in the process. The robot camera views were recorded using the Intel RealSense D435 RGB-D camera at a resolution of 640 × 480. External camera views were captured using Intel RealSense D455 RGB-D cameras at full HD resolution. However, only the center crop of the captured image, sized 640 × 480, containing a view of the table has been included in the dataset. We thus also reduce possible calibration errors caused by fish-eye effects in the periphery of the images.

3.1.2.2 Ground truth annotation

Computing the ground truth 6D pose from a distance is challenging. We employed the setup depicted in Figure 6 to mimic the robot camera views while approaching the object for grasping and to annotate the object's ground truth pose. First, the relative transformation between the YCB object origin frame and a marker origin frame was manually measured. To minimize the errors, the markers were printed with object outlines, and the objects were carefully placed inside. This was followed by estimating the marker pose in the robot camera frame to estimate the relative transformation between these two frames. Since the transformations between the object frame and the marker frame, and between the robot camera frame and the robot base frame, are already known, they can be used to compute object poses in the robot base frame. Moreover, the transformation between the robot TCP frame and the robot camera frame is also known. The estimated object ground truth is then manually validated by overlaying the object CAD model on the image using the estimated ground truth.

FIGURE 6


Setup for initially recording the transformation between the object and the robot TCP.

Each sequence is recorded as follows. Initially, the robot camera is positioned at a considerable distance from the object (Figure 7) at $t = 0$. At each subsequent time-step, the robot camera is moved closer to the object using only the manipulator motion. To ensure an accurate estimation of the object pose, the manipulator is completely stopped during each step to avoid any synchronization issues that may arise due to the motion. The transformation from the robot TCP to the robot base is recorded at each time-step. This recorded transformation and the pre-computed transformation from the robot base to the object are used to estimate the ground truth object pose. By following this procedure, the object pose with respect to the robot camera is progressively estimated as the camera approaches the object.

FIGURE 7


Simulating robot camera views while navigating toward the target object. Ground truth object pose at each time step is annotated by using the robot’s TCP pose at each time step and initially recorded transformation between the object and the robot’s TCP (Figure 6).

External camera views are obtained using cameras mounted on the wall (Figure 1). The extrinsic calibration of all the cameras, both robot and external, is initially performed with respect to a ChArUco marker.

3.1.3 Evaluation metrics

We evaluate pose estimation performance using the ADD and ADD-S metrics as previously proposed (Hinterstoisser et al., 2013; Xiang et al., 2017; Deng et al., 2021). Given the ground truth rotation $\bar{R}$ and translation $\bar{T}$, the estimated rotation $\tilde{R}$ and translation $\tilde{T}$, the set of 3D model points $\mathcal{M}$, and the number of points $m$, we define the following.

  • ADD: The average distance computes the mean of the pairwise distances between the 3D model points transformed according to the ground truth and estimated poses, as described in Equation 12: $\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \lVert (\bar{R}x + \bar{T}) - (\tilde{R}x + \tilde{T}) \rVert$. (12)

  • ADD-S: For symmetric and partially symmetric objects such as a bowl, mug, and banana, the match between points is ambiguous for some views. Therefore, the average distance is computed using the closest point distance, as described in Equation 13: $\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \lVert (\bar{R}x_1 + \bar{T}) - (\tilde{R}x_2 + \tilde{T}) \rVert$. (13)

The 6D pose is considered correct if the average distance is smaller than a pre-defined threshold. We report the area under the curve (AUC) of ADD and ADD-S (Labbé et al., 2020) for thresholds ranging from 0 to 10 cm in steps of 1 cm.
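For reference, the ADD and ADD-S metrics of Equations 12 and 13 can be computed as in the following NumPy sketch; the function names are ours.

```python
import numpy as np

def add_metric(R_gt, t_gt, R_est, t_est, model_points):
    """ADD: mean distance between model points under the ground-truth
    and estimated poses (Equation 12)."""
    gt = model_points @ R_gt.T + t_gt
    est = model_points @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def add_s_metric(R_gt, t_gt, R_est, t_est, model_points):
    """ADD-S: for each transformed ground-truth point, distance to the closest
    transformed estimated point (Equation 13), suited to symmetric objects."""
    gt = model_points @ R_gt.T + t_gt
    est = model_points @ R_est.T + t_est
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```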

To evaluate the object pose uncertainty quantification, we use the following metrics.

  • Log probability of ground truth translation: the log of the probability assigned to the ground truth translation $\bar{T}$ in the estimate of the posterior translation distribution $P(T_t \mid Z_{1:t})$3.

  • Log probability of ground truth rotation: the log of the probability assigned to the ground truth rotation $\bar{R}$ in the estimate of the posterior rotation distribution $P(R_t \mid Z_{1:t})$.

These metrics assess whether the distribution accurately assigns probability mass to the ground truth translation and rotation. As the log is undefined for zero probability, the minimum probability values were clipped to 1e−5. Thus, the minimum value for the log probability of the ground truth translation and ground truth rotation is log(1e−5).
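A tiny sketch of the clipping used for these metrics is shown below; the base of the logarithm is not stated in the text, so the natural logarithm is assumed here.

```python
import numpy as np

def log_prob_of_ground_truth(prob, floor=1e-5):
    """Log probability assigned to the ground truth, with the probability
    clipped to a small floor so the logarithm stays defined."""
    return float(np.log(max(prob, floor)))
```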

3.1.4 Hyperparameters and implementation details

We used a three-camera setup (two external cameras and one robot camera; $C = 3$) for evaluation (Figure 1). As noted in Deng et al. (2021), approximately 100 particles result in optimal performance. Hence, during initialization, 40 particles were sampled around the object detection in each camera frame. Thus, the total number of particles was set to $N = 120$.

To initialize the particles, the translation estimate was determined using the center of the 2D bounding box of the object obtained using the Detectron2 (Wu et al., 2019) object detector4 and the corresponding depth in the depth image.

The particles were then sampled around these estimates, assuming normally distributed noise with a standard deviation of 20 pixels in the image-coordinate estimates and 10 cm in the depth estimate (Equation 6). The same strategy was also employed to initialize the particles for the baseline. The standard deviations can be increased if more noise is expected in the translation estimates used for initialization. However, the larger the noise, the more particles are required to uniformly cover the translation space.

During the multi-view fusion, $\tau_{fail}$ was set to 0.6, as it was observed that pose distributions in a particular camera frame were unreliable when $\psi_{c,t}$ fell below 0.65 (Deng et al., 2021). $\epsilon$ was set to 1e−5; it is used for additive smoothing and hence should be a small value. For computing observation likelihoods, de-noising auto-encoders were trained for each object as detailed in Deng et al. (2021). To identify supporting samples, $\theta_{nb}$ was set to 30° and $p_{min}$ was set to 1e−5. These thresholds, together with the association fraction $\eta$, were used for associating the rotational hypotheses from the previous time-step with the rotation distribution at time $t$. Thus, for a stricter association, $\theta_{nb}$ should be decreased and $\eta$ increased. To decide whether a rotational hypothesis should be carried over to the next time-step, $\tau_{conf}$ was set to 0.1. The multiplicative inverse of $\tau_{conf}$ implicitly defines the maximum number of rotation hypotheses at a given time-step. Therefore, the smaller the value of $\tau_{conf}$, the larger the maximum number of rotation hypotheses that can be maintained. The unimodal or multi-modal nature of the posterior distributions was assessed using the Henze−Zirkler multivariate normality test (Henze and Zirkler, 1990).

All experiments were conducted on a workstation equipped with an Intel i9 processor, 32 GB RAM, and an NVIDIA GeForce RTX 3090 graphics card. The execution speed was approximately 4 Hz for the three-camera setup.

3.2 Results

In this section, we present the results of our experiments. We first provide a qualitative example in Section 3.2.1. Quantitative results that validate the objectives discussed in Section 3.1.1 are presented in Section 3.2.2.

3.2.1 Qualitative results

Figure 8 demonstrates the role of external cameras in preventing the particle filter from diverging, even when the object becomes severely occluded in the robot camera view. Initially, the object is visible in the robot camera view. However, translation errors in the estimated object pose are notably high both when using the robot and external cameras together (Figure 8a) and when using only the robot camera (Figure 8e). This is attributed to noisy depth images caused by the considerable distance (approximately 3 m) between the camera and the object. However, rotational errors remain low in both scenarios (Figures 8b,f), since a unique view of the object is available in the robot camera during this period.

FIGURE 8


Comparison of object pose distribution tracking using Ours and PoseRBPF (Deng et al., 2021). First row: Robot views at different time frames, external camera views, tracked object (cracker box). Second row: (a) translational error, (b) rotational error, (c) log-probability of ground truth translation and (d) log-probability of ground truth rotation for ours. Third row: (e) translational error, (f) rotational error, (g) log-probability of ground truth translation and (h) log-probability of ground truth rotation for PoseRBPF.

As the robot approaches the object, the object becomes increasingly occluded in the robot camera view, resulting in an increase in errors in both scenarios. However, there is a noteworthy difference in how these errors evolve with and without the use of external cameras. In the multi-view case (Figure 8b), the rotational error increases as the filter starts to rely on observations from the external cameras, which are more uncertain than unoccluded robot camera views. However, the evidence provided by the external cameras during this period prevents the particle filter from diverging (Figure 8c), and as a result, it quickly converges once the object starts to re-appear in the robot camera view. Quick convergence and accurate quantification of uncertainty can enable the robot to make early and reliable decisions as it approaches the object for grasping. Conversely, when using only the robot camera, the filter starts to diverge (Figure 8g) due to the noisy evidence provided by the robot camera during this period. As a result, there is a large increase in translational (Figure 8e) and rotational (Figure 8f) errors. Eventually, the filter converges at the end of the sequence (Figures 8e,f).

More qualitative results can be found here5.

3.2.2 Quantitative results

Table 1 presents pose estimation results for two baselines—PoseRBPF (Deng et al., 2021) and CosyPose (Labbé et al., 2020)—and the proposed framework. As PoseRBPF is a single-view pose estimation method, we report its pose estimates separately for the robot camera and the two external cameras. These single-view pose estimates were then used as input to CosyPose for refinement6. For large objects with distinct features, the external cameras yield better performance, as they have an unoccluded view of the object as well as sufficient pixel information for pose estimation. The performance also depends on the view of the object, as the external cameras are stationary.

TABLE 1

Object | AUC of ADD: PoseRBPF (robot camera), PoseRBPF (external camera 1), PoseRBPF (external camera 2), CosyPose (robot + external cameras), Ours (robot + external cameras) | AUC of ADD-S: PoseRBPF (robot camera), PoseRBPF (external camera 1), PoseRBPF (external camera 2), CosyPose (robot + external cameras), Ours (robot + external cameras)
002_master_chef_can 22.48 40.73 20.10 24.94 46.93 69.68 85.02 81.59 77.91 84.48
003_cracker_box 59.22 65.07 19.79 65.04 62.29 79.00 87.20 72.94 82.77 82.76
004_sugar_box 45.76 08.30 21.05 36.88 47.69 64.32 58.22 39.54 52.67 77.55
006_mustard_bottle 43.42 29.78 50.27 48.59 43.04 74.56 85.05 67.04 75.02 80.84
010_potted_meat_can 09.22 28.34 19.39 10.96 33.13 24.53 65.28 36.82 25.59 60.97
011_banana 30.11 05.64 20.15 43.20 22.46 50.87 18.98 42.18 60.99 40.30
021_bleach_cleanser 39.55 09.82 04.69 39.63 55.73 64.85 46.46 33.94 60.05 77.51
024_bowl 19.16 23.68 24.59 25.14 37.46 35.17 62.86 65.27 39.73 64.53
025_mug 20.25 49.54 33.18 26.96 64.77 53.75 80.96 83.50 54.04 87.11
052_extra_large_clamp 34.35 11.42 10.61 37.91 27.35 53.04 28.54 37.40 55.03 48.57
All 32.35 27.23 22.38 35.92 44.08 56.97 61.85 56.02 58.38 70.46

Pose estimation results.

For colored objects, the ADD-S metric is more relevant due to their symmetry; for the other objects, the ADD metric is more relevant.

Bold indicates the best performing method.
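
To make the metrics in Table 1 (and the later tables) concrete, the sketch below shows one way to compute the ADD and ADD-S errors and their area-under-the-curve (AUC) summary. The function names and the 10 cm maximum threshold commonly used for YCB objects are illustrative choices rather than the exact evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(R_gt, t_gt, R_est, t_est, model_pts):
    """ADD: mean distance between corresponding model points under the two poses."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def adds_metric(R_gt, t_gt, R_est, t_est, model_pts):
    """ADD-S: mean distance to the *closest* transformed point (symmetry-aware)."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    dists, _ = cKDTree(est).query(gt, k=1)
    return dists.mean()

def auc_of_metric(errors, max_threshold=0.10, steps=1000):
    """Area under the accuracy-vs-threshold curve, thresholds in [0, max_threshold] m."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracies = [(errors < th).mean() for th in thresholds]
    return np.trapz(accuracies, thresholds) / max_threshold
```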

CosyPose slightly improves the PoseRBPF estimates from the robot camera by refining them with the PoseRBPF estimates from the external cameras. However, the improvement is limited, since the estimates from the external cameras have large errors, especially for small objects.

Our framework achieves significantly better performance on objects such as the “006_mustard_bottle,” “010_potted_meat_can,” “021_bleach_cleanser,” “024_bowl,” and “025_mug.” These objects are partially symmetric and relatively small; as a result, during occlusion they are either severely occluded or seen from ambiguous views in the robot camera. The additional views from the external cameras prevent the particle filter from diverging during occlusions when the robot camera has a poor view of the object. As a result, the particle filter quickly converges to an accurate pose estimate once the robot camera again has a good view of the object. Without the additional views from the external cameras, the particle filter often needs to be re-initialized after occlusions (Deng et al., 2021). In the case of objects like the “011_banana” and “052_extra_large_clamp,” the robot and external cameras often provide contradictory observations that cannot be reliably captured by the failure model (Section 2.3.4); this leads to inaccurate distributions after fusion.

In Table 2, we present pose uncertainty quantification results using PoseRBPF (Deng et al., 2021) and the proposed framework. The proposed framework results in more accurate translation distributions, as indicated by the high log probability of ground truth translation. This is because, when sufficient evidence is not available, it prevents premature convergence and maintains multimodal translation distributions, while PoseRBPF incorrectly converges to inaccurate translation distributions. Preventing premature convergence is crucial for preventing the particle filter from diverging when the robot camera has poor views of the object.

TABLE 2

Object | Log probability of ground truth translation: PoseRBPF (robot camera), Ours (robot + external cameras) | Log probability of ground truth rotation: PoseRBPF (robot camera), Ours (robot + external cameras)
002_master_chef_can −3.05 −0.46 −5.88 −5.98
003_cracker_box −3.06 −1.56 −3.04 −2.98
004_sugar_box −4.37 −2.04 −6.04 −6.35
006_mustard_bottle −3.53 −1.95 −4.23 −3.63
010_potted_meat_can −10.57 −4.18 −8.63 −6.54
011_banana −5.08 −4.70 −5.49 −4.27
021_bleach_cleanser −4.90 −2.01 −5.41 −4.82
024_bowl −5.42 −0.97 −4.42 −3.96
025_mug −5.46 −0.31 −7.33 −4.35
052_extra_large_clamp −3.03 −0.95 −7.19 −5.81
All −5.02 −1.91 −5.76 −4.86

Pose uncertainty quantification results.

Bold indicates the best performing method.

Additionally, in the case of objects with distinct features and larger sizes, such as the “003_cracker_box,” PoseRBPF places slightly more probability mass on the ground truth rotation. This is because, when the robot camera has a good view of the object and a well-estimated translation, the external cameras can introduce noise during fusion due to their large distance to the object. However, the use of external cameras results in much better rotation distributions for the other objects, which are partially symmetric and smaller.
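
For reference, the log-probability metric in Table 2 and the following ablation tables can be read as the (log) density that the tracked translation distribution assigns to the ground truth translation (see footnote 3). Below is a minimal sketch of one way to compute it, assuming the translation posterior is summarized by a Gaussian mixture fitted to the weighted particles; the use of scikit-learn and the number of mixture components are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def log_prob_of_gt_translation(particles, weights, t_gt, n_components=3, seed=0):
    """Fit a Gaussian mixture to weighted translation particles and return the
    log-density it assigns to the ground-truth translation (a likelihood value,
    cf. footnote 3)."""
    particles = np.asarray(particles, dtype=float)   # (N, 3) translation samples
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                # normalize importance weights
    rng = np.random.default_rng(seed)
    # Resample the particles according to their weights so the fit reflects them.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed).fit(particles[idx])
    return float(gmm.score_samples(np.atleast_2d(t_gt))[0])
```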

3.2.3 Ablation studies

In this section, we first present three ablation studies to validate our design choices, followed by an ablation study on pose estimation and uncertainty quantification using different camera combinations for fusion.

3.2.3.1 Multi-view fusion

In this study, we validate the fusion approach outlined in Section 2.3.4. We compare our approach against two baselines. In the first, we directly fuse all the observations (similar to Naik et al. (2022)), while in the second baseline we perform additive filtering before fusion. Our approach (“Ours”), in addition to additive filtering, uses a failure model described in Section 2.3.4 to determine whether a particular observation should be included during the fusion. In the experiments, all particles at each time-step were injected using ground truth translations to ensure that design choices during the other steps in the particle filter did not affect the fusion. We measure the performance using the log of probability assigned to the ground truth rotation in the fused distribution. Results are presented in Table 3.

TABLE 3

Object Log probability of ground truth rotation
Fusion without additive filtering Fusion with additive filtering Ours (Section 2.3.4)
002_master_chef_can −14.26 −7.32 −7.08
003_cracker_box −14.74 −5.45 −4.74
004_sugar_box −16.53 −8.06 −5.48
006_mustard_bottle −13.34 −6.05 −3.46
010_potted_meat_can −16.41 −6.17 −3.59
011_banana −16.60 −5.61 −3.70
021_bleach_cleanser −16.60 −7.25 −4.03
024_bowl −15.70 −7.13 −6.93
025_mug −15.05 −5.60 −4.93
052_extra_large_clamp −16.60 −7.71 −6.07
All −15.52 −6.63 −5.00

Ablation study on multi-view fusion.

Bold indicates the best performing method.

Fusion without additive filtering results in poor performance (Table 3) because if the object is not visible in one of the cameras, observations from other cameras are also lost. Fusion with additive filtering addresses this and significantly improves performance as it retains good observations from other cameras before fusion. However, it does not effectively handle situations in which one of the cameras provides a poor view of the object, resulting in noisy observations. Fusing such noisy observations with additive filtering has the potential to corrupt good observations from other cameras. To mitigate this, we employ the failure model described in Section 2.3.4 to determine whether a particular observation should be included during fusion, which results in improved performance—as indicated by the high log probability of ground truth rotation in the right column of Table 3.
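
To make the fusion step concrete, here is a minimal sketch assuming each camera contributes a discretized rotation likelihood histogram over a shared grid. The reliability gate is a simplified stand-in for the failure model of Section 2.3.4, the constants mirror the values reported earlier (0.6 and 1e−5), and the function and variable names are ours.

```python
import numpy as np

def fuse_rotation_histograms(histograms, reliability, eps=1e-5, rel_threshold=0.6):
    """Fuse per-camera rotation likelihood histograms into one distribution.

    histograms : list of 1-D arrays over the same rotation grid (one per camera)
    reliability: per-camera reliability scores in [0, 1]; cameras below
                 `rel_threshold` are excluded (simplified failure model)
    eps        : additive filtering constant, so that a zero bin in one camera
                 does not wipe out evidence from the other cameras
    """
    fused = None
    for hist, rel in zip(histograms, reliability):
        if rel < rel_threshold:                    # failure model: skip unreliable cameras
            continue
        h = np.asarray(hist, dtype=float) + eps    # additive filtering
        fused = h if fused is None else fused * h  # multiplicative evidence fusion
    if fused is None:                              # no reliable camera: fall back to uniform
        fused = np.ones_like(np.asarray(histograms[0], dtype=float))
    return fused / fused.sum()
```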

3.2.3.2 Re-sampling

In this study, we validate the re-sampling approach outlined in Section 2.3.5. We compare our approach against two baselines. In the first, we re-sample all particles simultaneously using low-variance re-sampling (Thrun, 2002; Deng et al., 2021; Naik et al., 2022). In the second baseline, we separately re-sample the particles originating from each camera's observations at every time-step, so that particles from all camera observations are retained throughout tracking. Our approach (“Ours”) dynamically adjusts the number of particles originating from each camera's observations at every time-step based on their aggregate weight, and then re-samples these particles separately, as described in Section 2.3.5. The performance of all three approaches is measured by comparing the log probability of the ground truth translation in the resulting translation distribution.

Re-sampling all the particles simultaneously (first baseline) yields better performance than our approach and the second baseline for large objects with distinct features, as shown in Table 4. This is because the robot camera often provides high-quality observations of these objects, since they are rarely fully occluded due to their large size, whereas the external cameras tend to provide comparatively noisy observations due to their greater distance from the object. For these objects, simultaneously re-sampling all particles quickly eliminates noisy particles from the external cameras, resulting in more compact translation distributions than our approach, where particles from all cameras are retained until sufficient temporal evidence is available, or the second baseline, where particles originating from different frames are retained at all times.

TABLE 4

Object Log probability of ground truth translation
Re-sampling all particles simultaneously Separately re-sampling particles originating from different frames Ours (Section 2.3.5)
002_master_chef_can −0.80 −1.35 −1.67
003_cracker_box −2.08 −2.71 −2.63
004_sugar_box −2.81 −4.60 −3.74
006_mustard_bottle −1.32 −2.38 −2.62
010_potted_meat_can −6.67 −6.13 −0.42
011_banana −5.10 −6.92 −3.40
021_bleach_cleanser −5.79 −6.47 −2.68
024_bowl −1.30 −1.16 −0.54
025_mug −3.91 −1.94 −0.52
052_extra_large_clamp −2.92 −5.11 −2.18
All −3.27 −3.87 −2.04

Ablation study on re-sampling.

Bold indicates the best performing method.

However, the advantage of our approach is clearly evident in the case of challenging objects where the robot camera does not always provide a good observation due to partial symmetry or smaller object sizes. Our approach enables particles from different cameras to refine their estimates instead of quickly converging to specific estimates, as observed in the first baseline. Furthermore, it does not retain noisy particles all the time, as in the second baseline. Overall, then, our approach outperforms both baselines, particularly in the case of challenging objects (Table 4).
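
The following is a minimal sketch of this re-sampling scheme, assuming per-camera particle sets with importance weights: the particle budget is split in proportion to each set's aggregate weight and each set is then re-sampled with low-variance re-sampling, following the description in Section 2.3.5. The helper names and details are simplifications.

```python
import numpy as np

def low_variance_resample(particles, weights, n, rng):
    """Standard low-variance (systematic) re-sampling of one particle set."""
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights / weights.sum())
    return particles[np.searchsorted(cumulative, positions)]

def resample_per_camera(particle_sets, weight_sets, n_total, rng=None):
    """Split the particle budget across cameras by aggregate weight, then
    re-sample each camera's particles separately (sketch of Section 2.3.5).

    particle_sets: list of (N_c, d) arrays, one per camera
    weight_sets  : list of (N_c,) arrays of importance weights
    """
    rng = rng or np.random.default_rng()
    aggregates = np.array([w.sum() for w in weight_sets])
    budgets = np.maximum(1, np.round(n_total * aggregates / aggregates.sum()).astype(int))
    return [low_variance_resample(p, w, n, rng)
            for p, w, n in zip(particle_sets, weight_sets, budgets)]
```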

3.2.3.3 Pose estimation

In this study, we validate pose estimation from the tracked distribution. We use the approach in Deng et al. (2021) and Naik et al. (2022) as the baseline, where the pose is estimated as the mean of the samples in the current distribution that can be associated with the pose estimate from the previous time-step. Our approach (“Ours”) also considers the samples that could not be associated with the previous estimate. As a result, it maintains multiple hypotheses when there is not sufficient evidence to converge to a single pose estimate (Section 2.3.6).
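
A minimal sketch of how such hypotheses can be associated across time-steps and pruned is given below, using the 30° association gate and the 0.1 carry-over threshold mentioned earlier. The clustering of the tracked rotation distribution into modes is assumed to happen upstream, and the data structures and helper names are illustrative rather than the exact implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def geodesic_deg(R_a, R_b):
    """Angle (degrees) of the relative rotation between two rotation matrices."""
    return np.degrees(np.abs(R.from_matrix(R_a.T @ R_b).magnitude()))

def update_hypotheses(prev_hypotheses, new_modes, assoc_deg=30.0, min_weight=0.1):
    """Associate new rotation modes with previous hypotheses, then prune.

    prev_hypotheses / new_modes: lists of (rotation_matrix, weight) pairs.
    Returns a list of (rotation_matrix, normalized_weight, parent_index) tuples,
    where parent_index is the index of the matched previous hypothesis or None
    for a newly created hypothesis. Hypotheses whose normalized weight falls
    below `min_weight` are dropped, so 1 / min_weight implicitly bounds the
    number of hypotheses that can survive at a given time-step.
    """
    associated = []
    for R_new, w_new in new_modes:
        parent = next((i for i, (R_prev, _) in enumerate(prev_hypotheses)
                       if geodesic_deg(R_prev, R_new) < assoc_deg), None)
        associated.append((R_new, w_new, parent))
    total = sum(w for _, w, _ in associated) or 1.0
    return [(Rm, w / total, parent) for Rm, w, parent in associated
            if w / total >= min_weight]
```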

The same tracked distribution was used to evaluate both methods, with performance measured using the AUC of ADD and ADD-S metrics.

It can be observed from Table 5 that Deng et al. (2021) generally perform poorly, as they assume that a good initial object pose estimate is available. This assumption is often not valid when tracking from a distance. Our approach maintains multiple pose hypotheses when sufficient temporal evidence is not available in support of the current pose estimate; it is thus able to recover from poor pose estimates and hence achieves better performance.

TABLE 5

Object AUC of ADD AUC of ADD-S
PoseRBPF Ours (Section 2.3.6) PoseRBPF Ours (Section 2.3.6)
002_master_chef_can 27.65 29.19 82.11 80.38
003_cracker_box 41.67 55.80 74.23 81.14
004_sugar_box 35.09 42.95 71.40 73.67
006_mustard_bottle 47.79 50.01 78.71 82.88
010_potted_meat_can 29.32 35.66 74.00 73.19
011_banana 19.39 32.03 52.97 54.40
021_bleach_cleanser 35.72 49.67 71.41 76.83
024_bowl 25.12 37.63 70.86 74.03
025_mug 42.18 62.18 86.66 86.69
052_extra_large_clamp 09.81 11.34 35.10 35.54
All 31.37 40.64 69.74 71.87

Ablation study on pose estimation from tracked distribution.

Bold indicates the best performing method.

3.2.4 Different camera combinations

In this ablation study, we compare the performance of pose estimation and uncertainty quantification when fusing different camera combinations. We consider three cases: i) robot camera and external camera 1 (E1); ii) robot camera and external camera 2 (E2); iii) robot camera, E1, and E2 (refer to Figure 1 for the positioning of the cameras). Results are presented in Table 6. The combination of the robot camera and E1 yields better performance than the combination of the robot camera and E2, whose performance is very similar to that of using the robot camera alone (Table 1). This is because E1 provides more informative views, as it is positioned at a lower elevation than E2 (relative to the object frame). Consequently, E1 frequently has better views of the object's front, back, and sides, which typically contain more unique features. In contrast, E2, due to its higher elevation, often has a better view of the object's top than of its sides, front, and back.

TABLE 6

Object Pose estimation Uncertainty quantification
AUC of ADD AUC of ADD-S Log probability of ground truth translation Log probability of ground truth rotation
R-E1 R-E2 R-E1-E2 R-E1 R-E2 R-E1-E2 R-E1 R-E2 R-E1-E2 R-E1 R-E2 R-E1-E2
002_master_chef_can 47.61 25.74 46.93 83.93 77.49 84.48 −0.79 −1.17 −0.46 −6.03 −5.57 −5.98
003_cracker_box 55.57 47.19 62.29 80.32 79.63 82.76 −2.10 −2.18 −1.56 −3.46 −3.44 −2.98
004_sugar_box 38.98 42.76 47.69 77.89 70.95 77.55 −1.58 −4.71 −2.04 −6.11 −7.84 −6.35
006_mustard_bottle 35.03 46.23 43.04 83.52 77.06 80.84 −1.26 −3.11 −1.95 −4.11 −3.48 −3.63
010_potted_meat_can 27.02 08.05 33.13 47.90 19.28 60.97 −6.42 −9.85 −4.18 −6.93 −7.17 −6.54
011_banana 12.92 27.90 22.46 25.46 47.39 40.30 −6.18 −3.64 −4.70 −4.67 −3.46 −4.27
021_bleach_cleanser 50.68 47.44 55.73 73.26 63.61 77.51 −3.19 −5.00 −2.01 −5.11 −5.04 −4.82
024_bowl 34.56 26.79 37.46 61.69 46.65 64.53 −1.42 −3.84 −0.97 −3.99 −3.87 −3.96
025_mug 59.22 18.96 64.77 79.08 37.96 87.11 −1.25 −6.84 −0.31 −3.70 −6.66 −4.35
052_extra_large_clamp 28.28 23.62 27.35 46.50 39.51 48.57 −3.05 −5.06 −0.95 −7.16 −7.85 −5.81
All 38.98 31.46 44.08 65.95 55.95 70.46 −2.72 −4.54 −1.91 −5.12 −5.43 −4.86

Different camera combinations for fusion.

R, robot camera; E1, external camera 1; E2, external camera 2.

Bold indicates the best performing method.

However, when all the cameras are used together, there is a significant improvement in performance compared to using only the robot camera and E1. This is because E1 and E2 provide complementary views of the object (both are placed on opposite sides of the object; see Figure 1). These complementary views help resolve uncertainties and lead to improved pose estimates and uncertainty quantification.

3.3 Application: pre-grasp planning

We applied the proposed multi-view object pose distribution tracking framework for making informed decisions at various stages of mobile manipulation (MM). As described in Section 1, in addition to selecting the optimal grasp pose for grasping the object (Newbury et al., 2022), MM also requires selecting the optimal base pose from where the object can be grasped (Vahrenkamp et al., 2013; Makhal and Goins, 2018; Jauhri et al., 2022b).

An estimate of the object pose is required to select the optimal base poses and grasp poses for grasping the objects. Additionally, the estimates of the selected optimal grasp pose and base pose are themselves derived from the estimated object pose. Therefore, any error in the estimated object pose can result in failures: an error in the base pose can lead to a situation where no inverse kinematics (IK) solutions exist for grasping the object after navigating to the estimated base pose, while an error in the estimated grasp pose might lead to a failed grasp.

As the proposed framework estimates the object pose and the associated uncertainty from a distance, the robot can make early decisions regarding the optimal base pose and grasp pose for grasping the object. Moreover, instead of directly performing the actions—navigating to the estimated base pose or attempting to grasp using the estimated grasp pose—the robot can project the object pose uncertainty onto the action space and assess whether the estimated uncertainty is acceptable for the successful execution of the action.

For the base pose, the robot can evaluate whether deviations consistent with the uncertainty associated with the estimate of the selected base pose would still allow IK solutions for grasping the object. Similarly, for the grasp pose, it can validate in simulation (Kim et al., 2012) whether deviations consistent with the uncertainty associated with the estimate of the selected optimal grasp pose would still enable a successful grasp, as the gripper can compensate for some of the pose uncertainty. For mathematical details, we refer the reader to Naik et al. (2025).

In our experiments, we assumed that the top-down grasp pose was selected as the optimal grasp pose for grasping the objects. We examined four cases, each using the tracked object pose (and, in the later cases, its uncertainty) for decision-making at more stages than the previous case.

  • Fixed base pose and object pose estimation (FBOE): the robot goes to a fixed base pose near the table, estimates object pose, and attempts to grasp the object using the estimate of the selected grasp pose. This represents how MM is usually solved.

  • Object pose tracking and base pose estimation (OTBE): the robot starts tracking the object pose by means of our method, obtains an estimate of the selected base pose, and navigates to the estimated base pose. It then obtains the estimate of the selected grasp pose and attempts to grasp the object.

  • Object pose tracking and base IK monitoring (OTBM): OTBE, plus the robot monitors the probability that an IK solution is available, using the estimate of the selected base pose and the uncertainty in the object pose used for estimating the base pose. The robot navigates to the estimated base pose only if the probability of success exceeds a pre-defined threshold.

  • Object pose tracking and base IK and grasp success monitoring (OTBGM): OTBM, plus the robot monitors the probability of grasp success, using the estimate of the selected grasp pose and the uncertainty in the object pose used for estimating the grasp pose. The robot attempts the grasp using the estimated grasp pose only if the probability of success exceeds a pre-defined threshold.

In OTBM and OTBGM, the pre-defined threshold for both the base and grasp poses was set to 60%—that is, the action was executed only if more than 60% of the errors could be compensated for. In practice, aside from errors in the visual object pose estimates, several other factors can contribute to failures during action execution, including poor calibration between the external cameras and the robot camera due to errors in robot localization, and the inability to plan a trajectory to the intended grasp pose due to occlusion from other objects. To exclude these failure sources and quantify only the failures caused by errors in the visual object pose estimate, we created digital twins of the sequences in the MY-MM dataset using the recorded ground truths and the NVIDIA Isaac simulator7. The decisions were based on the real-world data from the MY-MM dataset sequences, while the actions were executed in the digital twin.
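
A minimal sketch of the go/no-go check used in OTBM and OTBGM is given below, assuming object pose samples drawn from the tracked distribution and a caller-supplied success predicate (an IK reachability check for OTBM, a simulated grasp trial for OTBGM). The predicate and the `ik_solver` mentioned in the usage comment are hypothetical placeholders rather than parts of our implementation.

```python
import numpy as np

def should_execute(pose_samples, success_predicate, threshold=0.6):
    """Monte-Carlo go/no-go check for an MM action under pose uncertainty.

    pose_samples      : iterable of object pose samples drawn from the tracked
                        distribution (e.g., 4x4 homogeneous transforms)
    success_predicate : callable(pose) -> bool; hypothetical stand-in for an IK
                        reachability check (OTBM) or a simulated grasp trial (OTBGM)
    threshold         : minimum estimated success probability (0.6 in our trials)
    """
    outcomes = np.array([bool(success_predicate(T)) for T in pose_samples])
    p_success = outcomes.mean() if len(outcomes) else 0.0
    return p_success >= threshold, p_success

# Usage idea: defer navigation/grasping until the estimate clears the threshold.
# ok, p = should_execute(samples, lambda T: ik_solver.has_solution(base_pose, T))
```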

We evaluated five objects (“003_cracker_box,” “006_mustard_bottle,” “021_bleach_cleanser,” “024_bowl,” and “025_mug”) with diverse shapes, colors, and textures, conducting ten trials for each object and each case. FBOE and OTBE do not consider uncertainty in the object pose, resulting in a large number of failures (Table 7).

TABLE 7

Object FBOE OTBE OTBM OTBGM
SG AG SG AG SG AG SG AG
003_cracker_box 8 10 7 10 10 10 10 10
006_mustard_bottle 1 10 3 10 5 10 4 5
021_bleach_cleanser 3 10 3 10 5 8 4 7
024_bowl 4 10 3 10 7 9 7 9
025_mug 6 10 4 10 10 10 10 10
All 22 50 20 50 37 47 35 41
Success rate 44% 40% 78.72% 85.36%

Grasp success rate with and without using visual object pose uncertainty during decision-making.

SG, successful grasps; AG, attempted grasps.

Bold indicates the best performing method.

On the other hand, OTBM and OTBGM monitor uncertainties in the tracked object pose before making MM decisions. Consequently, they exhibit a much higher success rate and fewer failures. Most of the failures in OTBGM occurred due to the absence of a valid IK solution at the estimated base pose. To mitigate these failures, one option is to raise the threshold for decision-making, requiring more than 60% of estimated errors to be compensated for before executing the action. However, this strategy comes with the trade-off of delaying decision-making, which results in longer execution times.

In summary, the early pose estimates from the proposed object pose distribution tracking framework enable the mobile robot to make timely decisions regarding the optimal base pose and grasp pose for object manipulation, while the associated pose uncertainty helps prevent failures. In this way, the proposed framework contributes to both the time efficiency and robustness of mobile manipulation.

4 Discussion

In this study, we have presented a particle filter-based framework for multi-view 6D object pose distribution tracking using the robot’s onboard camera and external cameras in the environment. We demonstrated that, as the mobile robot approaches the objects for grasping, the proposed framework yields more accurate 6D object pose estimates than existing single- and multi-camera pose estimation methods, and more accurate pose distributions than existing single-camera pose distribution estimation methods. The external cameras not only help resolve ambiguities when the robot camera has ambiguous views of the object but also prevent the particle filter from diverging. Consequently, the particle filter quickly converges once the robot camera views start to improve, unlike when using only the robot camera, where the particle filter needs to be re-initialized. Finally, we contributed a multi-view YCB object pose tracking dataset for mobile manipulation that can serve as a benchmark for future research in this direction.

Our results are encouraging, as they indicate that the use of external cameras can significantly improve the accuracy of the estimated object pose and distributions in the robot camera. This can enable the mobile robot to make early and informed decisions, thereby improving both the time efficiency and robustness of mobile manipulation.

Here, we used the PoseRBPF method (Deng et al., 2021) to compute a discretized rotation distribution, which relies on an Euler angle representation. However, this introduces a risk of gimbal lock during multi-view fusion. Future research should investigate quaternion-based discretizations of SO(3), as proposed by Murphy et al. (2021) and Haugaard et al. (2023). Furthermore, the single-view observation likelihood model from PoseRBPF (Deng et al., 2021) used in this work exhibits reduced performance on YCB objects such as the “011_banana” and “052_extra_large_clamp” under occlusions, highlighting the need for more robust single-view observation likelihood models capable of handling such challenging cases. Finally, our framework can be combined with a partially observable Markov decision process (POMDP) framework, as applied in An et al. (2024), to determine actions that reduce uncertainty and yield better pose estimates and distributions.

Statements

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://nextcloud.sdu.dk/index.php/s/eQitCanx9ztLZwg.

Author contributions

LN: Software, Conceptualization, Writing – review and editing, Visualization, Methodology, Writing – original draft, Validation. TI: Writing – review and editing, Supervision. JW: Writing – review and editing. NK: Funding acquisition, Writing – review and editing, Project administration, Supervision.

Funding

The authors declare that financial support was received for the research and/or publication of this article. This work was supported by the European Union Horizon 2020 project Fluently (grant agreement no 958417) and by the Innovation Fund Denmark's FERA project (3149-00014A).

Acknowledgments

The authors would also like to thank the SDU I4.0 lab for providing the robot for experimental evaluation.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1.^It should be noted that the observation likelihood is computed for a specific observation for a set of poses sampled on SE(3). Consequently, the histogram expressing the observation likelihood is not a normalized probability distribution. The likelihood of each bin is calculated using a cosine distance which has a value in the range of 0–1.

2.^ https://lakshadeep.github.io/mymm/.

3.^It should be noted that in the case of continuous distributions, such as the mixture of Gaussians used for modeling the translation distribution, the probability assigned to the ground truth translation is actually a likelihood.

4.^It was found that the Detectron2 models trained on the photo-realistic data for detecting the objects in the external cameras were very sensitive to the seeds. Hence, four Detectron2 models were trained with different initial weights. For each object class, validation data were used to determine the best-performing Detectron2 model. For reproducibility and comparative evaluation with future research, we have made these detections available in the dataset.

5.^ https://lakshadeep.github.io/mvopdt/.

6.^The single-view pose estimator proposed in CosyPose performed poorly on the MY-MM dataset when the distance between the objects and the camera was large. We attribute this to the fact that the pre-trained models were trained on data where the objects were typically close to the camera and did not use depth information. Therefore, for a fair comparison, we used poses estimated using PoseRBPF in each camera at all time steps and refined them using the multi-view refinement method proposed in CosyPose.

7.^ https://developer.nvidia.com/isaac-sim.

References

  • 1

    An B. Geng Y. Chen K. Li X. Dou Q. Dong H. (2024). “Rgbmanip: monocular image-based robotic manipulation through active object pose estimation,” in 2024 IEEE international conference on robotics and automation (ICRA) (IEEE).

  • 2

    Calli B. Walsman A. Singh A. Srinivasa S. Abbeel P. Dollar A. M. (2015). Benchmarking in manipulation research: using the yale-cmu-berkeley object and model set. IEEE Robotics and Automation Mag.22, 3652. 10.1109/mra.2015.2448951

  • 3

    Choi C. Christensen H. I. (2012a). “3d pose estimation of daily objects using an rgb-d camera,” in 2012 IEEE/RSJ international conference on intelligent robots and systems (IEEE), 33423349.

  • 4

    Choi C. Christensen H. I. (2012b). “3d textureless object detection and tracking: an edge-based approach,” in 2012 IEEE/RSJ international conference on intelligent robots and systems (IEEE), 38773884.

  • 5

    Collet A. Srinivasa S. S. (2010). “Efficient multi-view object recognition and full pose estimation,” in 2010 IEEE international conference on robotics and automation (IEEE), 20502055.

  • 6

    Collet A. Martinez M. Srinivasa S. S. (2011). The moped framework: object recognition and pose estimation for manipulation. Nternational Journal Robotics Research30, 12841306. 10.1177/0278364911401765

  • 7

    Corona E. Kundu K. Fidler S. (2018). “Pose estimation for objects with rotational symmetry,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 72157222.

  • 8

    Deng X. Mousavian A. Xiang Y. Xia F. Bretl T. Fox D. (2019). Poserbpf: a rao-blackwellized particle filter for 6d object pose tracking. arXiv e-prints.

  • 9

    Deng X. Xiang Y. Mousavian A. Eppner C. Bretl T. Fox D. (2020). “Self-supervised 6d object pose estimation for robot manipulation,” in 2020 IEEE international conference on robotics and automation (ICRA) (IEEE), 36653671.

  • 10

    Deng X. Mousavian A. Xiang Y. Xia F. Bretl T. Fox D. (2021). Poserbpf: a rao–blackwellized particle filter for 6-d object pose tracking. IEEE Trans. Robotics 37, 1328–1342. 10.1109/tro.2021.3056043

  • 11

    Doucet A. De Freitas N. Gordon N. J. (2001). Sequential monte carlo methods in practice, vol. 1. Springer.

  • 12

    Drost B. Ulrich M. Navab N. Ilic S. (2010). “Model globally, match locally: efficient and robust 3d object recognition,” in 2010 IEEE computer society conference on computer vision and pattern recognition (IEEE), 9981005.

  • 13

    Durrant-Whyte H. Henderson T. C. (2016). “Multisensor data fusion,” in Springer handbook of robotics (Berlin, Heidelberg: Springer), 867896.

  • 14

    Erkent O. Shukla D. Piater J. (2016). “Integration of probabilistic pose estimates from multiple views,” in European conference on computer vision (Springer), 154170.

  • 15

    Garcia-Haro J. M. Oña E. D. Hernandez-Vicen J. Martinez S. Balaguer C. (2020). Service robots in catering applications: a review and future challenges. Electronics10, 47. 10.3390/electronics10010047

  • 16

    Glover J. Bradski G. Rusu R. B. (2012). Monte carlo pose estimation with quaternion kernels and the bingham distribution. Robotics Science Systems7, 97104. 10.7551/mitpress/9481.003.0018

  • 17

    Haugaard R. L. Buch A. G. (2022). “Surfemb: dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 67496758.

  • 18

    Haugaard R. L. Iversen T. M. (2023). “Multi-view object pose estimation from correspondence distributions and epipolar geometry,” in 2023 IEEE international conference on robotics and automation (ICRA) (IEEE), 17861792.

  • 19

    Haugaard R. L. Hagelskjær F. Iversen T. M. (2023). Spyropose: importance sampling pyramids for object pose distribution estimation in se (3). arXiv preprint arXiv:2303.05308.

  • 20

    Helmer S. Meger D. Muja M. Little J. J. Lowe D. G. (2010). “Multiple viewpoint recognition and localization,” in Asian conference on computer vision (Springer), 464477.

  • 21

    Henze N. Zirkler B. (1990). A class of invariant consistent tests for multivariate normality. Commun. Statistics-Theory Methods 19, 3595–3617. 10.1080/03610929008830400

  • 22

    Hinterstoisser S. Holzer S. Cagniart C. Ilic S. Konolige K. Navab N. et al (2011). “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in 2011 international conference on computer vision (IEEE), 858865.

  • 23

    Hinterstoisser S. Lepetit V. Ilic S. Holzer S. Bradski G. Konolige K. et al (2013). “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes,” in Computer Vision–ACCV 2012: 11th Asian conference on computer vision, Daejeon, Korea, November 5-9, 2012, revised selected papers, part I 11 (Springer), 548562.

  • 24

    Hodaň T. Haluza P. Obdržálek Š. Matas J. Lourakis M. Zabulis X. (2017). “T-LESS: an RGB-D dataset for 6D pose estimation of texture-less objects,” in IEEE winter conference on applications of computer vision (WACV).

  • 25

    Hvilshøj M. Bøgh S. Skov Nielsen O. Madsen O. (2012). Autonomous industrial mobile manipulation (aimm): past, present and future. Industrial Robot Int. J.39, 120135. 10.1108/01439911211201582

  • 26

    Iversen T. M. Haugaard R. L. Buch A. G. (2022). Ki-pode: keypoint-Based implicit pose distribution estimation of rigid objects. arXiv preprint arXiv:2209.09659.

  • 27

    Jauhri S. Peters J. Chalvatzaki G. (2022a). Robot learning of mobile manipulation with reachability behavior priors. IEEE Robotics Automation Lett. 7, 8399–8406. 10.1109/LRA.2022.3188109

  • 28

    Jauhri S. Peters J. Chalvatzaki G. (2022b). Robot learning of mobile manipulation with reachability behavior priors. IEEE RA-L 7, 8399–8406. 10.1109/lra.2022.3188109

  • 29

    Kaba M. D. Uzunbas M. G. Lim S. N. (2017). “A reinforcement learning approach to the view planning problem,” in 2017 IEEE conference on computer vision and pattern recognition (CVPR) (IEEE), 50945102.

  • 30

    Kanezaki A. Matsushita Y. Nishida Y. (2018). “Rotationnet: joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 50105019.

  • 31

    Kehl W. Manhardt F. Tombari F. Ilic S. Navab N. (2017). “Ssd-6d: making rgb-based 3d detection and 6d pose estimation great again,” in Proceedings of the IEEE international conference on computer vision, 15211529.

  • 32

    Kendall A. Grimes M. Cipolla R. (2015). “Posenet: a convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 29382946.

  • 33

    Khaleghi B. Khamis A. Karray F. O. Razavi S. N. (2013). Multisensor data fusion: a review of the state-of-the-art. Inf. Fusion14, 2844. 10.1016/j.inffus.2011.08.001

  • 34

    Kim J. Iwamoto K. Kuffner J. J. Ota Y. Pollard N. S. (2012). “Physically-based grasp quality evaluation under uncertainty,” in 2012 IEEE international conference on robotics and automation (IEEE), 3258–3263.

  • 35

    Krizhevsky A. Sutskever I. Hinton G. E. (2012). “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 10971105.

  • 36

    Labbé Y. Carpentier J. Aubry M. Sivic J. (2020). “Cosypose: consistent multi-view multi-object 6d pose estimation,” in European conference on computer vision (Springer), 574–591.

  • 37

    Li C. Bai J. Hager G. D. (2018a). “A unified framework for multi-view multi-class object pose estimation,” in Proceedings of the European conference on computer vision (Munich: ECCV), 254269.

  • 38

    Li Y. Wang G. Ji X. Xiang Y. Fox D. (2018b). “Deepim: deep iterative matching for 6d pose estimation,” in Proceedings of the European conference on computer vision (Munich: ECCV), 683698.

  • 39

    Lim J. J. Khosla A. Torralba A. (2014). “Fpm: fine pose parts-based model with 3d cad models,” in European conference on computer vision (Springer), 478493.

  • 40

    Makhal A. Goins A. K. (2018). “Reuleaux: robot base placement by reachability analysis,” in IEEE int. Conf. On robotic computing.

  • 41

    Manhardt F. Kehl W. Navab N. Tombari F. (2018). “Deep model-based 6d pose refinement in rgb,” in Proceedings of the European conference on computer vision (Munich: ECCV), 800815.

  • 42

    Manhardt F. Arroyo D. M. Rupprecht C. Busam B. Birdal T. Navab N. et al (2019). “Explaining the ambiguity of object detection and 6d pose from visual data,” in Proceedings of the IEEE international conference on computer vision, 68416850.

  • 43

    Mendes E. Koch P. Lacroix S. (2016). “Icp-based pose-graph slam,” in 2016 IEEE international symposium on safety, security, and rescue robotics (SSRR) (IEEE), 195200.

  • 44

    Murphy K. Russell S. (2001). “Rao-blackwellised particle filtering for dynamic bayesian networks,” in Sequential monte carlo methods in practice (Springer), 499515.

  • 45

    Murphy K. A. Esteves C. Jampani V. Ramalingam S. Makadia A. (2021). “Implicit-pdf: non-parametric representation of probability distributions on the rotation manifold,” in International conference on machine learning (PMLR), 7882–7893.

  • 46

    Naik L. (2024). Pre-grasp planning for time-efficient and robust mobile manipulation. Odense: University of Southern Denmark. PhD thesis.

  • 47

    Naik L. Iversen T. M. Kramberger A. Wilm J. Krüger N. (2022). “Multi-view object pose distribution tracking for pre-grasp planning on mobile robots,” in 2022 IEEE international conference on robotics and automation (ICRA) (IEEE).

  • 48

    Naik L. Kalkan S. Krüger N. (2024a). Pre-grasp approaching on mobile robots: a pre-active layered approach. IEEE Robotics Automation Lett. 9, 2606–2613. 10.1109/lra.2024.3358077

  • 49

    Naik L. Kalkan S. Sørensen S. L. Kjærgaard M. B. Krüger N. (2024b). “Basenet: a learning-based mobile manipulator base pose sequence planning for pickup tasks,” in 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 3666–3673.

  • 50

    Naik L. Iversen T. M. Kramberger A. Krüger N. (2025). Robotic task success evaluation under multi-modal non-parametric object pose uncertainty. Industrial Robot: The International Journal of Robotics Research and Application 52, 651–662. 10.1108/ir-10-2024-0467

  • 51

    Newbury R. Gu M. Chumbley L. Mousavian A. Eppner C. Leitner J. et al (2022). Deep learning approaches to grasp synthesis: a review. arXiv preprint arXiv:2207.02556.

  • 52

    Okorn B. Xu M. Hebert M. Held D. (2020). “Learning orientation distributions for object pose estimation,” in 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 1058010587.

  • 53

    Pavlakos G. Zhou X. Chan A. Derpanis K. G. Daniilidis K. (2017). “6-dof object pose from semantic keypoints,” in 2017 IEEE international conference on robotics and automation (ICRA) (IEEE), 20112018.

  • 54

    Peng S. Liu Y. Huang Q. Zhou X. Bao H. (2019). “Pvnet: pixel-wise voting network for 6dof pose estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 45614570.

  • 55

    Rad M. Lepetit V. (2017). “Bb8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth,” in Proceedings of the IEEE international conference on computer vision, 38283836.

  • 56

    Salas-Moreno R. F. Newcombe R. A. Strasdat H. Kelly P. H. Davison A. J. (2013). “Slam++: simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 13521359.

  • 57

    Schütze H. Manning C. D. Raghavan P. (2008). Introduction to information retrieval, vol. 39. Cambridge University Press Cambridge.

  • 58

    Sereinig M. Werth W. Faller L.-M. (2020). A review of the challenges in Mobile manipulation: systems design and robocup challenges. Elektrotech. Inf.137, 297308. 10.1007/s00502-020-00823-8

  • 59

    Shi G. Zhu Y. Tremblay J. Birchfield S. Ramos F. Anandkumar A. et al (2021). “Fast uncertainty quantification for deep object pose estimation,” in 2021 IEEE international conference on robotics and automation (ICRA) (IEEE), 52005207.

  • 60

    Su H. Maji S. Kalogerakis E. Learned-Miller E. (2015). “Multi-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE international conference on computer vision, 945953.

  • 61

    Sundermeyer M. Marton Z.-C. Durner M. Brucker M. Triebel R. (2018). “Implicit 3d orientation learning for 6d object detection from rgb images,” in Proceedings of the European conference on computer vision (Munich: ECCV), 699715.

  • 62

    Susanto W. Rohrbach M. Schiele B. (2012). “3d object detection with multiple kinects,” in European conference on computer vision (Springer), 93102.

  • 63

    Thakar S. Srinivasan S. Al-Hussaini S. Bhatt P. M. Rajendran P. Jung Yoon Y. et al (2023). A survey of wheeled mobile manipulation: a decision-making perspective. J. Mech. Robotics15, 020801. 10.1115/1.4054611

  • 64

    Thrun S. (2002). Probabilistic robotics. Commun. ACM 45, 52–57. 10.1145/504729.504754

  • 65

    Tremblay J. To T. Sundaralingam B. Xiang Y. Fox D. Birchfield S. (2018). “Deep object pose estimation for semantic robotic grasping of household objects,” in Conference on robot learning (Zurich: PMLR), 306316.

  • 66

    Vahrenkamp N. Asfour T. Dillmann R. (2013). “Robot placement based on reachability inversion,” in IEEE ICRA.

  • 67

    Wang C. Xu D. Zhu Y. Martín-Martín R. Lu C. Fei-Fei L. et al (2019). “Densefusion: 6d object pose estimation by iterative dense fusion,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 33433352.

  • 68

    Wang C. Martín-Martín R. Xu D. Lv J. Lu C. Fei-Fei L. et al (2020). “6-pack: category-level 6d pose tracker with anchor-based keypoints,” in 2020 IEEE international conference on robotics and automation (ICRA) (IEEE), 1005910066.

  • 69

    Wen B. Mitash C. Ren B. Bekris K. E. (2020). “Se (3)-tracknet: data-Driven 6d pose tracking by calibrating image residuals in synthetic domains,” in 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 1036710373.

  • 70

    Wu Y. Kirillov A. Massa F. Lo W.-Y. Girshick R. (2019). Detectron2. Available online at: https://github.com/facebookresearch/detectron2.

  • 71

    Xiang Y. Schmidt T. Narayanan V. Fox D. (2017). Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199.

  • 72

    Xie H. Yao H. Sun X. Zhou S. Zhang S. (2019). “Pix2vox: context-aware 3d reconstruction from single and multi-view images,” in Proceedings of the IEEE international conference on computer vision, 26902698.

  • 73

    Zakharov S. Shugurov I. Ilic S. (2019). “Dpod: 6d pose object detector and refiner,” in Proceedings of the IEEE international conference on computer vision, 19411950.

  • 74

    Zeng A. Yu K.-T. Song S. Suo D. Walker E. Rodriguez A. et al (2017). “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” in 2017 IEEE international conference on robotics and automation (ICRA) (IEEE), 1386.

  • 75

    Zia Ur Rahman M. Usman M. Farea A. Ahmad N. Mahmood I. Imran M. (2022). Vision-based mobile manipulator for handling and transportation of supermarket products. Math. Problems Eng.2022, 110. 10.1155/2022/3883845

Summary

Keywords

pose estimation, uncertainty quantification, pose distribution tracking, multi-camera fusion, mobile manipulation

Citation

Naik L, Iversen TM, Wilm J and Krüger N (2026) Multi-view object pose distribution tracking for pre-grasp planning on mobile robots. Front. Robot. AI 12:1683931. doi: 10.3389/frobt.2025.1683931

Received

11 August 2025

Revised

03 November 2025

Accepted

10 November 2025

Published

14 January 2026

Volume

12 - 2025

Edited by

Xin Zhang, University of Portsmouth, United Kingdom

Reviewed by

Yinlong Liu, University of Macau, China

Haiyuan Gui, Qingdao University of Technology, China

Copyright

*Correspondence: Lakshadeep Naik,

† Present address: Lakshadeep Naik, Division of Robotics, Perception and Learning at KTH Royal Institute of Technology, Stockholm, Sweden

