Progress in symmetry preserving robot perception and control through geometry and learning

Ghaffari, Maani; Zhang, Ray; Zhu, Minghan; Lin, Chien Erh; Lin, Tzu-Yuan; Teng, Sangli; Li, Tingjun; Liu, Tianyi; Song, Jingwei

doi:10.3389/frobt.2022.969380

ORIGINAL RESEARCH article

Front. Robot. AI, 14 September 2022

Sec. Field Robotics

Volume 9 - 2022 | https://doi.org/10.3389/frobt.2022.969380

This article is part of the Research Topic "Rising Stars in Field Robotics: 2022"

Maani Ghaffari*

Ray Zhang

Minghan Zhu

Chien Erh Lin

Tzu-Yuan Lin

Sangli Teng

Tingjun Li

Tianyi Liu

Jingwei Song

Computational Autonomy and Robotics Laboratory (CURLY), University of Michigan, Ann Arbor, MI, United States

This article reports on recent progress in robot perception and control methods developed by taking the symmetry of the problem into account. Inspired by existing mathematical tools for studying the symmetry structures of geometric spaces, geometric sensor registration, state estimation, and control methods provide indispensable insights into the problem formulations and generalization of robotics algorithms to challenging unknown environments. When combined with computational methods for learning hard-to-measure quantities, symmetry-preserving methods yield substantial performance gains. The article supports this claim by showcasing experimental results of robot perception, state estimation, and control in real-world scenarios.

1 Introduction

Understanding the underlying principles of intelligence is at the heart of Artificial Intelligence (AI) and its applications for robotics, i.e., embodied AI, towards building a fully adaptive autonomous system capable of operating in the real world. Computational mathematics and intelligence have become a pivot for these fields, given the current advances in hardware. By combining, unifying, and expanding our mathematical and data-driven understanding of these areas of science and research, one can push the boundaries towards a unifying cognitive model that:

1. is robust to challenging environments and behavior modes;

2. takes into account hierarchical semantic knowledge of the scene such as objects and affordances as well as the geometry;

3. possesses sufficient mathematical and computational structures to be exploited for developing efficient and generalizable algorithms;

4. follows compositional principles to assemble integrated models that can produce outcomes bigger than the sum of individual modules.

This work provides an overview of our recent efforts for robot perception and control methods that can leverage structures such as symmetry and data simultaneously. Roughly speaking, a symmetry of an object is a motion that leaves it unchanged (Tapp, 2021). For example, consider the sphere $S^{2} = \{(x_{1}, x_{2}, x_{3}) \in R^{3} \mid x_{1}^{2} + x_{2}^{2} + x_{3}^{2} = 1\}$. Its symmetry group is the three-dimensional orthogonal group O(3), i.e., the disjoint union of all 3D rotations and reflections. No matter how we rotate the sphere, its shape remains the same. More generally, Lie groups model the continuous symmetry of geometric spaces and are equipped with a natural coordinate system called exponential coordinates. An important consequence of this observation is that we can formulate problems more naturally: the Lie group action commutes with the (data-driven) functional representation of data (Section 2); the state estimation and control error dynamics become independent of the current operating point and depend only on the desired relative motion (Sections 3 and 4); and we can lift multimodal signals, including images and point clouds, to some Lie algebras via equivariant networks (Section 5).

Section 2 presents a nonparametric analytical framework that models semantically labeled point clouds for solving the sensor registration problem (Ghaffari et al., 2019; Clark et al., 2021; Zhang et al., 2021). The framework lifts the data into a Reproducing Kernel Hilbert Space (RKHS), where the inner product structure captures the cross-correlation between two labeled point clouds as functions. This framework is an example of an equivariant model for modeling data where a Lie group transformation acts on these functions to align them.

Section 3 presents a robot state estimation framework using invariant Kalman filtering (Barrau and Bonnabel, 2017; Barrau and Bonnabel, 2018; Hartley et al., 2020) and deep learning for estimating contact events from multi-modal proprioceptive sensory data (Lin et al., 2022). The novel combination of a geometric filter on Lie groups with deep learning, which provides learned contact events without physical sensors, shows a promising direction for integrating real-time deep learning into high-frequency robot state estimation tasks.

Section 4 provides an overview of the error-state Model Predictive Control (MPC) on Lie groups and the stability analysis by a Lyapunov function expressed in the Lie algebra (Teng et al., 2022a; Teng et al., 2022b). We derive the linearized configuration error dynamics and equations of motion in the Lie algebra (tangent space at the identity) that, given an initial condition, are globally valid and independent of the system trajectory. This approach leads to a convex MPC algorithm for the tracking control problem using the linearized error dynamics, which can be solved efficiently using Quadratic Programming (QP) solvers. The proposed controller is validated in experiments on quadrupedal robot pose control and locomotion.

Section 5 presents recent frameworks for equivariant feature learning and their applications in registration and place recognition tasks (Zhu et al., 2022b). We learn an embedding for each input in a feature space that preserves the equivariance property, enabled by recent developments in symmetry-preserving neural networks. Symmetry (or equivariance) in a neural network enables efficient learning (by removing the need for data augmentation), generalization, and a clear connection between the changes in the input and output spaces, i.e., explainability.

Finally, Section 6 provides closing remarks by summarizing our new findings and their impacts on robot perception and control. We also discuss future opportunities enabled by the presented results in this article.

2 RKHS registration for spatial-semantic perception

Point clouds obtained by modern sensors such as RGB-D cameras, stereo cameras, and LIDARs contain up to 300,000 points per scan at 10–60 Hz and rich color and intensity (reflectivity of a material sensed by an active light beam) measurements besides the geometric information. In addition, deep learning (LeCun et al., 2015) can provide semantic attributes of the scene as measurements (Long et al., 2015; Chen et al., 2017; Zhu et al., 2019).

Illustrated in Figure 1, the following formulation provides a general framework for lifting semantically labeled point clouds into a function space to solve a registration problem (Ghaffari et al., 2019; Clark et al., 2021; Zhang et al., 2021). Consider two (finite) collections of points, $X = \{x_{i}\}$, $Z = \{z_{j}\} \subset R^{3}$. We want to determine which element h ∈ SE(3) aligns the two point clouds X and hZ = {hz_j} the “best.” To assist with this, we will assume that each point contains information described by a point in an inner product space, $(I, {⟨ \cdot, \cdot ⟩}_{I})$. To this end, we will introduce two labeling functions, $ℓ_{X} : X \to I$ and $ℓ_{Z} : Z \to I$. To measure their alignment, we turn the point clouds, X and Z, into functions $f_{X}, f_{Z} : R^{3} \to I$ that live in some RKHS, $(H, {⟨ \cdot, \cdot ⟩}_{H})$. The action $SE (3) ↷ R^{3}$ induces an action $SE (3) ↷ H$ by h.f(x) ≔ f(h⁻¹x). Inspired by this observation, we will set $h . f_{Z} ≔ f_{h^{- 1} Z}$.

FIGURE 1. Point clouds X and Z are represented by two continuous functions f_X, f_Z in an RKHS. Each point x_i has its own semantic labels, ℓ_X(x_i), encoded in the corresponding function representation via a tensor product representation (Zhang et al., 2021). The registration is formulated as maximizing the inner product between two point cloud functions.

Problem 1. The problem of aligning the point clouds can now be rephrased as maximizing the scalar product of f_X and h.f_Z, i.e., we want to solve

\underset{h \in SE (3)}{\arg\max} \; F (h), \quad F (h) ≔ {⟨ f_{X}, f_{h^{- 1} Z} ⟩}_{H} . (1)

2.1 Constructing the functions

For the kernel of our RKHS, $H$ , we first choose the squared exponential kernel $k : R^{3} \times R^{3} \to R$ :

k (x, z) = σ^{2} \exp (\frac{- ‖ x - z ‖_{3}^{2}}{2 ℓ^{2}}), (2)

for some fixed real parameters (hyperparameters) σ and ℓ (the lengthscale), and ‖ ⋅‖₃ is the standard Euclidean norm on $R^{3}$. This allows us to turn the point clouds into functions via $f_{X} (\cdot) ≔ \sum_{x_{i} \in X} ℓ_{X} (x_{i}) k (\cdot, x_{i})$ and $f_{h^{- 1} Z} (\cdot) ≔ \sum_{z_{j} \in Z} ℓ_{Z} (z_{j}) k (\cdot, h^{- 1} z_{j})$. Here ℓ_X(x_i) encodes the semantic information, for example LIDAR intensity and image pixel color, while k(⋅, x_i) encodes the geometric information. We can now obtain the inner product of f_X and f_{h⁻¹Z} as

{⟨ f_{X}, f_{h^{- 1} Z} ⟩}_{H} ≔ \sum_{x_{i} \in X, z_{j} \in Z} {⟨ ℓ_{X} (x_{i}), ℓ_{Z} (z_{j}) ⟩}_{I} \cdot k (x_{i}, h^{- 1} z_{j}) . (3)

We use the kernel trick (Murphy, 2012) to substitute the inner products in (3) with the semantic kernel as ${⟨ f_{X}, f_{h^{- 1} Z} ⟩}_{H} = \sum_{x_{i} \in X, z_{j} \in Z} k_{c} (ℓ_{X} (x_{i}), ℓ_{Z} (z_{j})) \cdot k (x_{i}, h^{- 1} z_{j})$. We choose k_c to be the squared exponential kernel with real hyperparameters σ_c and ℓ_c that are set independently.
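As a rough illustration of (1)-(3), the following sketch evaluates F(h) by brute force for two small labeled point clouds. It is plain Python and not the authors' implementation: the helper names (`se_kernel`, `correlation`), the hyperparameter defaults, and the toy data are assumptions made for this example, and the candidate transformation is passed in as a function that plays the role of h⁻¹ acting on the points of Z.

```python
import math

def se_kernel(a, b, sigma=1.0, ell=0.5):
    """Squared exponential kernel, as in (2), for equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sigma ** 2 * math.exp(-sq_dist / (2.0 * ell ** 2))

def correlation(X, labels_X, Z, labels_Z, h_inv, ell=0.5, ell_c=0.2):
    """Double sum over all pairs: k_c(l_X(x_i), l_Z(z_j)) * k(x_i, h^{-1} z_j)."""
    total = 0.0
    for x, lx in zip(X, labels_X):
        for z, lz in zip(Z, labels_Z):
            k_geo = se_kernel(x, h_inv(z), ell=ell)   # geometric kernel
            k_sem = se_kernel(lx, lz, ell=ell_c)      # semantic (label) kernel
            total += k_sem * k_geo
    return total

# Toy data (hypothetical): Z is X shifted by 0.3 m along the x-axis.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
labels_X = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # e.g., colors
Z = [(x + 0.3, y, z) for (x, y, z) in X]
labels_Z = list(labels_X)

def identity(p):
    return p

def undo_shift(p):
    return (p[0] - 0.3, p[1], p[2])

F_identity = correlation(X, labels_X, Z, labels_Z, identity)
F_aligned = correlation(X, labels_X, Z, labels_Z, undo_shift)
print(F_identity, F_aligned)
```

In this toy case the value is larger for the transformation that undoes the shift, which is the behavior the arg max in (1) relies on.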

2.2 Feature embedding via tensor product representation

We now extend the feature space to a hierarchical distributed representation to incorporate the full geometric and hierarchical semantic relationship between the two point clouds. Let (V₁, V₂, … ) be different inner product spaces describing different types of non-geometric features of a point, such as color, intensity, and semantics. Their tensor product, V₁ ⊗ V₂ ⊗ …, is also an inner product space. For any x ∈ X, z ∈ Z with features ℓ_X(x) = (u₁, u₂, … ) and ℓ_Z(z) = (v₁, v₂, … ), with u₁, v₁ ∈ V₁, u₂, v₂ ∈ V₂, … , we have

{⟨ ℓ_{X} (x), ℓ_{Z} (z) ⟩}_{I} = ⟨ u_{1} \otimes u_{2} \otimes \dots, v_{1} \otimes v_{2} \otimes \dots ⟩ = ⟨ u_{1}, v_{1} ⟩ \cdot ⟨ u_{2}, v_{2} ⟩ \cdot \dots . (4)

By substituting (4) into (3), we obtain ${⟨ f_{X}, f_{h^{- 1} Z} ⟩}_{H} = \sum_{x_{i} \in X, z_{j} \in Z} ⟨ u_{1 i}, v_{1 j} ⟩ \cdot ⟨ u_{2 i}, v_{2 j} ⟩ \dots k (x_{i}, h^{- 1} z_{j})$. After applying the kernel trick we arrive at

{⟨ f_{X}, f_{h^{- 1} Z} ⟩}_{H} = \sum_{x_{i} \in X, z_{j} \in Z} k (x_{i}, h^{- 1} z_{j}) \cdot \prod_{k} k_{V_{k}} (u_{k i}, v_{k j}) ≔ \sum_{x_{i} \in X, z_{j} \in Z} k (x_{i}, h^{- 1} z_{j}) \cdot c_{i j} . (5)

Each c_ij does not depend on the relative transformation. It is worth noting that, when choosing the squared exponential kernel and when the input point clouds have only geometric information, every c_ij equals 1 (the multiplicative identity), and (5) has the same formulation as Kernel Correlation (Tsin and Kanade, 2004).
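The appearance weight c_ij in (5) can be sketched as a product of one kernel per feature type. The snippet below is only an illustration: it assumes unit-variance squared exponential kernels for every feature space, and the function names, lengthscales, and feature values are hypothetical.

```python
import math

def se_kernel(u, v, ell):
    """Unit-variance squared exponential kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * ell ** 2))

def appearance_weight(features_i, features_j, lengthscales):
    """c_ij in (5): product over feature types of k_{V_k}(u_ki, v_kj)."""
    c = 1.0
    for u, v, ell in zip(features_i, features_j, lengthscales):
        c *= se_kernel(u, v, ell)
    return c

# Hypothetical features per point: (color, one-hot semantic class).
point_i = [(0.9, 0.1, 0.1), (1.0, 0.0)]
same_class = [(0.8, 0.2, 0.1), (1.0, 0.0)]
other_class = [(0.8, 0.2, 0.1), (0.0, 1.0)]
lengthscales = [0.3, 0.5]

c_same = appearance_weight(point_i, same_class, lengthscales)
c_other = appearance_weight(point_i, other_class, lengthscales)

# Geometry-only case: every point carries the same constant label.
c_geometric = appearance_weight([(1.0,)], [(1.0,)], [0.3])
print(c_same, c_other, c_geometric)
```

Because c_ij does not depend on h, these weights can be computed once and reused while the geometric kernel is re-evaluated for each candidate transformation.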

2.3 Equivariance property

If instead of working with the inverse of the transformation acting on the function basis we work with the function input, then the equivariance property becomes evident. Let $C (R^{3})$ be the set of point clouds on $R^{3}$ and $H$ be the RKHS. Let $f : C (R^{3}) \to H$ be our map which assigns a function to a point cloud. Consider the space of smooth functions on $R^{3}$, $C^{\infty} (R^{3})$, and let the group $G$ act on $R^{3}$. The action lifts to an action on $C^{\infty} (R^{3})$ via g.f(x) = f(g⁻¹x), $g \in G$. This inverse is needed to make the action a group action:

(h g) . f (x) = h . (g . f) (x) = (g . f) (h^{- 1} x) = f (g^{- 1} h^{- 1} x) = f ({(h g)}^{- 1} x), \quad h, g \in G .

Now let Z be a point cloud and f_Z be its associated function. If $G$ acts on $R^{3}$ via isometries, then k(gx, gz) = k(x, z) and we have

g . f_{Z} (x) = f_{Z} (g^{- 1} x) = \sum_{j} ℓ_{Z} (z_{j}) \cdot k (g^{- 1} x, z_{j}) = \sum_{j} ℓ_{Z} (z_{j}) \cdot k (x, g z_{j}) = f_{g Z} (x) .
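The identity above can be checked numerically. The following sketch assumes scalar labels and a particular rigid motion (a rotation about the z-axis followed by a translation); the helper names and all numerical values are chosen only for illustration.

```python
import math

def se_kernel(x, z, ell=0.5):
    """Squared exponential kernel with unit variance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * ell ** 2))

def make_isometry(theta, t):
    """Return g and g^{-1}: rotation by theta about the z-axis, then translation t."""
    c, s = math.cos(theta), math.sin(theta)

    def g(p):
        return (c * p[0] - s * p[1] + t[0], s * p[0] + c * p[1] + t[1], p[2] + t[2])

    def g_inv(p):
        q = (p[0] - t[0], p[1] - t[1], p[2] - t[2])
        return (c * q[0] + s * q[1], -s * q[0] + c * q[1], q[2])

    return g, g_inv

def cloud_function(Z, labels):
    """f_Z(x) = sum_j l_Z(z_j) k(x, z_j), with scalar labels for simplicity."""
    return lambda x: sum(l * se_kernel(x, z) for z, l in zip(Z, labels))

Z = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.2), (0.3, 1.2, 0.8)]
labels = [1.0, 0.5, 2.0]
g, g_inv = make_isometry(theta=0.7, t=(0.4, -1.0, 0.25))

f_Z = cloud_function(Z, labels)
f_gZ = cloud_function([g(z) for z in Z], labels)

x = (0.2, -0.1, 0.6)
lhs = f_Z(g_inv(x))   # (g . f_Z)(x)
rhs = f_gZ(x)         # f_{gZ}(x)
print(lhs, rhs)
```

The two values agree up to floating-point error because the kernel depends only on the distance between its arguments, which a rigid motion preserves.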

2.4 Experimental results

We present the point cloud registration experiments on real-world outdoor and indoor datasets: KITTI (Geiger et al., 2012) odometry and TUM RGB-D data set (Sturm et al., 2012), with the following setup. All experiments are performed in a frame-to-frame manner without skipping images. The first frame’s transformation is initialized with identity, and all later frames start with the previous frame’s result. The same hyperparameter values, such as the lengthscale of the kernels in (2), are used for the proposed registration methods within one data set. All the baselines except Robust-ICP (Zhang et al., 2022) use all the pixels without downsampling because they do not provide an optimal point selection scheme. Fast-Robust-ICP and the proposed methods select a subset of pixels via OpenCV’s FAST (Rosten and Drummond, 2006) feature detector to reduce the frame-wise running time.

The qualitative and quantitative results on KITTI Stereo are provided in Figures 2 and 3 and in Table 1, respectively. The baselines are GICP (Segal et al., 2009), Multichannel-ICP (Servos and Waslander, 2014), 3D-NDT (Magnusson et al., 2007), and Robust-ICP (Zhang et al., 2022). GICP and NDT are compared with our geometric registration method (Geometric CVO, i.e., ℓ_X(x_i) = ℓ_Z(z_j) = 1). Multichannel-ICP competes with our color-assisted registration method (Color CVO). The GICP and 3D-NDT implementations are from PCL (Rusu and Cousins, 2011). The Robust-ICP implementation is from its open-source GitHub repository. The Multichannel-ICP implementation is from (Parkison et al., 2019). The semantic predictions of the images come from Nvidia’s pre-trained neural network (Zhu et al., 2019), which was trained on 200 labeled images on KITTI. The depth values of the stereo images are generated with ELAS (Geiger et al., 2010). All the baselines and the proposed methods remove the first 100 rows of image pixels that mainly include sky pixels, as well as points that are more than 55 m away. Averaged over sequences 00 to 10, our geometric method has a lower translational error (4.55%) compared to GICP (11.23%), NDT (8.50%), and Robust-ICP (11.02%). Our color version has a lower average translational drift (3.69%) than Multichannel-ICP (14.10%). If we add semantic information, the error is further reduced (3.64%). In addition, excluding the image I/O and point cloud generation operations, the proposed implementations take on average 1.4 s per frame on GPU when registering fewer than 15k downsampled points. Fast-Robust-ICP also takes downsampled point clouds and takes 0.3 s per frame on CPU. GICP, NDT, and Multichannel-ICP on CPU use full point clouds (150k-350k points), and take 6.3, 6.6, and 57 s per frame, respectively.

FIGURE 2. Stacked semantic and color point clouds based on frame-to-frame registration results using KITTI (Geiger et al., 2012) LiDAR, TUM RGB-D (Sturm et al., 2012) and KITTI Stereo sensors.

FIGURE 3. An illustration of the proposed registration methods on KITTI Stereo (Geiger et al., 2012) sequence 01 (left) and 07 (right) versus the baselines. The black dashed trajectory is the ground truth. The dot-dashed trajectories are the baselines. Plotted with EVO (Grupp, 2017).

TABLE 1. Results of the proposed frame-to-frame method using the KITTI (Geiger et al., 2012) stereo odometry benchmark averaged over Sequence 00–10. The table lists the average drift in translation, as a percentage (%), and rotation, in degrees per meter (°/m). The drifts are calculated for all possible subsequences of 100, 200, …, 800 m.

The qualitative and quantitative results on TUM RGB-D are provided in Figure 2 and Table 2, respectively. We evaluated our method on the fr1 sequences, which are recorded in an office environment, and fr3 sequences, which contain image sequences in structured/nostructured and texture/notextured environments. We use the same baselines for geometric registration as for KITTI. We compare Color CVO with Dense Visual Odometry (DVO) (Kerl et al., 2013) and Color ICP (Park et al., 2017). We reproduced DVO results with the code from (Pizenberg, 2019). The Color ICP implementation is taken from Open3D (Zhou et al., 2018). From Table 2, the proposed geometric registration outperforms the geometric baselines and achieves a similar performance to DVO and Color ICP. Moreover, with color information, the average error of the proposed registration decreases.

TABLE 2. The RMSE of Relative Pose Error (RPE) averaged over TUM RGB-D (Sturm et al., 2012) fr1 and fr3 structure vs. texture sequences. The t columns show the RMSE of the translational drift in m/sec and the r columns show the RMSE of the rotational error in deg/sec. The RMSE is averaged over all sequences.

2.5 Discussions and limitations

Results in Section 2.4 demonstrate that embedding features like color and semantics in function representations provides finer data associations. Specifically, in (5), the extra appearance information c_ij encodes the similarity in color or semantics between the two associated points. It eliminates pairwise associations whose color or semantic appearances do not agree. Moreover, each point x_i ∈ X is matched to multiple points z_j ∈ Z. The proposed color registration significantly improves over geometric-only methods in both KITTI Stereo and TUM RGB-D datasets.

One limitation of the proposed method is the computational complexity introduced by the double sum in (5). However, the double sum is sparse because a point x_i ∈ X is far away from the majority of the points z_j ∈ Z, either in the spatial (geometry) space or one of the feature (semantic) spaces. Even so, the remaining similarities still have to be calculated with the help of GPU implementations or K-nearest-neighbor search (Blanco and Rai, 2014). In practice, an efficient point selection mechanism like the FAST (Rosten and Drummond, 2006) corner selector or DSO’s (Engel et al., 2017) image gradient-based pixel selector can reduce the computation time. Alternatively, representation learning can be a way to reduce the number of input points while providing richer features.
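To give a sense of how sparse the double sum is, the sketch below compares the full geometry-only sum with a version that skips pairs farther apart than a cutoff radius. This is not the GPU or K-nearest-neighbor implementation referred to above: the cutoff of four lengthscales, the random test data, and the function name are assumptions, and the plain loop still visits every pair, so a spatial index would be needed to obtain an actual speed-up.

```python
import math
import random

def correlation(X, Z, ell, cutoff=None):
    """Geometry-only double sum; pairs beyond `cutoff` are skipped when it is set."""
    total, skipped = 0.0, 0
    for x in X:
        for z in Z:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
            if cutoff is not None and sq_dist > cutoff ** 2:
                skipped += 1   # this term is smaller than exp(-cutoff^2 / (2 ell^2))
                continue
            total += math.exp(-sq_dist / (2.0 * ell ** 2))
    return total, skipped

rng = random.Random(0)
X = [tuple(rng.uniform(0.0, 10.0) for _ in range(3)) for _ in range(200)]
Z = [tuple(rng.uniform(0.0, 10.0) for _ in range(3)) for _ in range(200)]
ell = 0.25

full, _ = correlation(X, Z, ell)
sparse, skipped = correlation(X, Z, ell, cutoff=4.0 * ell)

# Each skipped term is at most exp(-(4 ell)^2 / (2 ell^2)) = exp(-8).
bound = skipped * math.exp(-8.0)
print(skipped, full - sparse, bound)
```

For points spread over a volume much larger than the lengthscale, almost all pairs fall outside the cutoff, and the truncation error stays below the stated bound.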

3 Learning-aided invariant robot state estimation

Matrix Lie groups (Chirikjian, 2011; Hall, 2015; Barfoot, 2017) provide natural (exponential) coordinates that exploit symmetries of the space (Long et al., 2013; Barfoot and Furgale, 2014; Forster et al., 2016; Mangelson et al., 2020; Mahony and Trumpf, 2021; Brossard et al., 2022). State estimation is the problem of determining a robot’s position, orientation, and velocity, which are vital for robot control (Barfoot, 2017). An interesting class of state estimators that can be run at high frequency, e.g., 2 kHz, is based on the Invariant Extended Kalman Filter (InEKF) (Barrau, 2015; Barrau and Bonnabel, 2017; Barrau and Bonnabel, 2018). The theory of invariant observer design is based on the estimation error being invariant under the action of a matrix Lie group. The fundamental result is that by correct parametrization of the error variable, a wide range of nonlinear problems can lead to (log) linear error dynamics (Bonnabel et al., 2009; Barrau, 2015; Barrau and Bonnabel, 2017).

Proprioceptive state estimators often combine data from an Inertial Measurement Unit (IMU) with signals such as body velocity, kinematics information, and contact events. A successful method in this domain for legged robots is the contact-aided InEKF (Hartley et al., 2020). This approach is attractive because the odometry estimate only depends on inertial, contact, and kinematic data, which, barring sensor failure, always exist. Furthermore, the independence from any vision system makes the state estimator robust to perceptually degraded situations (Hartley et al., 2018; Lin et al., 2022). Many existing perception and navigation methods can work well, given a correct though uncertain initial condition; hence, such an accurate dead reckoning can enable higher levels of autonomy for existing systems.

The invariant observer design provides us with a framework with better convergence properties. However, sensory data input likewise plays a crucial role in state estimation tasks. Noisy and biased measurements can hinder the performance of the observer. On the other hand, sensor failures can lead to catastrophic results in state estimation. Recent deep learning methods allow one to address these challenges by estimating the bias or inferring the information that traditional sensors cannot obtain (Liu et al., 2018; Wellhausen et al., 2019). By combining learning with the symmetry-preserving observer design, the performance and robustness of a state estimator can be greatly improved (Brossard et al., 2019; Brossard et al., 2020).

This section reports our recent developments on a deep-learning-aided invariant state estimator (Lin et al., 2022). In this work, a deep contact estimator is designed to estimate the foot contact events for legged robots. The learned foot contacts are then used to enforce the non-slip constraint in an InEKF. Although the complete state estimation pipeline is purely proprioceptive, it can achieve a similar performance to a state-of-the-art visual SLAM system. In addition, the program, including the deep contact estimator, runs in real-time (500 Hz) on an MIT Mini Cheetah robot. We also report our new results on developing the InEKF for wheeled platforms in Section 3.4. The data sets and software are available for download.

3.1 Deep contact estimator

The goal of the deep contact estimator is to accurately estimate the foot contact events where the robot’s foot maintains zero velocity in the world frame. We model the contact as binary events on each leg l ∈ {RF, LF, RH, LH}. The overall contact state of the robot becomes a collection of binary values $C = [\begin{matrix} c_{R F} & c_{L F} & c_{R H} & c_{L H} \end{matrix}]$, where c_l ∈ {0, 1}, with 0 indicating no contact and 1 denoting a firm contact. For a quadruped robot, there exist 16 different combinations of the contact states. We formulate our approach as a classification task¹.
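One way the four binary contacts could be mapped to the 16 classes is a binary-to-integer encoding. The sketch below assumes RF is the most significant bit; the bit order and the function names are assumptions for illustration and are not necessarily the encoding used by Lin et al. (2022).

```python
LEGS = ("RF", "LF", "RH", "LH")

def contacts_to_class(contacts):
    """Map a dict of per-leg binary contacts to a class index in 0..15."""
    index = 0
    for leg in LEGS:
        index = (index << 1) | int(contacts[leg])
    return index

def class_to_contacts(index):
    """Inverse mapping: class index in 0..15 back to per-leg binary contacts."""
    bits = [(index >> shift) & 1 for shift in range(len(LEGS) - 1, -1, -1)]
    return dict(zip(LEGS, bits))

# Hypothetical contact state: one diagonal pair of feet on the ground.
diagonal_pair = {"RF": 1, "LF": 0, "RH": 0, "LH": 1}
print(contacts_to_class(diagonal_pair))   # 0b1001, i.e., class 9
print(class_to_contacts(9))
```

With such an encoding, a single 16-way classifier output determines the contact state of all four legs at once.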

The contact estimator takes sensor measurements from an IMU, joint encoders, and kinematics as input. To allow the network to extract information from the time domain, a fixed number of past measurements are concatenated before being fed into the network. Figure 4 lists the input data along with the network architecture. The linear block contains 3 fully-connected layers that convert the deep features into the 16 classes. Dropout mechanisms are also added to the first 2 fully-connected layers to prevent the network from overfitting. Finally, we employ the cross-entropy loss for the classification task.

FIGURE 4. The architecture of the proposed contact estimator (Lin et al., 2022). The inputs include linear accelerations and angular velocities from an IMU, joint angles and joint velocities from encoders, and foot positions and velocities from kinematics.

3.2 Contact data sets

We create open-sourced contact data sets using an MIT Mini Cheetah robot (Katz et al., 2019). The data sets are collected using an MIT controller (Kim et al., 2019) across 8 different terrains (shown in Figure 5). We record proprioceptive measurements such as joint encoder data, foot positions and velocities, IMU measurements, and estimated joint torques from the controller. The IMU measurements are received at 1000 Hz, while other data are recorded at 500 Hz. We upsample all measurements to match the IMU frequency after recording the data. In addition to the proprioceptive measurements, we also record RGB-D images with an Intel D455 camera mounted on top of the robot. These RGB-D images are used in a state-of-the-art visual SLAM algorithm, ORB SLAM2 (Mur-Artal and Tardós, 2017). For the grass data sets, we obtain ground truth trajectories from a motion capture system. However, for the rest of the data sets, we use the trajectory from ORB SLAM2 as an approximation to ground truth. In total, around 1,000,000 data points were collected on 8 different terrains. We also include some examples of the robot walking in the air, obtained by holding the robot up and applying the same controller commands, to provide the network with negative examples. The number of data points collected on each terrain is listed in Table 3. The labels of the ground truth contacts are generated automatically with an offline pre-processing algorithm (self-supervised learning). Details of the algorithm can be found in the work of Lin et al. (2022).

FIGURE 5. (A) Setup of an MIT Mini Cheetah with the perception suite used in the data collection. (B) Different ground types in the contact data set.

TABLE 3. Number of data points for each terrain in the contact data sets.

3.3 Experimental results

We evaluate the accuracy, false positive rate, and false negative rate of the proposed contact estimator using the Mini Cheetah robot, as shown in Table 4. We compare our method with a model-based approach (Focchi et al., 2013; Fakoorian et al., 2016; Fink and Semini, 2020), denoted GRF Thresholding, and a fixed gait cycle assumption, which assumes the pre-determined gait cycle is precisely followed by the controller. Our method performs the best across all three sequences. It is worth noting that the proposed contact estimator has the lowest false positive rate, which is crucial for state estimation tasks as the violation of the non-slip condition could lead to severe drift in the estimation.
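For reference, the three metrics can be computed from per-sample binary contact predictions as in the following sketch. It uses the standard definitions (false positive rate as FP / (FP + TN), false negative rate as FN / (FN + TP)); whether Table 4 uses exactly these normalizations is an assumption, and the function name and sample values are hypothetical.

```python
def contact_metrics(predicted, ground_truth):
    """Accuracy, false positive rate, and false negative rate for binary contacts."""
    pairs = list(zip(predicted, ground_truth))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    tn = sum(1 for p, g in pairs if p == 0 and g == 0)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)   # contact predicted, foot in the air
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)   # contact missed
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr

# Hypothetical contact sequence for a single leg.
ground_truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
accuracy, fpr, fnr = contact_metrics(predicted, ground_truth)
print(accuracy, fpr, fnr)
```

A false positive here means the filter would apply the non-slip constraint to a foot that is actually moving, which is why this rate matters most for the InEKF.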

TABLE 4. Accuracy comparison against baselines. The proposed method achieves the highest accuracy on all sequences. Although the gait cycle method has an accuracy closer to the proposed method, it does not remove false positives when gait cycle is violated.

We integrated our contact estimator into the contact-aided InEKF. The entire state estimation pipeline, including our deep contact estimator, runs in real-time at 500 Hz on an NVIDIA Jetson AGX Xavier. Figure 6 shows the trajectory generated by the InEKF using different contact sources on a concrete loop sequence. We also run the filter using the ground truth contact data to serve as a reference. Qualitatively compared to the baseline contact detectors, the resulting trajectory with the proposed contact estimation has smaller drifts from the trajectory with ground truth contacts, especially in the height (Y) axis. Furthermore, compared to the baseline contact estimators, the proposed method also yields a smoother trajectory.

FIGURE 6. Concrete short loop test sequence. Top Left: The bird’s-eye view of the trajectories. The estimated trajectory is mapped to the camera frame (Y pointing downward, and Z pointing forward). Top Right: Zoomed-in view of the bird’s-eye view. Bottom Left: This figure shows that the gait cycle and GRF thresholding methods produce a significant height (Y) drift. Bottom Right: Robot configuration.

3.4 Invariant EKF with body velocity measurements

In addition to legged robots, we also develop state estimation software for wheeled robots using the InEKF. Instead of using the foot contact, here we use the body velocity as measurements in the correction step. Although the implementation is not restricted to a specific platform, we evaluate the performance of the filter on a differential-drive wheeled robot, Husky, from Clearpath Robotics. We obtain the body velocity measurements from wheel encoders using a simple differential-drive model, $v_{body} = \frac{r (ω_{l} + ω_{r})}{2}$, where ω_l and ω_r are wheel angular velocities measured by the wheel encoders and r is the wheel radius. Moreover, we also use pseudo velocity measurements by assuming zero velocities on the Y and Z axes (Dissanayake et al., 2001). However, this estimation can be noisy and inaccurate due to slipping or bumping of the wheels. To assess the full potential of this framework, we also record several sequences in a motion capture facility and use the velocity from the motion capture system to correct the estimated state. Figure 7 shows the resulting trajectories. Using the wheel velocity and pseudo velocity measurements, the state estimator can produce a good estimation of the robot pose. If the accuracy of the velocity is improved, then the drift can be further reduced.
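The velocity measurement described above can be sketched in a few lines. The function names, the wheel radius, and the wheel speeds below are hypothetical, and the forward direction is assumed to be the body X axis, consistent with the zero pseudo measurements on Y and Z.

```python
def body_velocity(omega_left, omega_right, wheel_radius):
    """Forward body velocity from the differential-drive model v = r (w_l + w_r) / 2."""
    return wheel_radius * (omega_left + omega_right) / 2.0

def velocity_measurement(omega_left, omega_right, wheel_radius):
    """Body-frame velocity (X, Y, Z) with zero pseudo measurements on Y and Z."""
    return (body_velocity(omega_left, omega_right, wheel_radius), 0.0, 0.0)

# Hypothetical values: 0.165 m wheel radius, wheel speeds in rad/s.
straight = velocity_measurement(6.0, 6.0, 0.165)
spin = velocity_measurement(-3.0, 3.0, 0.165)   # turning in place
print(straight, spin)
```

The second case shows that the model reports zero forward velocity when the wheels spin in opposite directions. Any wheel slip leaves the encoder readings unchanged and is therefore invisible to this measurement.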

FIGURE 7. Top: Two sequences of trajectories recorded at the University of Michigan MAir motion capture facility. The green lines are the InEKF estimated trajectories using velocity estimated from the wheel encoders, and the blue lines are the InEKF results using velocity from the motion capture system. Bottom: The robot setup. The ground of the facility is planted with natural grass.

Although this section does not discuss the incorporation of learning into the InEKF state estimator, as done previously for the legged robot, the following lessons from our experiments are noteworthy.

• Body velocity measurements provide a generic correction model that can work on any robotic platform. However, accurate body velocity measurement is not readily available. Specifically, the filter requires the ground-referenced body velocity (Teng et al., 2021b; Potokar et al., 2021).

• The robot’s nonholonomic constraints (i.e., velocity constraints that cannot be integrated) can provide pseudo observations that can significantly improve the performance. However, these constraints are assumptions and detached from the robot’s behavior. Learning such constraints provides a way to use sensory inputs instead of assumptions (Brossard et al., 2019; Brossard et al., 2020).

• Moreover, the nonholonomic constraints are violated when the robot drifts. Slip detection and friction estimation are challenging and necessary tasks for future learning-aided robot estimation modules.

4 Symmetry-preserving geometric robot control

The geometry of the configuration space of a robotics system can naturally be modeled using matrix Lie (continuous) groups (Bloch, 2015; Lynch and Park, 2017). For example, the centroidal dynamics of legged robots can be approximated by a single rigid body, whose motion is on SE(3).

The Euler-angle-based convex Model Predictive Control (MPC) (Di Carlo et al., 2018) has been proposed for locomotion planning on quadrupedal robots. Its zero roll and pitch angle assumptions are justified by assuming flat ground, so the controller may fail when these assumptions no longer hold. To avoid this problem, geometric MPC that utilizes the symmetry of the Lie group has been proposed. A local control law has been proposed by Kalabić et al. (2016); Kalabić et al. (2017), where the linearized dynamics are defined by a local diffeomorphism from the SE(3) manifold to $R^{n}$ space. However, such a diffeomorphism is not unique and is too abstract for controller design.

The Variational Based Linearization (VBL) technique (Wu and Sreenath, 2015) is applied to linearize the Lagrangian to obtain the discrete-time equation of motion and has been applied to robot pose control (Chignoli and Wensing, 2020). A VBL-based MPC is proposed by Agrawal et al. (2021) for locomotion on discrete terrain using a gait library. The result suggests that the VBL-based linearization can preserve the energy, thus making the system more stable. However, the VBL method linearizes the system at the reference trajectory, which may result in unstable motion (Ding et al., 2021). Rather than linearizing at the reference trajectory, Ding et al. (2021) linearized the system at the current operating point to obtain the QP problem for legged robot trajectory tracking. However, the linearized state matrix of Ding et al. (2021) depends on the orientation, which can be avoided by exploiting the symmetry of the system as done by Teng et al. (2022a,b). The proposed framework is illustrated in Figure 8.

FIGURE 8. The proposed error-state MPC framework by Teng et al. (2022a). The tracking error is defined on a matrix Lie group and linearized in the Lie algebra. A convex MPC algorithm is derived via the linearized dynamics for tracking control. The proposed algorithm is applied to a single rigid body system and verified on a quadrupedal robot MIT Mini Cheetah. A quadratic cost function in the Lie algebra can verify the stability of the proposed MPC.

4.1 Error-state convex MPC

For tracking control on a Lie group $G$, we define the desired trajectory as $X_{d, t} \in G$ and the actual state as $X_{t} \in G$, both as functions of time t. Given the twists ξ_t and desired twists ξ_d,t and the reconstruction equation, we have $\frac{d}{d t} X_{t} = X_{t} ξ_{t}^{\land}, \frac{d}{d t} X_{d, t} = X_{d, t} ξ_{d, t}^{\land}$. Similar to the left or right error defined in (Bullo and Murray, 1999), we define the error between $X_{d, t}$ and X_t as

Ψ_{t} = X_{d, t}^{- 1} X_{t} \in G . (6)

For the tracking problem, our goal is to drive the error from the initial condition Ψ₀ to the identity $I \in G$ . Taking derivative on both sides of (6), we have

\begin{align} \frac{d}{d t} Ψ_{t} & = {\dot{Ψ}}_{t} = \frac{d}{d t} (X_{d, t}^{- 1}) X_{t} + X_{d, t}^{- 1} \frac{d}{d t} X_{t} = X_{d, t}^{- 1} \frac{d}{d t} X_{t} - X_{d, t}^{- 1} \frac{d}{d t} (X_{d, t}) X_{d, t}^{- 1} X_{t} \\ & = X_{d, t}^{- 1} X_{t} ξ_{t}^{\land} - X_{d, t}^{- 1} X_{d, t} ξ_{d, t}^{\land} X_{d, t}^{- 1} X_{t} = Ψ_{t} ξ_{t}^{\land} - ξ_{d, t}^{\land} Ψ_{t}, \\ {\dot{Ψ}}_{t} & = Ψ_{t} (ξ_{t}^{\land} - Ψ_{t}^{- 1} ξ_{d, t}^{\land} Ψ_{t}) = Ψ_{t} {(ξ_{t} - {A d}_{Ψ_{t}^{- 1}} ξ_{d, t})}^{\land} . \end{align} (7)

We define $ψ_{t}^{\land}$ as the element of the Lie algebra that corresponds to Ψ_t. Thus, by the exponential map, we have $Ψ_{t} = \exp (ψ_{t}), Ψ_{t} \in G, ψ_{t}^{\land} \in g$. Given the first-order approximation of the exponential map, $Ψ_{t} = \exp (ψ_{t}) \approx I + ψ_{t}^{\land}$, and a first-order approximation of the adjoint map ${A d}_{Ψ_{t}} \approx {A d}_{I + {ψ_{t}}^{\land}}$, we can linearize (7) by keeping only the first-order terms in ψ_t and ξ_t − ξ_d,t as:

{\dot{Ψ}}_{t} \approx \frac{d}{d t} (I + ψ_{t}^{\land}) = {\dot{ψ}}_{t}^{\land} \approx (I + ψ_{t}^{\land}) {(ξ_{t} - {A d}_{(I - ψ_{t}^{\land})} ξ_{d, t})}^{\land}, (8)

{\dot{ψ}}_{t} = - {a d}_{ξ_{d, t}} ψ_{t} + ξ_{t} - ξ_{d, t} . (9)

Eq. 9 is the linearized velocity error in the Lie algebra.
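The step from Eq. 8 to Eq. 9 can be spelled out as follows. This is a sketch under the standard first-order expansion ${A d}_{\exp (ψ)} \approx I + {a d}_{ψ}$ and the antisymmetry ${a d}_{ψ} ξ = - {a d}_{ξ} ψ$ of the Lie bracket. Products of two small quantities (ψ_t with itself, or ψ_t with ξ_t − ξ_d,t) are dropped, and the time derivative of $I + ψ_{t}^{\land}$ is ${\dot{ψ}}_{t}^{\land}$ because the identity is constant.

```latex
\begin{align}
{A d}_{(I - ψ_{t}^{\land})} ξ_{d, t} & \approx ξ_{d, t} - {a d}_{ψ_{t}} ξ_{d, t} = ξ_{d, t} + {a d}_{ξ_{d, t}} ψ_{t}, \\
{\dot{ψ}}_{t}^{\land} & \approx (I + ψ_{t}^{\land}) {(ξ_{t} - ξ_{d, t} - {a d}_{ξ_{d, t}} ψ_{t})}^{\land} \approx {(ξ_{t} - ξ_{d, t} - {a d}_{ξ_{d, t}} ψ_{t})}^{\land} .
\end{align}
```

Removing the hat on both sides of the last line gives Eq. 9.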

The dynamics of ξ_t is described by the forced Euler-Poincaré equations (Bloch et al., 1996; Bloch, 2015) as $J_{b} \dot{ξ} = {a d}_{ξ}^{*} J_{b} ξ + u$, where $J_{b}$ is the generalized inertia in the body frame, $u \in g^{*}$ is the generalized control input force applied to the body-fixed principal axes, ad∗ is the co-adjoint action, and $g^{*}$ is the dual space of the Lie algebra (the cotangent space at the identity). This model is nonlinear. To compute a locally linear approximation of the nonlinear term, we adopt the Jacobian linearization around the operating point $\bar{ξ}$ as $J_{b} \dot{ξ} \approx {a d}_{\bar{ξ}}^{*} J_{b} \bar{ξ} + \frac{\partial {a d}_{ξ}^{*} J_{b} ξ}{\partial ξ} |_{\bar{ξ}} (ξ - \bar{ξ}) + u$. Thus, we have the linearized dynamics in the form $\dot{ξ} = H_{t} ξ + J_{b}^{- 1} u + b_{t}$, where $H_{t} ≔ J_{b}^{- 1} \frac{\partial {a d}_{ξ}^{*} J_{b} ξ}{\partial ξ} |_{\bar{ξ}}$ and $b_{t} ≔ J_{b}^{- 1} {a d}_{\bar{ξ}}^{*} J_{b} \bar{ξ} - H_{t} \bar{ξ}$ collect the linear and constant terms of the expansion. We define the system states as $x_{t} ≔ [\begin{matrix} ψ_{t} \\ ξ_{t} \end{matrix}]$. Then, the linearized dynamics becomes ${\dot{x}}_{t} = A_{t} x_{t} + B_{t} u_{t} + h_{t}$, where

A_{t} ≔ [\begin{matrix} - {a d}_{ξ_{d, t}} & I \\ 0 & H_{t} \end{matrix}], B_{t} ≔ [\begin{matrix} 0 \\ J_{b}^{- 1} \end{matrix}], h_{t} ≔ [\begin{matrix} - ξ_{d, t} \\ b_{t} \end{matrix}] .
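To make the block structure concrete, the following minimal sketch assembles $A_t$, $B_t$, and $h_t$ for the rotation-only case G = SO(3). Several things here are assumptions for illustration rather than the article's implementation: the adjoint is taken as ad_ω = hat(ω), the inertia is diagonal, the nonlinear term is written as Euler's rigid-body equation $J_b \dot{ω} = (J_b ω) \times ω + u$, and all function names and numerical values are hypothetical.

```python
# Sketch for G = SO(3): the state is x = [psi; omega] in R^6.
# Assumptions (not from the article): ad_w = hat(w), a diagonal inertia J_b,
# and Euler's rigid-body equation J_b w_dot = (J_b w) x w + u.


def hat(w):
    """Skew-symmetric matrix of a 3-vector, so that hat(w) v = w x v."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def nonlinear_wdot(J, w, u):
    """Nonlinear dynamics w_dot = J^{-1} ((J w) x w + u) for diagonal J."""
    Jw = [J[i] * w[i] for i in range(3)]
    f = cross(Jw, w)
    return [(f[i] + u[i]) / J[i] for i in range(3)]


def linearize(J, w_bar, w_d):
    """Return A (6x6), B (6x3), h (6) of x_dot = A x + B u + h."""
    Jw = [J[i] * w_bar[i] for i in range(3)]
    hJw, hw = hat(Jw), hat(w_bar)
    # Jacobian of (J w) x w at w_bar: hat(J w_bar) - hat(w_bar) J.
    D = [[hJw[i][j] - hw[i][j] * J[j] for j in range(3)] for i in range(3)]
    H = [[D[i][j] / J[i] for j in range(3)] for i in range(3)]
    f = cross(Jw, w_bar)
    Dw = matvec(D, w_bar)
    b = [(f[i] - Dw[i]) / J[i] for i in range(3)]
    ad = hat(w_d)
    A = [[0.0] * 6 for _ in range(6)]
    B = [[0.0] * 3 for _ in range(6)]
    for i in range(3):
        A[i][i + 3] = 1.0              # identity block
        B[i + 3][i] = 1.0 / J[i]       # J_b^{-1} block
        for j in range(3):
            A[i][j] = -ad[i][j]        # -ad_{xi_d} block
            A[i + 3][j + 3] = H[i][j]  # H_t block
    h = [-v for v in w_d] + b
    return A, B, h


def xdot(A, B, h, x, u):
    Ax, Bu = matvec(A, x), matvec(B, u)
    return [Ax[i] + Bu[i] + h[i] for i in range(6)]


J = [0.07, 0.26, 0.242]    # hypothetical principal inertias
w_bar = [0.3, -0.2, 0.5]   # operating point of the linearization
w_d = [0.1, 0.0, 0.4]      # desired twist
psi = [0.05, -0.02, 0.01]  # tracking error in the Lie algebra
u = [0.01, 0.02, -0.03]    # control input
A, B, h = linearize(J, w_bar, w_d)
dx = xdot(A, B, h, psi + w_bar, u)
```

At the operating point, the last three entries of `dx` coincide with the nonlinear `nonlinear_wdot(J, w_bar, u)`, and the first three reproduce Eq. 9.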

4.2 Convex MPC design

On Lie groups, our cost function is designed to regulate the tracking error ψ_t and its derivative ${\dot{ψ}}_{t}$ rather than the difference between ξ_d,t and ξ_t. Thus, our tracking error can be designed as:

y_{t} ≔ [\begin{matrix} ψ_{t} \\ {\dot{ψ}}_{t} \end{matrix}] = [\begin{matrix} I & 0 \\ - {a d}_{ξ_{d, t}} & I \end{matrix}] x_{t} - [\begin{matrix} 0 \\ ξ_{d, t} \end{matrix}] . (10)

Given positive semidefinite weights P, Q, and R, we can now write the quadratic cost function as

N (y_{t_{f}}) = y_{t_{f}}^{T} P y_{t_{f}}, L (y_{t}, u_{t}) = y_{t}^{T} Q y_{t} + u_{t}^{T} R u_{t} . (11)

Given the future twists ξ_d,t, initial error state ψ₀ and twist ξ₀, we can define all the matrices. Discretizing the system at time steps $\{t_{k}\}_{k = 1}^{N}$, we can design the MPC as follows.

Problem 2. Find $u_{k} \in g^{*}$ such that

\begin{array}{ll} \min_{u_{k}} & y_{N}^{T} P y_{N} + \sum_{k = 1}^{N - 1} (y_{k}^{T} Q y_{k} + u_{k}^{T} R u_{k}) \\ \text{s.t.} & x_{k + 1} = A_{k} x_{k} + B_{k} u_{k} + h_{k}, \quad u_{k} \in U_{k}, \quad x_{0} = x (0) . \end{array}

In Problem 2, A_k, B_k, and h_k can be obtained by zero-order hold or Euler first-order integration. Problem 2 is a QP problem that can be solved efficiently, e.g., using OSQP (Stellato et al., 2020).
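The Euler first-order option and the cost of Problem 2 can be sketched as below. This is a hedged illustration, not the authors' solver: it does not call OSQP and only evaluates the objective for a candidate input sequence. The names `euler_discretize` and `rollout_cost` are hypothetical, the matrices are held constant over the horizon for brevity, and the stage cost is summed over every applied input (k = 0, ..., N − 1), which is an indexing assumption.

```python
def euler_discretize(A, B, h, dt):
    """First-order (Euler) discretization: A_k = I + dt A, B_k = dt B, h_k = dt h."""
    n = len(A)
    Ak = [[(1.0 if i == j else 0.0) + dt * A[i][j] for j in range(n)]
          for i in range(n)]
    Bk = [[dt * b for b in row] for row in B]
    hk = [dt * v for v in h]
    return Ak, Bk, hk


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def quad(M, v):
    """Quadratic form v^T M v."""
    return sum(vi * mi for vi, mi in zip(v, matvec(M, v)))


def rollout_cost(Ak, Bk, hk, C, d, P, Q, R, x0, inputs):
    """Simulate x_{k+1} = A_k x_k + B_k u_k + h_k and accumulate the cost.

    The output is y_k = C x_k - d, mirroring Eq. 10. The stage cost is summed
    over every applied input (k = 0, ..., N-1); this indexing is an assumption.
    """
    x, cost = list(x0), 0.0
    for u in inputs:
        y = [c - di for c, di in zip(matvec(C, x), d)]
        cost += quad(Q, y) + quad(R, u)
        Ax, Bu = matvec(Ak, x), matvec(Bk, u)
        x = [Ax[i] + Bu[i] + hk[i] for i in range(len(x))]
    yN = [c - di for c, di in zip(matvec(C, x), d)]
    return cost + quad(P, yN), x


# Hypothetical double-integrator example with a two-step horizon.
Ak, Bk, hk = euler_discretize([[0.0, 1.0], [0.0, 0.0]], [[0.0], [1.0]],
                              [0.0, 0.0], dt=0.1)
identity = [[1.0, 0.0], [0.0, 1.0]]
cost, x_final = rollout_cost(Ak, Bk, hk, identity, [0.0, 0.0],
                             identity, identity, [[0.1]],
                             [1.0, 0.0], [[-1.0], [-1.0]])
```

A QP solver would minimize this same objective over the input sequence subject to $u_k \in U_k$; because the dynamics are affine and the cost is quadratic, the problem stays convex.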

4.3 Stability analysis

The stability of the proposed controller can be verified by a quadratic Lyapunov function in the Lie algebra. First, we introduce the left-invariant inner product. Then, we can derive the gradients of the quadratic cost function in the tangent space.

Definition 1. Given $ϕ_{1}, ϕ_{2} \in R^{\dim g}$ and $ϕ_{1}^{\land}, ϕ_{2}^{\land} \in g$ , we define the inner product ${⟨ ϕ_{1}^{\land}, ϕ_{2}^{\land} ⟩}_{g} = ϕ_{1}^{T} P ϕ_{2}$ , where P is a positive definite matrix. This inner product is left-invariant. To see this, suppose $X ϕ_{1}^{\land}, X ϕ_{2}^{\land} \in T_{X} G$ , $\forall X \in G$ , then ${⟨ X ϕ_{1}^{\land}, X ϕ_{2}^{\land} ⟩}_{X} = {⟨ {(ℓ_{X^{- 1}})}_{*} X ϕ_{1}^{\land}, {(ℓ_{X^{- 1}})}_{*} X ϕ_{2}^{\land} ⟩}_{g} = {⟨ ϕ_{1}^{\land}, ϕ_{2}^{\land} ⟩}_{g}$ , where ${(ℓ_{X^{- 1}})}_{*} = X^{- 1} : T_{X} G \to g$ is the pushforward map.

Theorem 1. Consider the state $X \in G$ , $ϕ \in R^{\dim g}$ , and X = exp(ϕ). We consider the metric in Definition 1. The function $h = \frac{1}{2} ‖ ϕ ‖_{P}^{2}$ is a candidate Lyapunov function and the gradient of h with respect to X is ∇h = Xϕ^∧.

Finally, we show that linear feedback in the Lie algebra can regulate the state to the identity exponentially.

Theorem 2. Consider the state in Theorem 1 as a trajectory. Let $ξ^{\land} \in g$ . The system $\dot{X} = X ξ^{\land}$ can be exponentially stabilized to X = I by linear feedback ξ = Kϕ, where K is a gain matrix that is Hurwitz.

Detailed proofs of the theorems are presented in the work of Teng et al. (2022b). For the proposed MPC, we can follow the same steps and estimate the region of attraction. For the unconstrained case, the resulting LQR problem leads to a linear feedback that can be verified by Theorem 2.
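Theorem 2 can be illustrated numerically on SO(3). The sketch below is an assumption-laden toy, not the proof: it picks the simplest Hurwitz gain K = −k I, uses P = I in the Lyapunov function, and integrates $\dot{X} = X ξ^{\land}$ with a piecewise-constant twist over each step. The function names are hypothetical. With this gain the twist is parallel to ϕ, so each step scales ϕ by exactly (1 − k·dt) and the decay of ‖ϕ‖ is geometric.

```python
import math


def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_so3(phi):
    """Rodrigues' formula for exp(hat(phi))."""
    theta = math.sqrt(sum(p * p for p in phi))
    K = hat(phi)
    K2 = matmul(K, K)
    if theta < 1e-9:
        a, b = 1.0, 0.5
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def log_so3(X):
    """Inverse of exp_so3, valid for rotation angles below pi."""
    c = max(-1.0, min(1.0, (X[0][0] + X[1][1] + X[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    v = [X[2][1] - X[1][2], X[0][2] - X[2][0], X[1][0] - X[0][1]]
    scale = 0.5 if theta < 1e-9 else theta / (2.0 * math.sin(theta))
    return [scale * vi for vi in v]


def simulate(phi0, gain, dt, steps):
    """Apply xi = K phi with K = -gain * I and record ||phi|| at every step."""
    X = exp_so3(phi0)
    norms = []
    for _ in range(steps + 1):
        phi = log_so3(X)
        norms.append(math.sqrt(sum(p * p for p in phi)))
        xi = [-gain * p for p in phi]
        # Integrate X_dot = X xi^ over one step with xi held constant.
        X = matmul(X, exp_so3([dt * x for x in xi]))
    return norms


norms = simulate([0.3, -0.5, 0.8], gain=2.0, dt=0.01, steps=200)
lyapunov = [0.5 * n * n for n in norms]  # h = 0.5 * ||phi||^2 with P = I
```

The recorded `lyapunov` values decrease monotonically, which is the behavior the quadratic Lyapunov function is meant to certify.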

4.4 Validation on quadrupedal robot

We conduct two experiments on the quadrupedal robot Mini Cheetah (Katz et al., 2019) to evaluate the proposed MPC. Both experiments use a single rigid body model to approximate the torso motion. We apply the MIT controller (Di Carlo et al., 2018) with the proposed MPC to plan the Ground Reaction Force (GRF).

4.4.1 Robot pose tracking

In this experiment, a mixture of roll and yaw reference angles is applied for tracking. The reference signals and snapshots of robot motion are presented in Figure 9. Each controller is run three times. The details of the responses are presented in Figure 10. Because no feedforward force at the equilibrium is provided, all controllers have a steady-state error. However, the geometric controllers, i.e., the proposed and the VBL-based MPC, have a smaller steady-state error than the Euler angle-based one. As the VBL-based MPC does not conserve the scale of the error, its convergence rate is much lower than that of our controller, especially when the opposite Euler angle signal is applied in the middle of the reference profile.

FIGURE 9. Reference signal for roll and yaw angle tracking. From 1 to 11 s, the robot roll changes from 0 to -57.3° and yaw changes from 0 to 28.5°. Then the robot leans to the opposite side for 10 s.

FIGURE 10. Error convergence for roll and yaw tracking. When a new step signal is applied, our controller converges faster than the baseline methods and has a smaller steady-state error. The Euler angle-based MPC has a larger steady-state error as both roll and yaw signals are applied.

4.4.2 Robot trotting

We also apply our controller to robot locomotion. Our controller and the baseline controllers are deployed to plan the robot’s GRF given command twists. The planned GRF is then passed to the Whole Body Impulse Control (WBIC) (Kim et al., 2019) to obtain the joint torques. Unlike the conventional whole-body controller, WBIC prioritizes the GRF generation by penalizing the deviation of the GRF from the planned GRF. We increase the GRF penalty of the original WBIC by a factor of $10^{4}$, so the GRF barely deviates from the planned one.

We first apply a step signal in yaw rate. Then we add a step signal in x motion in the robot frame, and the yaw rate becomes a sinusoidal signal. The reference is presented in Figure 12 and snapshots of the experiments are shown in Figure 11. As expected, our controller and the VBL-MPC track the yaw rate better than the Euler angle-based MPC. Because every step is integrated from the current state, the orientation and position tracking errors remain small, so it is reasonable that all controllers perform well in position tracking. The result can be seen in Figure 12.

FIGURE 11. Snapshots of the experiments on reference tracking in Mini Cheetah trotting. The time corresponds to the reference signal in Figure 12.

FIGURE 12. Reference tracking for quadrupedal robot trotting. Each controller is tested three times. Because the raw responses are noisy, the results are smoothed using a moving average filter.

5 Equivariant representation learning: Augmenting geometry with learning

Learning equivariant representation of geometric data can provide efficiency and generalizability in challenging robot perception tasks. Loosely speaking, equivariance is a property for a map such that given a transformation in the input, the output changes in a predictable way determined by the input transformation. Mathematically the equivariance is represented as commutativity: a function f : X → X is equivariant to a set of transformations G, if for any g ∈ G, g ⋅ f(x) = f(g ⋅ x), ∀x ∈ X. For example, applying a translation on a 2D image and then going through a convolution layer is identical to processing the original image with a convolution layer and then shifting the output feature map. Therefore convolution layers are translation-equivariant.
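The commutativity g ⋅ f(x) = f(g ⋅ x) can be checked directly on a toy 1D signal. The sketch below is an illustration under stated assumptions: the signal is treated as circular (wrap-around boundaries) so that equivariance holds exactly, whereas with zero padding it would hold only away from the borders. The names `shift` and `conv_layer` are hypothetical.

```python
def shift(x, s):
    """Circularly translate a 1D signal by s samples (the group action g)."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def conv_layer(x, w):
    """Circular cross-correlation of signal x with kernel w (the map f)."""
    n = len(x)
    return [sum(w[j] * x[(i + j) % n] for j in range(len(w))) for i in range(n)]


x = [3, 1, 4, 1, 5, 9, 2, 6]
w = [1, -2, 1]
lhs = conv_layer(shift(x, 3), w)  # f(g . x): translate, then convolve
rhs = shift(conv_layer(x, w), 3)  # g . f(x): convolve, then translate
assert lhs == rhs
```

Both orders give the same feature map, which is the translation equivariance described above.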

An equivariant network captures the inherent symmetry of data, disentangling the information dependent on and independent of the transformations. As an analogy, this is akin to the notion of coordinate-free calculations on manifolds in modern mathematics. In a coordinate-free setup, one can distinguish the intrinsic properties of the problem from those of a particular choice of coordinates (Tu, 2011). We mainly focus on the rigid body transformations, decoupling the poses and the pose-independent information, e.g., shapes and semantics, from the geometric data by leveraging equivariant feature learning.

5.1 Point cloud registration with SO(3)-equivariant implicit shape representations

We proposed an initialization-independent rotation registration method for point clouds by leveraging an SO(3)-equivariant feature learner (Zhu et al., 2022b). An overview of the network structure is depicted in Figure 13. A point cloud is mapped to a feature space equipped with SO(3) rotations represented as 3 × 3 matrix multiplications, consistent with the input Euclidean space. Therefore, the rotational registration can be approached by solving the Orthogonal Procrustes problem in the feature space. Our method achieved accurate rotation registration regardless of the initial estimation error. It also implies that our method falls in the correspondence-free category, where the step of data association, i.e., matching corresponding points in two point clouds, is not needed.

FIGURE 13. Overview of the SO(3)-equivariant registration network (Zhu et al., 2022b). The point cloud input is of shape $R^{N \times 3}$, and the encoded feature is of shape $R^{C \times 3}$. N is the number of points, and C is the dimension of features. The occupancy field is a function $v : R^{3} \to [0,1]$ mapping any 3D coordinate $p \in R^{3}$ to an occupancy value. The rotation is estimated by aligning the features using Horn’s method (Horn et al., 1988).

The SO(3)-equivariant feature learning is realized through a backbone network called Vector Neuron (Deng et al., 2021). The key idea is to augment the scalar feature in each feature dimension to a vector in $R^{3}$. In Vector Neuron networks, the feature matrix with feature dimension C corresponding to a set of N points is $V \in R^{N \times C \times 3}$. The mapping between layers can be written as $f : R^{N \times C_{l} \times 3} \to R^{N \times C_{l + 1} \times 3}$, where l is the layer index. Following this design, the representation of SO(3) rotations in feature space is straightforward: g(R) ⋅ V ≔ VR, where g(R) denotes the rotation operation in the feature space, parameterized by the 3 × 3 rotation matrix R ∈ SO(3). Here we ignore the first dimension N of V for simplicity. The SO(3)-equivariance of the linear layer f_lin(V) = WV, where $W \in R^{C_{l + 1} \times C_{l}}$, can then be easily verified as follows.

g (R) \cdot f_{lin} (V) = W V R = f_{lin} (g (R) \cdot V) . (12)

For further discussions beyond the linear layers, see the work of Deng et al. (2021).
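Eq. 12 can also be checked numerically. The following sketch uses hypothetical feature and weight values and a rotation about the z-axis; it additionally shows, as a known contrast rather than a result from the article, that an elementwise ReLU applied to the vector features does not commute with the rotation, which is one reason layers beyond the linear one need dedicated designs.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relu(V):
    """Elementwise ReLU on the vector components (not a Vector Neuron layer)."""
    return [[max(0.0, v) for v in row] for row in V]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


V = [[1.0, 0.0, 0.0], [0.2, -0.7, 0.5]]     # C_l = 2 vector features
W = [[0.5, -1.0], [2.0, 0.3], [-0.4, 0.8]]  # maps C_l = 2 to C_{l+1} = 3
R = rot_z(math.pi / 2)

# g(R) . f_lin(V) = (W V) R  versus  f_lin(g(R) . V) = W (V R)
linear_gap = max_abs_diff(matmul(matmul(W, V), R), matmul(W, matmul(V, R)))
# The same comparison for an elementwise ReLU in place of the linear layer.
relu_gap = max_abs_diff(matmul(relu(V), R), relu(matmul(V, R)))
```

Here `linear_gap` is zero up to floating-point rounding, because the weights act on the channel dimension from the left and the rotation acts on the vector dimension from the right, while `relu_gap` is of order one.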

We design an encoder-decoder structure to learn the features. We also improve the robustness to noise in sampled points by decoding an implicit shape representation following the Occupancy network (Mescheder et al., 2019). Our method is tested on the synthetic object-wise data set ModelNet40 (Wu et al., 2015), shown in Table 5. For further experiments using the real-world indoor RGB-D data set 7Scenes (Shotton et al., 2013), see the work of Zhu et al. (2022b).

TABLE 5. Rotational registration error given rotated copies of point clouds. Tested using ModelNet40 (Wu et al., 2015) official test set. The best are shown in bold. The second best are shown in italic. All values are in degrees.

5.2 Efficient SE(3)-equivariant representations learning

Our recent work (Zhu et al., 2022a) extends the SO(3)-equivariance to SE(3)-equivariance to better deal with arbitrary rigid body transformations of 3D point-cloud data. We use Convolutional Neural Networks (CNNs), which inherit translational equivariance. Existing work on equivariant convolutional networks falls mainly into two types. The first is regular G-CNNs (G for group) (Cohen and Welling, 2016), which lift the domain of the feature function space from the input Euclidean space to the group of transformations of interest. The second is steerable G-CNNs (Thomas et al., 2018), which leave the domain of the feature function space untouched but design the codomain to be steerable with the stabilizer subgroup. More detailed introductions can be found in the work of Cohen et al. (2018). The former strategy consumes much more memory than a conventional CNN, while the latter usually results in complex designs and restrictions on the kernel and convolution structure, both limiting broader applications in practice. We propose a new strategy that lifts the domain of the feature space to a proper subgroup of SE(3) and applies a trivial steering representation on the subgroup, which addresses both problems mentioned above. Our proposed point-cloud convolution network learns expressive SE(3)-equivariant features with a much smaller footprint than existing methods. See Table 6 for a comparison between our method and a baseline regular G-CNN method, EPN (Chen et al., 2021).

TABLE 6. Experiment result of pose estimation on ModelNet40 data set (Wu et al., 2015) on the plane category. Two numbers are shown for GPU memory consumption and running speed for training/inference separately, given the same input size for two methods. Notice that the numbers are not directly comparable to Table 5 due to different experiment settings.

To be more specific, our convolution structure is built upon KPConv (Thomas et al., 2019). We choose SO(2) as the stabilizer and work with feature maps defined on the domain $\tilde{X} = SE (3) / S O (2)$, which is homeomorphic to the Cartesian product $S^{2} \times R^{3}$. We extend the KPConv from $R^{3}$ to $S^{2} \times R^{3}$. We discretize SO(3) into the icosahedral rotation group $I$ with 60 elements, following EPN by Chen et al. (2021), containing all rotational symmetries of an icosahedron. SO(2) is discretized as the group of multiples of 72° planar rotations, which is a cyclic group of order 5. Then we obtain a discretization of the sphere $\bar{S^{2}} = \bar{S O (3)} / \bar{S O (2)}$ of size 12 corresponding to the vertices of an icosahedron, where $\bar{\cdot}$ (a top bar) denotes the discretized space. As a result, the domain of feature maps in our network is $\bar{S^{2}} \times R^{3}$. It turns out that we can design an SE(3)-equivariant convolution in this space in a simple and efficient form while maintaining expressiveness. The full $\bar{S O (3)}$ information can be recovered from the $\bar{S^{2}}$ feature maps through a permutation layer. An overview of the network structure is shown in Figure 14.
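The counting behind this discretization (60 rotations, a stabilizer of 5 planar rotations, and 60 / 5 = 12 points on the sphere) can be illustrated with a small geometric check. The sketch below is only an illustration with hypothetical function names, using the standard golden-ratio coordinates of an icosahedron: it builds the 12 vertices and verifies that a 72° rotation about the axis through one vertex maps the vertex set onto itself while fixing that vertex and its antipode.

```python
import math


def icosahedron_vertices():
    """The 12 unit vertices (0, +-1, +-phi) and their cyclic permutations."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0
    norm = math.sqrt(1.0 + phi * phi)
    verts = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            for v in ((0.0, a, b), (b, 0.0, a), (a, b, 0.0)):
                verts.append(tuple(c / norm for c in v))
    return verts


def rotate(v, axis, angle):
    """Rodrigues' rotation of v about a unit axis by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, v))
    cr = (axis[1] * v[2] - axis[2] * v[1],
          axis[2] * v[0] - axis[0] * v[2],
          axis[0] * v[1] - axis[1] * v[0])
    return tuple(v[i] * c + cr[i] * s + axis[i] * dot * (1.0 - c)
                 for i in range(3))


def nearest(v, verts):
    """Index of, and distance to, the vertex closest to v."""
    dists = [math.dist(v, u) for u in verts]
    i = min(range(len(verts)), key=dists.__getitem__)
    return i, dists[i]


verts = icosahedron_vertices()
axis = verts[0]  # the stabilizer rotates about this vertex direction
images = [nearest(rotate(v, axis, math.radians(72.0)), verts) for v in verts]
perm = [i for i, _ in images]        # how the rotation permutes the vertices
max_err = max(d for _, d in images)  # distance of each image to a true vertex
fixed = [i for i, j in enumerate(perm) if i == j]
```

The rotation acts on the 12 sphere samples as a permutation with two fixed points, which is the kind of index shuffling a permutation layer can exploit.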

FIGURE 14. A high-level illustration of our efficient SE(3)-equivariant network. We lift the convolution to $S^{2} \times R^{3}$ , which is a rare choice for SE(3)-equivariant feature learning. The different colors represent channels in S².

5.3 Place recognition via SE(3)-invariant representation

Place recognition, also known as loop closure detection, enables a robot to determine if it has seen a place before and provides loop closure candidates for SLAM algorithms to eliminate accumulated error. Widely used sensors include RGB, stereo, thermal, event-triggered, and RGB-D cameras, which produce 2D structured images or 3D unstructured points (Barros et al., 2021). For general tasks with 2D images, place recognition suffers less from orientation changes because the training and testing images differ only trivially in roll angle during data collection. Yet, the roll angles deviate significantly in challenging scenarios such as surgery (Song et al., 2021), underwater robots (Li et al., 2015), or special camera setups. Orientation differences are widespread and pose great difficulty to place recognition with 3D unstructured point clouds. Therefore, place recognition methods can benefit from a representation that is robust to arbitrary transformations of 3D point cloud data.

The image-based localization can be categorized as constructing hand-crafted rotation-invariant descriptors in 2D (Cummins and Newman, 2008; Gálvez-López and Tardos, 2012), learning the global descriptor (Kendall et al., 2015; Sünderhauf et al., 2015; Kim et al., 2017) or a combination of both (Tian et al., 2020; Song et al., 2022). Although learning-based methods achieve better accuracy and robustness, Lowry et al. (2015) suggested that place-recognition scenarios with large orientation differences still rely on hand-crafted descriptors which are designed for robust feature matching. This is especially true for 3D point clouds suffering more from orientation differences. Existing point cloud-based place recognition methods improve the transformation robustness by extracting 3D hand-crafted rotation-invariant descriptors (Kim and Kim, 2018; Wang et al., 2019; Yin et al., 2019; Kim et al., 2021) and randomly rotating them during training (Uy and Lee, 2018; Cattaneo et al., 2021). However, hand-crafted features can lose structural information and these methods do not take translation into consideration.

To avoid an exhaustive data augmentation with all possible transformations and improve generalizability, we propose an SE(3)-invariant place recognition representation network for the 3D point cloud. An overview of the network structure is shown in Figure 15. We use EPN (Chen et al., 2021) to extract SE(3)-invariant local features. NetVLAD (Arandjelovic et al., 2016) is applied to aggregate local features and construct SE(3)-invariant global descriptors.
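One common way to obtain invariant features from equivariant ones is to pool over the group dimension. The toy sketch below assumes, as is typical for regular group convolutions over a discretized rotation group, that transforming the input only permutes the feature entries along the group axis; the function names, the feature values, and the permutation are hypothetical and do not reproduce the EPN implementation.

```python
def permute_group_axis(features, perm):
    """Group action on features indexed by group elements: slot i moves to perm[i]."""
    out = [None] * len(features)
    for i, f in enumerate(features):
        out[perm[i]] = f
    return out


def group_max_pool(features):
    """Max over the group axis, taken independently for each channel."""
    return [max(channel) for channel in zip(*features)]


# Five group elements, two channels per element.
features = [[0.1, 2.0], [0.7, -1.0], [0.3, 0.5], [0.9, 0.0], [0.2, 1.5]]
perm = [2, 0, 4, 1, 3]  # hypothetical action of one transformation
pooled = group_max_pool(features)
pooled_after = group_max_pool(permute_group_axis(features, perm))
assert pooled == pooled_after
```

Because the pooled value does not depend on the order of the entries along the group axis, the descriptor is unchanged by the transformation, and any aggregation applied afterwards inherits this invariance.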

FIGURE 15. Overview of the SE(3)-invariant place recognition network. In this network, SE(3)-invariant features are learned from input point clouds. The local feature extraction consists of SE(3) point convolution, SE(3) group convolution, batch normalization followed by leaky ReLU activation, and a pooling layer that makes the equivariant features invariant. Global descriptors are computed by aggregating local features using NetVLAD. The output descriptors can be used for place recognition tasks.

We evaluate the proposed place representation using the Oxford RobotCar (Maddern et al., 2017) benchmark created by Uy and Lee (2018). The precision-recall curves of the proposed method and other baseline methods are shown in Figure 16. The proposed network EPN-NetVLAD outperforms the baselines. To show the generalizability of the proposed method, we experiment with three in-house data sets of a university sector (U.S.), a residential area (R.A.) and a business district (B.D.) (Uy and Lee, 2018). The result is shown in Table 7, and our method performs better on all the data sets that we did not train on.

FIGURE 16. Experimental results of proposed method (EPN-NetVLAD, blue line), state-of-the-art approaches PointNetVLAD (Uy and Lee, 2018), Scan Context (Kim and Kim, 2018), and M2DP (He et al., 2016) on Oxford RobotCar benchmark.

TABLE 7. Experiment result showing the average recall (%) at top 1% for each of the models. Both methods are only trained on Oxford (Maddern et al., 2017) and tested on other different data sets (Uy and Lee, 2018).

6 Closing remarks and future opportunities

Autonomy via computational intelligence is a multifaceted research domain that nicely integrates mathematics, computer science, and engineering and can have enormous impacts on our future and improve our quality of life. Robotics plays a unique role by connecting the real world to AI, i.e., embodied AI. Many challenges in robotics are natural problems in AI because they show what it takes to develop an autonomous system capable of operating in the wild. We reviewed some of the recent efforts in symmetry-preserving robot perception and control methods. In particular, by symmetry, we refer to invariance or equivariance properties under a group action enabled by Lie groups or their discrete subgroups.

The RKHS registration framework presented in Section 2 provides a unified model for registration that jointly integrates geometric and semantic measurements and does not require explicit data association. This framework is intimately connected with deep learning models. The inner product of the functions viewed as cross-correlation can be modeled as a network layer to combine the power of functional modeling with feature and kernel learning. Moreover, since the framework is equivariant, it can be directly combined with equivariant feature learners, e.g., via deep kernel learning (Wilson et al., 2016). An important open problem is the relationship among our framework, discrete-continuous smoothing and mapping (Doherty et al., 2022), dynamic scene graphs (Rosinol et al., 2021), and learning-aided smoothing and mapping (Huang et al., 2021) for robot perception and navigation. These are attractive research directions that we will explore in the future.

The learning-aided state estimation framework, presented in Section 3, can be extended to multi-task networks (Liu et al., 2019; Maninis et al., 2019; Hu and Singh, 2021) for tasks such as slip detection and friction coefficient estimation (Focchi et al., 2018; Romeo and Zollo, 2020), terrain classification (Hoepflinger et al., 2010; Walas et al., 2016; Wu et al., 2016; Ahmadi et al., 2021), covariance estimation (Brossard et al., 2020), sensor calibration and integration (Liu et al., 2020; Brossard et al., 2022; Ji et al., 2022), and motion mode detection (Brossard et al., 2019). A high-frequency implementation of these works on robots can significantly improve their capabilities for navigating challenging environments. Moreover, the work of Hwangbo et al. (2019) designs a learning-based controller using a policy network that maps kinematic observations and the joint state history to the joint position targets. Then an actuator network takes the joint velocity history and joint position error history to learn the joint torque. The success of Hwangbo et al. (2019) suggests that our multimodal approach to learning can improve the controller performance while further optimization of the contact estimation network size is possible.

In Section 4, we developed a new error-state MPC approach on connected matrix Lie groups for robot control. By exploiting the existing symmetry of the pose control problem on Lie groups, we showed that the linearized tracking error dynamics and equations of motion in the Lie algebra are globally valid and evolve independently of the system trajectory. In addition, we formulated a convex MPC program for solving the problem efficiently using QP solvers. A Lyapunov function expressed in Lie algebra is introduced to verify the exponential stability of the proposed controller. The experimental results confirm that the proposed approach provides faster convergence when rotation and position are controlled simultaneously. Future work will implement the trajectory optimization using this geometric control framework proposed by Teng et al. (2022b) for robot control. Another interesting research direction is to incorporate learning into this framework (Shi et al., 2019; Li et al., 2022; Ma et al., 2022; O’Connell et al., 2022; Power and Berenson, 2022; Rodriguez et al., 2022). In addition, the IIG algorithm (Ghaffari Jadidi et al., 2019), combined with an MPC (Teng et al., 2021a), can provide an integrated kinodynamic planner that takes the robot stability, control constraints, and the value of information from sensory data into account. Gan et al. (2022) show that the value of information can be learned from multimodal sensory input via learning from demonstrations and self-supervised trajectory ranking to deal with sub-optimal demonstrations.

In Section 5, we showed how equivariant neural networks can serve as powerful feature learners to improve data efficiency and generalizability across different tasks. In particular, we provided results on registration and place recognition tasks. We argue that our efficient SE(3)-equivariant network (Zhu et al., 2022a) can be a reliable feature learner for a variety of robot perception and control problems, including those mentioned in this article. Furthermore, this symmetry-preserving representation can be an answer to the long-standing question of a “good” representation for robot mapping.

In addition to the point cloud-based SE(3)-invariant place recognition, it is of great interest to investigate the image-based version in challenging scenarios ranging from unstructured outdoors to endoscopy and colonoscopy (Song et al., 2021). Cohen and Welling (2016), Cohen et al. (2018) provide valuable insights on equipping the existing learning-based algorithms with group-invariant feature extraction ability. Traditional hand-crafted descriptors can be substituted with learnable deep SE(3)-invariant image descriptors. More importantly, we believe a natural future direction for robotics is towards developing structure-preserving and correct-by-construction computational models, such as our SE(3)-equivariant network, to enable efficient and generalizable multimodal learning.

Finally, this article aims to serve as an invitation to developing algorithms that respect the geometry of problems in robotics, preserve structures such as symmetry, and use modern computation methods such as deep learning. We presented methods ranging from purely geometric to end-to-end learning. As such, the central message of this paper is not about outperforming a particular framework, but it lies in the combined power of geometry and learning and the possibility of modeling traditional geometric problems using geometric networks such as equivariant deep networks. The latter will lead to explainable large-scale computational models for robotics and autonomous systems.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

Author contributions

MG developed the main narrative and led manuscript preparation. RZ and TianL developed the RKHS registration framework in Section 2. CL developed the SE(3)-invariant place recognition work in Section 5.3. T-YL and TinL developed the learning-aided InEKF in Section 3. ST developed the MPC on Lie group work in Section 4. JS developed the SE(3)-invariant place recognition work in Section 5.3 and helped with the organization of the paper.

Funding

Toyota Research Institute provided funds to support this work. Funding for M. Ghaffari was in part provided by NSF Award No. 2118818. This work was also supported by MIT Biomimetic Robotics Lab and NAVER LABS.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

¹https://github.com/UMich-CURLY/deep-contact-estimator; https://github.com/UMich-CURLY/cheetah_inekf_realtime; https://github.com/UMich-CURLY/husky_inekf.

References

Agrawal, A., Chen, S., Rai, A., and Sreenath, K. (2021). Vision-aided dynamic quadrupedal locomotion on discrete terrain using motion libraries. arXiv preprint arXiv:2110.00891 13

Ahmadi, A., Nygaard, T., Kottege, N., Howard, D., and Hudson, N. (2021). Semi-supervised gated recurrent neural networks for robotic terrain classification. IEEE Robot. Autom. Lett. 6, 1848–1855. doi:10.1109/lra.2021.3060437