
Edited by: Di Wu, Southwest University, China

Reviewed by: Jinwei Xing, Google, United States

Adam Safron, Johns Hopkins University, United States

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Loop closure detection is an important module for simultaneous localization and mapping (SLAM). Correct detection of loops can reduce the cumulative drift in positioning. Because traditional detection methods rely on handcrafted features, false positive detections can occur when the environment changes, resulting in incorrect estimates and an inability to obtain accurate maps. In this research paper, a loop closure detection method based on a variational autoencoder (VAE) is proposed. The VAE serves as a feature extractor whose neural network replaces the handcrafted features used in traditional methods, extracting a low-dimensional vector as the representation of the image. At the same time, an attention mechanism is added to the network, and additional constraints improve the loss function for better image representation. In the back-end feature matching process, geometric checking filters out incorrect matches to address the false positive problem. Finally, numerical experiments demonstrate that the proposed method achieves a better precision-recall curve than the traditional bag-of-words model and other deep learning methods and is highly robust to environmental changes. In addition, experiments on datasets from three different scenarios demonstrate that the method can be applied in real-world scenarios with good performance.

Loop closure detection is the process of identifying places that a robot has visited before, which can help the robot relocalize when it loses its trajectory due to motion blur, forming a topologically consistent trajectory map (Gálvez-López and Tardós,

Traditional loop closure detection methods are generally based on appearance (Cummins and Newman,

However, appearance-based methods usually depend on traditional handcrafted features, such as SIFT (Lowe,

Recently, given the rapid development of deep learning in computer vision (Bengio et al.,

At the same time, the attention mechanism can weigh key information and ignore other unnecessary information to process information with higher accuracy and speed. Hou et al. (

Traditional methods are prone to false detections when facing similar environments or large changes in illumination, which leads to serious errors in map estimation. In this research paper, we propose a loop closure detection method based on a variational autoencoder to solve the loop closure detection problem in visual SLAM. The method uses intermediate-layer deep features instead of traditional handcrafted features and compares the current image with previous keyframes to detect loops. The method incorporates an attention mechanism in the neural network to obtain more useful features, improves the loss function of the network, and eliminates erroneous loops through geometric consistency checks.

Function diagram of loop detection based on a variational autoencoder.

The proposed network structure is shown in

Feature extraction network structure of SENet-VAE.

The proposed network is based on VAE. The network input is an RGB image with a resolution of 192 × 256. The encoder maps the image to a normal distribution through the latent variables μ and σ, and then the information of the potential variables is decoded by the decoder. It describes the observation of the potential space in a probabilistic way. In addition, this method adds an attentional mechanism to the VAE to increase the weight of the effective features to obtain better results.
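As an illustration of the probabilistic encoding described above, the following minimal sketch shows how a latent vector is drawn from the distribution defined by μ and σ via the standard reparameterization trick. The 128-dimensional latent size is a hypothetical choice for illustration; the paper does not restate the actual latent dimensionality here.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sampling step
    # differentiable with respect to mu and sigma during training
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy encoder output for one image: a 128-dimensional latent Gaussian
# (dimensionality chosen here only for illustration)
mu = np.zeros(128)
log_var = np.zeros(128)
z = reparameterize(mu, log_var)
```

The decoder would then reconstruct the image from z, so that the latent space describes the observation probabilistically rather than as a single point.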

Inspired by Sikka et al. (

Assume that the network input data

where

It is hoped that the generative model will learn a model

This model is controlled by the parameter θ. Therefore, an appropriate goal is to maximize the marginal likelihood of the observed data

For the approximate posterior q_ϕ(z|x), the objective is to maximize the expected reconstruction likelihood E_{q_ϕ(z|x)}[log p_θ(x|z)] subject to the constraint that the divergence from the prior stays small, D_KL(q_ϕ(z|x) ‖ p(z)) < ε.

Rewrite the above equation as the Lagrange equation under the Karush-Kuhn-Tucker (KKT) condition:

Since β, ε ≥ 0, according to the complementary slackness KKT condition, Equation (6) can be rewritten to obtain the β-VAE objective in Equation (7):

As the value of β becomes larger, q_ϕ(z|x) is constrained more strongly toward the prior p(z), which encourages a more disentangled latent representation at some cost in reconstruction quality.
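The β-weighted objective can be sketched as follows. This is a generic β-VAE loss using the closed-form KL divergence for a diagonal Gaussian, with β = 250 taken from the hyperparameter table; it illustrates the structure of the objective rather than reproducing the paper's exact implementation.

```python
import numpy as np

def kl_divergence(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def beta_vae_loss(recon_loss, mu, log_var, beta=250.0):
    # beta > 1 strengthens the prior-matching constraint
    # (beta = 250 as listed in the hyperparameter table)
    return recon_loss + beta * kl_divergence(mu, log_var)
```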

After sampling ε from the standard normal distribution, the latent variable is obtained through the reparameterization z = μ + σ ⊙ ε. The reconstruction loss L_r is defined as follows in Equation (8), and the cross-entropy loss function L_s, which accounts for class bias, as follows in Equation (9):

Here y_i and p_i represent, respectively, the label of the input image and the probability of the positive class output by the network after the softmax function. M represents the number of categories, y_ic is the indicator function (0 or 1), and p_ic is the probability that observation sample i belongs to category c, obtained from the softmax function.
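A minimal sketch of the cross-entropy term L_s for one sample, following the definition above; the small epsilon guard against log(0) is an implementation detail added here, not from the paper.

```python
import numpy as np

def cross_entropy(y_onehot, p):
    # L_s = -sum_c y_ic * log(p_ic), where y_onehot is the indicator
    # vector and p the softmax output over M categories
    return -float(np.sum(y_onehot * np.log(p + 1e-12)))
```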

In the encoder part, the weights of the two encoders are shared in the form of a triplet network. A sample, called the anchor, is selected from the dataset. Samples of the same type as the anchor are selected and subjected to distortion or darkening operations, imitating the movement of the camera to a certain extent; this type of image is called a positive image. In the current training batch, a sample different from the anchor is called a negative image. The anchor, positive image, and negative image constitute a triplet. The global image descriptor is taken from the latent variable μ. With the descriptors of a baseline (anchor) image d_a, a positive image d_p, and a negative image d_n, the triplet loss function is defined as follows in Equation (10):

where

This loss function L_t forces the network to learn to use the margin m to distinguish positive from negative images. The loss is minimized by reducing the cosine similarity between the anchor image and the negative image while increasing the similarity between the anchor image and the positive image.
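A hinge-style sketch of a triplet loss built on cosine similarity, as described above. The exact functional form in Equation (10) may differ, so this illustrates the principle rather than reproducing the paper's formula.

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two descriptor vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(d_a, d_p, d_n, m=0.5):
    # Hinge: anchor-positive similarity must exceed anchor-negative
    # similarity by at least the margin m (m = 0.5 per the table)
    return max(0.0, cos_sim(d_a, d_n) - cos_sim(d_a, d_p) + m)
```

With a well-separated triplet the loss is zero, so gradients concentrate on triplets the network still confuses.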

Finally, the overall objective function is defined as follows in Equation (11):

where λ_{i} is the weight factor used to balance the impact of each term.

The attention mechanism squeeze-and-excitation networks (SENet) (Hu et al.,

The SENet module in this article takes the input from the previous layer and applies the transformation F_tr: X ∈ ℝ^{H′×W′×C′} → U ∈ ℝ^{H×W×C} before transmitting it to the next layer. Then, the output can be written as follows in Equation (12):

Here F_tr is the transformation operator, V = [v_1, v_2, …, v_C] represents the set of filters, and v_c represents the parameters of the c-th filter.

The goal is to ensure that the network is sensitive to its informative features so that they can be exploited subsequently and suppress useless features. Therefore, before the response enters the next transformation, it is divided into three parts, namely, squeeze, excitation, and scale, to recalibrate the filter response.

First, the squeeze operation encodes the entire spatial feature into a global feature by using global average pooling. Specifically, each two-dimensional feature channel is turned into a real number, which has a global receptive field to some extent, and the output dimension matches the number of input feature channels. It represents the global distribution of the response on the feature channel and enables the layers to be close to the input to obtain the global receptive field.

The statistic z ∈ ℝ^C is generated by reducing the feature map U through its spatial dimensions H × W.

The second part is the excitation operation, which fully captures the channel dependencies by utilizing the information gathered in the squeeze operation. This part consists of two fully connected layers: the first is a dimensionality-reduction layer with parameters W_1 and reduction ratio r, and the second restores the channel dimension with parameters W_2.

Finally, the scale operation multiplies the learned activation value of each channel (a sigmoid activation with value between 0 and 1) by the original features of U, channel by channel, to produce the recalibrated output.
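The squeeze, excitation, and scale steps can be sketched in a few lines of NumPy. The weight shapes and the reduction ratio r follow generic SENet conventions and are assumptions here, not values stated in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    # u:  (H, W, C) feature map from the previous transformation
    # w1: (C, C//r) dimensionality-reduction weights; w2: (C//r, C) expansion
    z = u.mean(axis=(0, 1))                    # squeeze: global average pool -> (C,)
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)  # excitation: FC-ReLU-FC-sigmoid -> (C,)
    return u * s                               # scale: per-channel reweighting
```

Because s lies in (0, 1) per channel, informative channels are preserved while less useful ones are suppressed, as described above.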

The construction of the squeeze-and-excitation block in the network is shown in

The squeeze-and-excitation block of SENet-VAE.

In this section, we use the neural network described above to extract the image features and use it to perform back-end image feature matching to achieve loop closure detection. During the image-matching process, key point mismatches are eliminated by geometric checking, which improves the accuracy of detection.

The global descriptor for the image is taken from the output of the convolutional layer where the latent variable μ is located in the sampling layer of the network. After the encoder, the latent variable ^{(I)} denoting the corresponding output for a given input image, which is shown in Equation (16):

For the extraction of image key points, the method proposed by Garg et al. (

In order to detect loop closures, a database of historical global image descriptors is first built. When a query image arrives, its global image descriptor is used to perform a K-nearest-neighbor search in the established database, and images with relatively high similarity scores are selected to form a candidate image set. Then, K candidates are screened in the candidate set using the key points described before, and the random sample consensus (RANSAC) algorithm is used to filter out false matches. The RANSAC algorithm finds an optimal homography matrix H, of size 3 × 3, from at least four pairs of matched feature points; the optimal homography matrix H should satisfy the maximum number of matching feature points. Since the matrix is usually normalized by setting h_{33} = 1, the homography matrix, which is expressed by Equation (17), has only eight unknown parameters:
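A minimal sketch of the candidate-retrieval step. Cosine similarity is used here as the ranking score; the paper does not specify the exact similarity metric, so this choice is an assumption.

```python
import numpy as np

def knn_candidates(query, database, k=3):
    # K-nearest-neighbor search over stored global descriptors,
    # ranked by cosine similarity to the query descriptor
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = db @ q
    return np.argsort(-sims)[:k]
```

The returned indices form the candidate set that is subsequently verified with key-point matching and RANSAC.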

where (x, y) and (x′, y′) are the coordinates of a pair of matched feature points in the two images.

Then, the homography matrix is used to test the remaining matching points under this model: for all the data, the number of points satisfying the model and their projection errors are computed through the cost function. If the model is optimal, the corresponding cost function attains its minimum value. The cost function is calculated as shown in Equation (18):
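The inlier test behind the cost function can be sketched as follows. The pixel threshold of 3.0 is an illustrative choice, not a value from the paper.

```python
import numpy as np

def project(H, pts):
    # Apply a 3x3 homography (h33 normalized to 1) to an (N, 2) point array
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def inlier_cost(H, src, dst, thresh=3.0):
    # Reprojection error per match; RANSAC keeps the hypothesis H
    # with the most inliers / lowest accumulated cost
    err = np.linalg.norm(project(H, src) - dst, axis=1)
    inliers = err < thresh
    return int(inliers.sum()), float(err[inliers].sum())
```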

After filtering out invalid matches, the matched key points can be used to calculate the effective homography matrix as the final matching result. An example of final matches after performing RANSAC can be seen in

An example of final matches after performing RANSAC.

In this section, the feasibility and performance of the proposed method will be tested on the Campus Loop dataset. The hyperparameters used in the experiments are shown in

List of hyperparameters.

Learning rate | η | 10^{−3} |

Input image size | | 192 × 256 |

Batch size | N_T | 12 |

Weight factor | λ_{0} | 10^{−4} |

Weight factor | λ_{1} | 10^{−4} |

Weight factor | λ_{2} | 1.0 |

Weight factor | λ_{3} | 1.0 |

Beta parameter | β | 250 |

Margin parameter | m | 0.5 |

The precision (accuracy rate) describes the probability that the loops extracted by the algorithm are real loops, and the recall rate is the probability that a real loop is correctly detected. The functions are as follows in Equations (19) and (20):
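Equations (19) and (20) reduce to the following one-liner over the counts defined in the classification table:

```python
def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP); recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)
```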

The precision and recall rates are used, respectively, as the vertical and horizontal axes of the precision-recall curve. There are four types of results for loop closure detection, as shown in

Classification of loop closure detection results.

| Actual loop | No actual loop |
Detected as loop | True positive (TP) | False positive (FP) |

Not detected as loop | False negative (FN) | True negative (TN) |

The Campus Loop dataset (Merrill and Huang,

On this dataset, the proposed method is compared with the following methods: (1) CNN—Zhang et al. (

Comparison of precision-recall curves.

Results of precision-recall curves (the closer the AUC is to 1.0, the higher the accuracy of the detection method is; higher maximum recall means more false detections can be avoided).

In order to further test the effectiveness of the proposed method in a practical environment, we selected sequence images of three complex scenes (sequence numbers 00, 05, and 06) in the KITTI odometry dataset (Geiger et al.,

List of dataset parameters.

| Sequence 00 | Sequence 05 | Sequence 06 |
Image size | 1,241 × 376 | 1,241 × 376 | 1,241 × 376 |

Number of images | 4,540 | 2,760 | 1,100 |

Trajectory length (m) | 3,724.187 | 2,205.576 | 1,232.876 |

In this experiment, the image resolution is adjusted to 192 × 256. The final experimental results are presented in

Results of loop closure detection using KITTI-odometry [sequence 00]:

Results of loop closure detection using KITTI-odometry [sequence 05]:

Results of loop closure detection using KITTI-odometry [sequence 06]:

For performance evaluation, the number of occurrences of loop closure detection and the accuracy of correctly matching images for each sequence were counted. The test results are shown in

Loop closure detection results under different environments (KITTI dataset of sequence numbers 00, 05, and 06).

As mentioned before, this research paper proposes to incorporate an attention mechanism in the network to filter image features based on feature relevance to improve the performance of the network. This section analyzes the improvement effect of the network from a quantitative point of view and

Ablation experiments on different modules of the network.

Ours | 1,575.23 | 0.7723 |

Ours | 42.73 | 0.8042 |

Ours |

The bold values represent the final values obtained by the model after improving the loss function and adding the attention mechanism.

In this research paper, a loop closure detection method based on a variational autoencoder is presented, which uses a neural network to learn the representation of the image from the original image, replacing traditional handcrafted features. We incorporate an attention mechanism in the coding layer of the neural network, which automatically obtains the feature weight of each feature channel and then improves the network's ability to extract image features by enhancing the features that are useful for the current task and suppressing the useless ones according to this weight. At the same time, the loss function of the variational autoencoder (VAE) is improved: by adding a hyperparameter β to the KL divergence term of the loss function, the VAE shows better disentanglement ability, and the performance and convergence of the network improve. Experiments on the Campus Loop dataset show that the proposed method can maintain high precision at a high recall rate. In addition, experiments on the datasets for three different scenarios indicate that the method is robust to environmental changes and can maintain high accuracy even in the presence of viewing-angle changes and object occlusions. Our future work will consider lightweight design and modification of the method to adapt it to practical high-speed scenarios.

The datasets presented in this article are not readily available because the datasets used or analyzed during the current study are available from the corresponding author on reasonable request. Requests to access the datasets should be directed to SS,

SS: Conceptualization, Funding acquisition, Writing—review & editing. FY: Conceptualization, Methodology, Validation, Writing—original draft, Writing—review & editing. XJ: Validation, Writing—review & editing. JZ: Software, Writing—review & editing. WC: Validation, Writing—review & editing. XF: Visualization, Writing—review & editing.

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was funded by the National Natural Science Foundation of China, grant numbers 62103245 and 62073199, the Natural Science Foundation of Shandong Province for Innovation and Development Joint Funds, grant number ZR2023MF067, the Natural Science Foundation of Shandong Province, grant number ZR2023MF067, and the Shandong Province Science and Technology Small and Medium-Sized Enterprise Innovation Capability Enhancement Project, grant number 2023TSGC0897.

XJ was employed by the company Yantai Tulan Electronic Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.