This study proposes HcVGH, a method that learns spatio-temporal categories by segmenting first-person-view (FPV) videos captured by mobile robots. Humans perceive continuous high-dimensional information by dividing and categorizing it into meaningful segments, and this unsupervised segmentation capability is considered important for mobile robots learning spatial knowledge. HcVGH combines a convolutional variational autoencoder (cVAE) with HVGH, an earlier method based on the hierarchical Dirichlet process-variational autoencoder-Gaussian process-hidden semi-Markov model, which integrates deep generative and statistical models. In the experiment, FPV videos of an agent moving in a simulated maze environment were used; because FPV videos contain spatial information, spatial knowledge can be learned by segmenting them. Using this FPV-video dataset, the segmentation performance of the proposed model was compared with that of previous models: HVGH and the hierarchical recurrent state-space model. HcVGH achieved an average segmentation F-measure of 0.77, outperforming both baseline methods. Furthermore, the experimental results showed that parameters representing the movability of the maze environment can be learned.
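To make the pipeline more concrete, the sketch below encodes FPV frames with a convolutional VAE and segments the resulting latent trajectory. It is a minimal illustration only: the frame size, latent dimension, and the simple change-point rule standing in for the HDP-VAE-GP-HSMM segmenter are assumptions, not details of HcVGH.

```python
# Illustrative sketch of an HcVGH-style pipeline: a convolutional VAE
# compresses FPV frames into low-dimensional latents, and the latent
# sequence is then segmented. The HDP-VAE-GP-HSMM segmenter is replaced
# by a simple change-point heuristic; the 64x64 RGB frame size and the
# latent dimension are assumptions, not values from the paper.
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.conv(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

def segment_latents(z: torch.Tensor, threshold: float = 1.0):
    """Toy stand-in for the GP-HSMM segmenter: cut the sequence wherever
    consecutive latents jump by more than `threshold` (Euclidean distance)."""
    jumps = (z[1:] - z[:-1]).norm(dim=1)
    boundaries = (jumps > threshold).nonzero(as_tuple=True)[0] + 1
    return [0] + boundaries.tolist() + [len(z)]

# Usage: encode a synthetic sequence of FPV frames and segment the latent trajectory.
frames = torch.rand(100, 3, 64, 64)
encoder = ConvVAEEncoder(latent_dim=8)
with torch.no_grad():
    z, _, _ = encoder(frames)
print(segment_latents(z))
```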
Autonomous vehicles require precise and reliable self-localization to cope with dynamic environments. The field of visual place recognition (VPR) aims to address this challenge by relying on the visual modality to recognize a place despite changes in the appearance of the perceived visual scene. In this paper, we propose to tackle the VPR problem following a neuro-cybernetic approach. To this end, the Log-Polar Max-Pi (LPMP) model is introduced. This bio-inspired neural network builds a neural representation of the environment via unsupervised one-shot learning. Inspired by the spatial cognition of mammals, the LPMP model processes visual information through two distinct pathways: a “what” pathway that extracts and learns the local visual signatures (landmarks) of a visual scene, and a “where” pathway that computes their azimuth. These two pieces of information are then merged to build a visuospatial code that is characteristic of the place where the visual scene was perceived. Three main contributions are presented in this article: 1) the LPMP model is studied and compared with NetVLAD and CoHog, two state-of-the-art VPR models; 2) a test benchmark for evaluating VPR models according to the type of environment traveled is proposed, based on the Oxford car dataset; and 3) the impact of a novel detector that produces an uneven paving of the environment is evaluated in terms of localization performance and compared with a regular paving. Our experiments show that the LPMP model achieves localization performance comparable to or better than that of NetVLAD and CoHog.
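The following sketch illustrates the what/where fusion idea in a deliberately simplified form: landmark signatures are accumulated into azimuth bins to form a visuospatial code, which is stored by one-shot learning and recalled by nearest-neighbour matching. The signature dimension, azimuth resolution, and matching threshold are assumptions, not details of the LPMP model.

```python
# Simplified what/where fusion for place recognition, in the spirit of LPMP:
# each landmark signature ("what") is combined with its azimuth ("where")
# into a visuospatial code; places are learned in one shot and recalled by
# cosine similarity. All numerical choices below are assumptions.
import numpy as np

N_AZIMUTH_BINS = 36   # 10-degree azimuth resolution (assumed)
SIGNATURE_DIM = 32    # size of each local visual signature (assumed)

def visuospatial_code(signatures: np.ndarray, azimuths_deg: np.ndarray) -> np.ndarray:
    """Accumulate each landmark signature into the azimuth bin where it was seen."""
    code = np.zeros((N_AZIMUTH_BINS, SIGNATURE_DIM))
    bins = (azimuths_deg % 360 / 360 * N_AZIMUTH_BINS).astype(int)
    for b, sig in zip(bins, signatures):
        code[b] += sig
    code = code.ravel()
    norm = np.linalg.norm(code)
    return code / norm if norm > 0 else code

class OneShotPlaceMemory:
    """Stores one code per place (one-shot learning) and recalls the best match."""
    def __init__(self):
        self.codes, self.labels = [], []

    def learn(self, code: np.ndarray, label: str) -> None:
        self.codes.append(code)
        self.labels.append(label)

    def recognize(self, code: np.ndarray, threshold: float = 0.8):
        if not self.codes:
            return None
        sims = np.stack(self.codes) @ code          # cosine similarity (codes are unit norm)
        best = int(np.argmax(sims))
        return self.labels[best] if sims[best] >= threshold else None

# Usage with random stand-ins for landmark signatures and azimuths.
rng = np.random.default_rng(0)
memory = OneShotPlaceMemory()
sigs, az = rng.random((5, SIGNATURE_DIM)), rng.uniform(0, 360, 5)
memory.learn(visuospatial_code(sigs, az), "place_A")
print(memory.recognize(visuospatial_code(sigs, az)))  # -> "place_A"
```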
Grid cells are crucial for path integration and for representing the external world. The spikes of grid cells spatially form clusters called grid fields, which encode important information about allocentric position. To decode this information, studying the spatial structure of grid fields is a key task for both experimenters and theorists. Experiments reveal that grid fields form a hexagonal lattice during planar navigation and become anisotropic beyond planar navigation. During volumetric navigation, they lose global order but retain local order. How grid cells form different field structures under these different navigation modes remains an open theoretical question, and to date few models connect to the latest discoveries or explain the formation of the various grid field structures. To fill this gap, we propose an interpretive, plane-dependent model of three-dimensional (3D) grid cells for representing both two-dimensional (2D) and 3D space. The model first evaluates motion with respect to planes, such as the planes animals stand on and the tangent planes of the motion manifold. Projecting the motion onto these planes leads to anisotropy, and error in the perception of the planes degrades grid field regularity. A training-free recurrent neural network (RNN) then maps the processed motion information to grid fields. We verify that our model can generate regular and anisotropic grid fields, as well as grid fields with merely local order; the model is also compatible with mode switching. Furthermore, simulations predict that the degradation of grid field regularity is inversely proportional to the interval between two consecutive perceptions of planes. In conclusion, our model is one of the few pioneering models that address grid field structures in the general case. Compared with other pioneering models, our theory argues that the anisotropy and the loss of global order result from uncertain perception of planes rather than from insufficient training.
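A short sketch of the plane-projection step may help: a 3D velocity is projected onto a perceived plane, and the in-plane motion drives a toy 2D phase integrator standing in for the training-free RNN that maps motion to grid fields; a noisy plane normal illustrates how uncertain plane perception can degrade field regularity. The ground-plane normal, noise level, and phase scale are assumptions for illustration only.

```python
# Plane projection of motion and a toy phase integrator, illustrating the
# idea that projecting 3D velocity onto a perceived plane drives the grid
# representation, and that noise in the perceived plane degrades it.
# All parameter values are assumptions, not values from the paper.
import numpy as np

def project_onto_plane(v: np.ndarray, normal: np.ndarray) -> np.ndarray:
    """Remove the velocity component along the (unit) plane normal: v - (v.n)n."""
    n = normal / np.linalg.norm(normal)
    return v - np.dot(v, n) * n

def plane_basis(normal: np.ndarray) -> np.ndarray:
    """Two orthonormal in-plane axes spanning the plane with the given normal."""
    n = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(n, helper); e1 /= np.linalg.norm(e1)
    e2 = np.cross(n, e1)
    return np.stack([e1, e2])            # shape (2, 3)

def integrate_phase(velocities: np.ndarray, normal_noise: float = 0.0,
                    scale: float = 0.5, seed: int = 0) -> np.ndarray:
    """Accumulate a 2D grid phase from 3D velocities under a (possibly noisy) plane."""
    rng = np.random.default_rng(seed)
    phase = np.zeros(2)
    true_normal = np.array([0.0, 0.0, 1.0])   # assume a horizontal ground plane
    for v in velocities:
        perceived = true_normal + normal_noise * rng.normal(size=3)
        v_in_plane = project_onto_plane(v, perceived)
        phase = (phase + scale * plane_basis(perceived) @ v_in_plane) % 1.0
    return phase

# Usage: same trajectory, perfect vs. noisy plane perception.
vel = np.random.default_rng(1).normal(size=(200, 3)) * 0.1
print(integrate_phase(vel, normal_noise=0.0))
print(integrate_phase(vel, normal_noise=0.3))
```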