Mining local and global spatiotemporal features for tactile object recognition

Tactile object recognition (TOR) is highly important for the environmental perception of robots. Previous works usually utilize single-scale convolution, which cannot simultaneously extract local and global spatiotemporal features of tactile data and therefore leads to low accuracy in the TOR task. To address this problem, this article proposes a local and global residual (LGR-18) network, which mainly consists of multiple local and global convolution (LGC) blocks. An LGC block contains two pairs of local convolution (LC) and global convolution (GC) modules. The LC module mainly utilizes a temporal shift operation and a 2D convolution layer to extract local spatiotemporal features. The GC module extracts global spatiotemporal features by fusing multiple 1D and 2D convolutions, which expand the receptive field in the temporal and spatial dimensions. Consequently, our LGR-18 network can extract local-global spatiotemporal features without using 3D convolutions, which usually require a large number of parameters. The effectiveness of the LC module, GC module and LGC block is verified by ablation studies. Quantitative comparisons with state-of-the-art methods reveal the excellent capability of our method.


Introduction
Robots perceive objects around them mainly through touch and vision. Although vision can intuitively capture the appearance of an object, it cannot capture basic object properties such as mass, hardness and texture. In addition, many limitations exist for vision perception (Lv et al., 2023a). Consequently, the tactile object recognition (TOR) task is proposed to predict the category of the object being grasped by a robot, which can provide support for subsequent grasping operations without being constrained by the aforementioned conditions.
TOR has a broad range of applications, including descriptive analysis in the food industry (Philippe et al., 2004), electronic skin (Liu et al., 2020) and embedded prostheses in the biomedical field (Wu et al., 2018), and postdisaster rescue (Gao et al., 2021). The rapid development of deep learning (Lv et al., 2023c) has led to tremendous progress in various fields (Qian et al., 2020, 2023a,b,d,e,f; Huo et al., 2023; Li et al., 2023; Xie et al., 2024), and deep-learning-based TOR methods have become mainstream (Liu et al., 2016; Ibrahim et al., 2022; Yi et al., 2022). Currently, deep-learning-based TOR methods can be divided into two categories: the first uses a 2D CNN to extract features from each frame of tactile data and then fuses the per-frame features to recognize the grasped object, and the second uses a 3D CNN to extract spatial and temporal features from tactile frames for recognition.
Traditional TOR methods mostly belong to the first category: sensors (primarily pressure sensor arrays) acquire tactile information, and the tactile data are then sent to a 2D CNN for feature extraction and category prediction. Gandarias et al. (2017) used a 2D CNN to extract high-resolution tactile features and trained a support vector machine (SVM) using these features. The trained SVM was used to predict the object category. Bottcher et al. (2021) collected tactile data via two different tactile sensors and subsequently input the data into a 2D CNN to extract features and infer results. Other related works include Sundaram et al. (2019), Chung et al. (2020), and Carvalho et al. (2022).
Recently, the second category of methods has achieved remarkable results and has become mainstream. Qian et al. (2023c) used a gradient adaptive sampling (GAS) strategy to process the acquired tactile data and subsequently fed the data into a 3D CNN to extract multiple-scale temporal features. The features were fused at the fully connected layer to output the predicted category. Inspired by the optical flow method, Cao et al. (2018) used not only the original tactile data but also tactile flow and intensity differences as input data. These data underwent convolution, weighting, and other operations on different branches and were ultimately fused at the fully connected layer to infer the results. Other related works include Kirby et al. (2022) and Lu et al. (2023).
The first category of methods extracts only the features of each frame and does not use temporal information between frames; therefore, its overall performance is limited. In the second category, spatiotemporal features are extracted via 3D CNNs, and the overall performance is better than that of the first category. However, existing methods utilize single-scale convolution operations to extract features in the spatial and temporal dimensions, and they cannot simultaneously extract local and global features very well.

FIGURE 1 Framework of our method.
This paper proposes a local-global spatiotemporal feature extraction scheme to solve the above problems. First, a local convolution (LC) module is proposed, which utilizes the interaction of adjacent temporal information and 2D convolution to extract local spatiotemporal features. Next, a global convolution (GC) module is proposed to extract global spatiotemporal features; the module combines multiple 1D and 2D [(1+2)D] convolutions, which extend the receptive field (Lv et al., 2023b) in the spatial and temporal dimensions. Finally, this article achieves accurate TOR by comprehensively utilizing local and global spatiotemporal features.
The main contributions are as follows:
1. This paper proposes a local convolution (LC) module which extracts local spatiotemporal features by using the interaction of adjacent temporal information and a 2D convolution operation.
2. This paper proposes a global convolution (GC) module which extracts global spatiotemporal features by fusing multiple 1D and 2D convolutions which can expand the receptive field in temporal and spatial dimensions.
3. Our method achieves the highest object recognition accuracy on two public datasets by comprehensively using local-global spatiotemporal features.

Gradient adaptive sampling
Unlike uniform sampling and sparse sampling, GAS uses the pressure gradient to guide adaptive sampling. Specifically, the accumulated gradient over the period T is normalized and divided into multiple intervals, and one point is randomly selected from each interval. This process yields multiple data frames, which are subsequently fed into the network.
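The sampling procedure described above can be illustrated with a minimal sketch. It is not the implementation of Qian et al. (2023c): the function name `gradient_adaptive_sample`, the definition of the pressure gradient as the summed absolute change between consecutive frames, and the fallback used when an interval contains no frame are all assumptions made for illustration.

```python
import random


def gradient_adaptive_sample(frames, num_samples, rng=None):
    """Return num_samples frame indices, denser where the pressure changes fast.

    frames: list of frames, each a flat list of pressure readings.
    """
    rng = rng or random.Random()
    n = len(frames)
    if n == 0 or num_samples <= 0:
        return []

    # Assumed gradient: summed absolute change between consecutive frames.
    grads = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        grads.append(sum(abs(a - b) for a, b in zip(cur, prev)))

    # Accumulate the gradient over time and normalise it to [0, 1].
    total = sum(grads)
    if total == 0.0:
        # Static signal: fall back to evenly spaced positions.
        cum = [i / max(n - 1, 1) for i in range(n)]
    else:
        cum, running = [], 0.0
        for g in grads:
            running += g
            cum.append(running / total)

    picked = []
    for k in range(num_samples):
        lo, hi = k / num_samples, (k + 1) / num_samples
        is_last = k == num_samples - 1
        # Frames whose normalised accumulated gradient falls in this interval.
        candidates = [i for i, c in enumerate(cum)
                      if lo <= c < hi or (is_last and c >= hi)]
        if candidates:
            picked.append(rng.choice(candidates))
        else:
            # Empty interval (assumed fallback): nearest frame to its midpoint.
            mid = (lo + hi) / 2
            picked.append(min(range(n), key=lambda i: abs(cum[i] - mid)))
    return sorted(picked)


# Ten one-taxel frames: flat for the first five, then rising by 1 per frame.
frames = [[float(max(0, i - 4))] for i in range(10)]
indices = gradient_adaptive_sample(frames, num_samples=4, rng=random.Random(0))
```

In this toy sequence, the five static frames share a single interval, so three of the four samples are drawn from the frames in which the pressure is changing.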

MR3D-18 network
The MR3D-18 network addresses two problems: tactile frames are small, and overfitting occurs easily. To handle these problems, it removes a pooling operation from the ResNet3D-18 (Hara et al., 2018) network and adds a dropout layer.

Proposed method
Overview
As shown in Figure 1, P frames of tactile data are adaptively selected from all frames by using GAS. They are then fed into the local and global residual (LGR-18) network to extract local and global spatiotemporal features. Finally, the features are passed to a fully connected layer and a softmax classifier to predict categories. Our LGR-18 network is primarily composed of multiple local and global convolution (LGC) blocks, which are proposed in this paper, and each LGC block consists of two pairs of LC and GC modules.

The LC and GC modules, the LGC block and the LGR-18 network are explained in detail in the following sections.

Local convolution module
The LC module focuses on extracting the local spatiotemporal features of tactile data. As shown in Figure 2, the size of the input features X is [N, T, C, H, W], where N denotes the batch size, T and C denote the numbers of input frames and feature channels, respectively, and H and W denote the height and width of the features, respectively. First, the LC module utilizes the temporal shift (TS) operation to extract local temporal features.
The TS operation is shown in Figure 3, which depicts a tensor with C channels and T frames; the features of different time stamps are shown in different colors in each row. Along the temporal dimension, the TS operation shifts one channel in the forward direction and one channel in the backward direction. Features shifted beyond the temporal boundary are abandoned, and the vacated positions are zero-padded, which addresses the problems of excessive and missing features. It is worth noting that filling the zero-padded positions with the abandoned features is infeasible because it would destroy the temporal sequence. After the TS operation, a 2D convolution layer is utilized to extract the local spatial features. Next, we utilize global average pooling (GAP) and a sigmoid function to obtain the local spatiotemporal weight S of each channel. We then extract local spatiotemporal features by performing a channel-wise product between the input feature X and the local spatiotemporal weight S. A residual connection is employed to prevent the loss of crucial information in the original features. In summary, the LC module extracts local features with TS operations and traditional 2D convolutions at its core.
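The TS step alone can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `temporal_shift`, the nested-list layout `x[t][c]`, and the choice of which channels are shifted (the first half of the shifted channels forward, the second half backward) are assumptions. The 2D convolution, GAP, sigmoid weighting and residual connection of the LC module are not included.

```python
def temporal_shift(x, num_shift=2, zero=0.0):
    """Shift the first num_shift channels of x along time; x[t][c] is one feature.

    Half of the shifted channels move forward in time (frame t receives the
    feature of frame t-1) and half move backward (frame t receives t+1).
    Features pushed past the boundary are abandoned and vacated slots are zeroed.
    """
    if num_shift % 2 != 0:
        raise ValueError("num_shift must be even: the shift is bidirectional")
    num_frames = len(x)
    num_channels = len(x[0]) if num_frames else 0
    if num_shift > num_channels:
        raise ValueError("cannot shift more channels than exist")

    half = num_shift // 2
    out = [list(frame) for frame in x]  # untouched channels are copied as-is
    for t in range(num_frames):
        for c in range(half):  # forward shift with zero padding at t = 0
            out[t][c] = x[t - 1][c] if t > 0 else zero
        for c in range(half, num_shift):  # backward shift, zero at t = T - 1
            out[t][c] = x[t + 1][c] if t + 1 < num_frames else zero
    return out


# Three frames, three channels; channel 2 is left unshifted.
x = [[10, 20, 30],
     [11, 21, 31],
     [12, 22, 32]]
shifted = temporal_shift(x, num_shift=2)
# shifted == [[0.0, 21, 30], [10, 22, 31], [11, 0.0, 32]]
```

The example makes the abandoning and zero-padding visible: the feature 12 in channel 0 and the feature 20 in channel 1 leave the sequence, and the vacated positions hold zeros instead of wrapping around.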

Global convolution module
Inspired by the conventional depthwise separable convolution, multiple (1+2)D convolutions are utilized to extract global spatiotemporal features from tactile data, where the 1D and 2D convolutions are used to extract temporal and spatial features, respectively. However, the innovation of the GC module does not lie in the depthwise separable convolution. As shown in Figure 4, the core idea of the GC module is that the receptive field of the convolution kernels is continuously enlarged via iterative residual connections and convolutions, so that global spatiotemporal features are extracted. The detailed procedure can be seen in Equations 1, 2 and Figure 4.
T,C,H,W (1)
In Equation 1, Y_1 is the output of the first branch and is identical to the input of the first branch. Y_2, Y_3 and Y_4 represent the outputs of the 2nd, 3rd and 4th branches, respectively, in the GC module. conv_(1+2)D denotes the (1+2)D convolutions, whose kernel parameters are 3 and 3 × 3. As shown in Equation 2, the final output of the GC module, denoted as Z, is obtained by fusing the outputs of the four branches.
Architecture of the LGR-18 network
As shown in Table 1, our LGR-18 network is modified from the MR3D-18 network. First, in the LGR-18 network, (1+2)D convolutions replace the 7 × 7 × 7 convolution layer of the MR3D-18 network. Second, multiple 3 × 3 × 3 3D convolution layers are replaced with the LGC blocks proposed in this paper, where two 3D convolution layers are approximately equivalent to one LGC block. Third, the LGR-18 network adds (1+2)D convolutions with a size of 1 in the Res 3, Res 4, and Res 5 layers to change the number of channels. Finally, the LGC block consists of two pairs of LC and GC modules connected by residual connections. Table 1 shows that the LGR-18 network has two advantages over the MR3D-18 network.
2. The LGR-18 network can extract local and global spatiotemporal features through the LGC blocks.
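The abstract notes that 3D convolutions usually require a large number of parameters. A rough count illustrates why factorizing a 3 × 3 × 3 kernel into a 1D temporal kernel of size 3 and a 2D spatial kernel of size 3 × 3 can reduce that cost. The sketch below rests on assumptions that are not taken from Table 1: both factorized layers are dense (not depthwise), biases are ignored, and the intermediate channel count equals the output channel count. The function names are hypothetical, and the actual layer configuration of the LGR-18 network may differ.

```python
def conv3d_weights(c_in, c_out, k=3):
    """Weights of one dense k x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * k ** 3


def conv1p2d_weights(c_in, c_out, k=3, c_mid=None):
    """Weights of a 1D temporal conv (k) followed by a 2D spatial conv (k x k).

    Assumption: both layers are dense and c_mid defaults to c_out.
    """
    c_mid = c_out if c_mid is None else c_mid
    temporal = c_in * c_mid * k
    spatial = c_mid * c_out * k * k
    return temporal + spatial


full_3d = conv3d_weights(64, 64)       # 64 * 64 * 27 = 110592
factorized = conv1p2d_weights(64, 64)  # 64 * 64 * (3 + 9) = 49152
ratio = factorized / full_3d           # 12 / 27
```

Under these assumptions the factorized layer needs 12/27 (about 44%) of the weights of the 3D layer, independent of the channel count when the input, intermediate and output channel counts are equal.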

Training scheme
To enhance the performance of the LGR-18 network, we utilize the large-scale Kinetics-400 dataset (Carreira and Zisserman, 2017), which consists of 400 human action categories with at least 400 video clips in each category, to pre-train our network. Subsequently, we utilize the UCF101 (Soomro et al., 2012) and target datasets to pre-train the LGR-18 network.
To address the problem of large dataset sizes, the size of the input data is adjusted to 32 × 32 when the LGR-18 network is pre-trained on the Kinetics-400 and UCF101 datasets. The traditional cross-entropy loss, denoted as L, is employed to optimize the LGR-18 network and is formulated in Equation 3:
$$L = -\sum_{k=1}^{K} v_k \log v'_k \tag{3}$$
where K denotes the number of categories, v'_k denotes the prediction score of category k, and v_k denotes the label of category k.
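As a worked illustration of the cross-entropy loss, the sketch below evaluates it for a single sample. It assumes that the prediction scores are softmax probabilities and that the label is one-hot; the function names `softmax` and `cross_entropy` and the small clamp `eps` (added only to avoid log(0)) are illustrative choices, not part of the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, label, eps=1e-12):
    """L = -sum_k v_k * log(v'_k) for one sample.

    pred:  prediction scores v'_k (probabilities over K categories)
    label: labels v_k (one-hot over the same K categories)
    """
    return -sum(v * math.log(max(p, eps)) for v, p in zip(label, pred))


# Three categories, true category 0, predicted with probability 0.5.
loss = cross_entropy([0.5, 0.25, 0.25], [1, 0, 0])  # equals log(2)

# Uniform prediction over 4 categories gives a loss of log(4).
uniform_loss = cross_entropy(softmax([0.0, 0.0, 0.0, 0.0]), [0, 1, 0, 0])
```

With a one-hot label only the term of the true category survives, so the loss reduces to the negative log-probability assigned to that category.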

Experiment
Experiment setup
Datasets
The LGR-18 network is verified on the MIT-STAG (Sundaram et al., 2019) and iCub (Soh et al., 2012) datasets. The MIT-STAG dataset includes 26 categories of common objects and empty hands, i.e., allen key set, ball, battery, board eraser, bracket, stress toy, cat, chain, clip, coin, gel, kiwano, lotion, mug, multimeter, pen, safety glasses, scissors, screwdriver, spoon, spray can, stapler, tape, tea box, full cola can, and empty cola can. A total of 88,269 valid frames are collected, each with a size of 32 × 32.
The iCub dataset is acquired by two anthropomorphic dexterous hands of the iCub humanoid robot platform. Each anthropomorphic hand is equipped with five fingers, each with 20 movable joints. Additionally, each finger is equipped with pressure sensors to acquire tactile data. The iCub dataset includes 2,200 frames in 10 categories, i.e., monkey toy, med vitamin water, med coke, lotion, vitamin water, full cola, empty vitamin water, empty coke, book and blue bear (toy), and the size of each frame is 5 × 12. For each category in the iCub dataset, 132 frames (totaling 1,320 frames) are selected as the training set, and 88 frames (totaling 880 frames) are selected as the testing set.

Implementation details
This paper uses the top 1 score, kappa coefficient (KC) and confusion matrix for evaluation. Stochastic gradient descent is used to optimize our model. The momentum and decay rate are 0.9 and 0.0001, respectively. The initial learning rate is 0.002 and the number of epochs is 50. The learning rate decreases to 10% of its previous value after every 10 epochs. The batch sizes are 32 and 8 for the MIT-STAG and iCub datasets, respectively.
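The step-wise learning rate schedule stated above can be written out explicitly. The sketch below only reproduces the arithmetic of the schedule; the function name `step_lr` and the zero-based epoch index are assumptions, and the sketch does not show how the schedule was configured in the authors' training code.

```python
def step_lr(epoch, base_lr=0.002, step_size=10, gamma=0.1):
    """Learning rate at a zero-based epoch: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 50 training epochs.
schedule = [step_lr(epoch) for epoch in range(50)]
```

Under this reading, epochs 0 to 9 use 0.002, epochs 10 to 19 use 0.0002, and the last ten epochs use a rate four orders of magnitude below the initial one.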
All experiments were implemented in the PyTorch framework and run on a workstation with two NVIDIA GeForce RTX 2080 Ti GPUs (2 × 11 GB).
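The evaluation metrics named in this subsection can both be derived from a confusion matrix. The sketch below assumes that the kappa coefficient (KC) refers to Cohen's kappa, which the text does not state explicitly; the function name `top1_and_kappa` and the convention that rows are true categories and columns are predicted categories are likewise assumptions.

```python
def top1_and_kappa(confusion):
    """Return (top-1 accuracy, Cohen's kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true category i predicted as j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(num_classes)) / total

    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(num_classes)
    ) / (total * total)

    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa


# Two categories with 10 test samples each.
accuracy, kappa = top1_and_kappa([[8, 2],
                                  [1, 9]])
# accuracy = 17 / 20 = 0.85, expected agreement = 0.5, kappa = 0.35 / 0.5 = 0.7
```

Kappa discounts the agreement that would be expected by chance, which is why it is lower than the top-1 accuracy in this example.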

Ablation study
Ablation study of LGC block
As Table 2 shows, the LGC block is compared with four other convolution blocks to verify its effectiveness. Architecture A is used as a baseline and is composed of two 3D convolution layers connected by a residual connection. Architecture B replaces one of the 3D convolution layers in architecture A with an LC module, and architecture C replaces one of the 3D convolution layers in architecture A with a GC module. In architecture D, a pair of LC and GC modules is utilized to replace one of the two 3D convolution layers in architecture A. Architecture E is our method, and it uses two pairs of LC and GC modules to replace all the 3D convolution layers in architecture A.
Comparing architectures B, C, and D with A reveals the effectiveness of the LC module, the GC module, and the combination of the LC and GC modules, respectively. Comparing architecture E with A further demonstrates that the best performance is achieved by using a combination of LC and GC modules.
Frontiers in Neurorobotics frontiersin.org

Ablation study of spatial pooling operation in LC module
As shown in Figure 2, a spatial pooling operation is involved in the LC module; therefore, max pooling and GAP are compared to determine which is more appropriate for the LC module. As shown in Table 3, the top 1 score and KC of GAP are higher than those of max pooling; consequently, GAP is adopted in the LC module.
Parameter analysis
Parameter analysis for the number of shifting channels
As shown in Figure 3, the number of shifting channels is an important hyperparameter for the LC module; therefore, it is quantitatively analyzed in this section. It is worth noting that the number of shifting channels must be even because the shifting operation is bidirectional. As shown in Figure 5, the top 1 score achieves the highest value when the number of shifting channels is set to 2, which means that shifting many channels is not suitable for the LC module because it can cause information loss.

Parameter analysis for dropout rate
As shown in Table 1, a dropout layer is used to prevent overfitting; therefore, the dropout rate is quantitatively analyzed in this section. As shown in Figure 6, the top 1 score achieves the highest value when the dropout rate is set to 0.3.

Comparisons with state-of-the-art methods
To demonstrate the overall effectiveness of our model, we conducted a comprehensive quantitative comparison with 5 methods on the MIT-STAG dataset (Sundaram et al., 2019), i.e., STAG (Sundaram et al., 2019), Smart-hand (Wang et al., 2021), ResNet10-v1 (Zhang et al., 2021), Tactile-ViewGCN (Sharma et al., 2022), and GAS-MR3D (Qian et al., 2023c), and 5 methods on the iCub dataset (Soh et al., 2012), i.e., DS (Soh and Demiris, 2014), GS (Soh and Demiris, 2014), STORK-GP (Soh et al., 2012),
As shown in Table 4, our method achieved the highest top 1 score and KC. This demonstrates that our model has the best prediction accuracy and the lowest level of confusion on the MIT-STAG dataset. A comparison of the confusion matrices in Figure 7 further supports these conclusions. As Table 5 and Figure 8 show, both our method and that of Qian et al. achieved a recognition accuracy of 100%, i.e., the highest recognition accuracy and the lowest level of confusion on the iCub dataset.
In summary, our method outperforms the 5 compared methods on the MIT-STAG dataset and matches the best result on the iCub dataset.

Conclusion
A novel LGR-18 network is proposed to address the problem that current TOR models cannot simultaneously extract local and global features very well. The LGR-18 network consists primarily of multiple traditional 1D and 2D convolution kernels and the LGC blocks proposed in this paper. The LGC block is formed by combining LC and GC modules through residual connections. The LC module mainly utilizes a temporal shift operation and a 2D convolution to extract local spatiotemporal features. The GC module extracts global spatiotemporal features by fusing multiple 1D and 2D convolutions, which expand the receptive field in the temporal and spatial dimensions. In this paper, we utilize the LGR-18 network to extract local and global spatiotemporal features while mitigating the large parameter counts of existing 3D CNN models. Ablation studies verify the validity of the LC module, GC module, and LGC block. A comprehensive quantitative comparison between our method and 5 advanced methods on the MIT-STAG and iCub datasets reveals the excellent capability of our method.
Our future work includes two parts. The first is combining our method with video-based object detection methods, and the other is deploying our method on more robots.
collected tactile data via two different tactile sensors and subsequently input the data into a 2D CNN to extract features and infer results. Other related works include Sundaram et al. (2019), Chung et al. (2020), and Carvalho et al. (2022), etc.

FIGURE 2 Illustration of LC module.

FIGURE 3 Illustration of temporal shift operation.

FIGURE 4 Illustration of GC module.
TABLE 1 Comparison between MR3D-18 and LGR-18 networks, where the size of input tactile data is × × .
TABLE 2 Ablation study of LGC block on the MIT-STAG dataset.
FIGURE 5 The top 1 score (%) of different numbers of shifting channels on the MIT-STAG dataset.
TABLE 4 Comparisons with 5 methods in terms of the top 1 score (%) and KC (%) on the MIT-STAG dataset.
TABLE 5 Comparisons with 5 methods in terms of the top 1 score (%) and KC (%) on the iCub dataset.

The MIT-STAG dataset includes 26 categories of common objects and empty hands, i.e., allen key set, ball, battery, board eraser, bracket, stress toy, cat, chain, clip, coin, gel, kiwano, lotion, mug, multimeter, pen, safety glasses, scissors, screwdriver, spoon, spray can, stapler, tape, tea box, full cola can, and empty cola can. A total of 88,269 valid frames are collected, each with a size of 32 × 32. To achieve accurate prediction results under fair conditions, 1,353 frames (totaling