# NEW ADVANCES AT THE INTERSECTION OF BRAIN-INSPIRED LEARNING AND DEEP LEARNING IN AUTONOMOUS VEHICLES AND ROBOTICS

EDITED BY : Guang Chen, Pascual Campoy, Changhong Fu and Caixia Cai PUBLISHED IN : Frontiers in Neurorobotics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-971-7 DOI 10.3389/978-2-88963-971-7

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# NEW ADVANCES AT THE INTERSECTION OF BRAIN-INSPIRED LEARNING AND DEEP LEARNING IN AUTONOMOUS VEHICLES AND ROBOTICS

Topic Editors: Guang Chen, Tongji University, China Pascual Campoy, Polytechnic University of Madrid, Spain Changhong Fu, Tongji University, China Caixia Cai, Agency for Science, Technology and Research (A\*STAR), Singapore

Citation: Chen, G., Campoy, P., Fu, C., Cai, C., eds. (2020). New Advances at the Intersection of Brain-Inspired Learning and Deep Learning in Autonomous Vehicles and Robotics. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-971-7

# Table of Contents

*04 Robust Learning Control for Shipborne Manipulator With Fuzzy Neural Network*

Zhiqiang Xu, Wanli Li and Yanran Wang

*15 Mobile Robot Path Planning Based on Ant Colony Algorithm With A\* Heuristic Method*

Xiaolin Dai, Shuai Long, Zhiwen Zhang and Dawei Gong

*24 Neural Network Based Uncertainty Prediction for Autonomous Vehicle Application*

Feihu Zhang, Clara Marina Martinez, Daniel Clarke, Dongpu Cao and Alois Knoll

*41 SVM-Based Classification of sEMG Signals for Upper-Limb Self-Rehabilitation Training*

Siqi Cai, Yan Chen, Shuangyuan Huang, Yan Wu, Haiqing Zheng, Xin Li and Longhan Xie

*51 Toward a Brain-Inspired System: Deep Recurrent Reinforcement Learning for a Simulated Self-Driving Agent*

Jieneng Chen, Jingye Chen, Ruiming Zhang and Xiaobin Hu

*60 Deep Recurrent Neural Networks Based Obstacle Avoidance Control for Redundant Manipulators*

Zhihao Xu, Xuefeng Zhou and Shuai Li

*73 A Novel Model for Arbitration Between Planning and Habitual Control Systems*

Farzaneh Sheikhnezhad Fard and Thomas P. Trappenberg


Florian Mirus, Peter Blouw, Terrence C. Stewart and Jörg Conradt

# Robust Learning Control for Shipborne Manipulator With Fuzzy Neural Network

#### Zhiqiang Xu, Wanli Li\* and Yanran Wang

School of Mechanical Engineering, Tongji University, Shanghai, China

The shipborne manipulator plays an important role in autonomous collaboration between marine vehicles. In real applications, a conventional proportional-derivative (PD) controller is not suitable for the shipborne manipulator to conduct safe and accurate operations under ocean conditions, due to its bad tracing performance. This paper presents a real-time and adaptive control approach for the shipborne manipulator to achieve position control. This novel control approach consists of a conventional PD controller and fuzzy neural network (FNN), which work in parallel to realize PD+FNN control. Qualitative and quantitative tests of simulations and real experiments show that the proposed PD+FNN controller achieves better performance in comparison with the conventional PD controller, in the presence of uncertainty and disturbance. The presented PD+FNN eliminates the requirements for precise tuning of the conventional PD controller under different ocean conditions, as well as an accurate dynamics model of the shipborne manipulator. In addition, it effectively implements a sliding mode control (SMC) theory-based learning algorithm, for fast and robust control, which does not require matrix inversions or partial derivatives. Furthermore, simulation and experimental results show that the angle compensation deviation of the shipborne manipulator can be improved in the range of ±1 ◦ .

#### Edited by:

Caixia Cai, Agency for Science, Technology and Research (A\*STAR), Singapore

#### Reviewed by:

Yingbai Hu, Technische Universität München, Germany Dawei Gong, Juilliard School, United States

#### \*Correspondence:

Wanli Li 1710213@tongji.edu.cn

Received: 14 January 2019 Accepted: 13 March 2019 Published: 04 April 2019

#### Citation:

Xu Z, Li W and Wang Y (2019) Robust Learning Control for Shipborne Manipulator With Fuzzy Neural Network. Front. Neurorobot. 13:11. doi: 10.3389/fnbot.2019.00011 Keywords: shipborne manipulator, real-time adaptive control, conventional PD controller, fuzzy neural network, sliding mode control, experiment verification

## 1. INTRODUCTION

The shipborne manipulator has become the most important tool in achieving autonomous cargo reloading between marine vehicles. With the use of the shipborne manipulator, onboard physical labor can be greatly reduced. However, unpredictable ship motion has a great impact on the maneuverability of the manipulator in real applications due to the complexity of the marine environment. If the sea state reaches Level-4, i.e., the height of a sea wave is larger than 1.52 m and wind speed exceeds 10.8 m/s, arm movement of the manipulator is extremely limited because of the influence of sling inertia and non-linear ship pose variation. In this case, the operational capacity of the manipulator is reduced by more than 50% or the manipulator is even temporarily suspended.

For the shipborne manipulator, changing boom inclination is realized through the expansion and contraction of its amplitude cylinder. The energy-saving and vibration-damping function of its accumulator is of great significance for improving manipulator control. In literature, the cylinder-accumulator in a manipulator has been studied in different types of engineering applications (Xiao et al., 2014; Shen et al., 2015; Zhao et al., 2017; Xia et al., 2018).

Specifically, the accumulator is a key component in the design of hydraulic hybrid structures, which ensures acceptable shock absorption performance and system energy consumption. Its different designs are used in other construction machines such as a rock drill (Yang et al., 2017), Scraper (Junke and Zhen, 2017), and Fast Forging Press (Zhang et al., 2016). However, there are few investigations related to the influence of the accumulator on the valve control system. According to the valve-controlled cylinder-accumulator model, the accumulator is used as an energy-saving and oil damping source in parallel with the rodless cavity of the amplitude cylinder. **Figure 1** shows the details of connecting the accumulator with the cylinder.

In literature, the conventional PD controller is often used to control different types of manipulators (Cervantes and Alvarez-Ramirez, 2001; Alvarez-Ramirez et al., 2003; Su et al., 2010). However, it is not suitable for controlling the hydraulic system discussed in this work, due to its high-order non-linearity, time-varying and hysteresis characteristics. It cannot control the hydraulic system in time and is vulnerable to environmental interference. Additionally, a significant steady-state error still exists, even if plenty of time is used to tune the appropriate

values for the conventional PD controller. In order to control the manipulator, calculating torque is the simplest control approach, but this approach relies on the accurate mechanical model of the system. To overcome this issue, the model-free approach has gained respectable attention since it does not require the precise model of the system and is more robust in response to uncertainty and disturbance.

The fuzzy logic controller (FLC) is widely applied to handle uncertainty and disturbance in many systems (Hasanien and Matar, 2015; Dabbaghjamanesh et al., 2016; Vaidyanathan and Azar, 2016), especially in different kinds of robots (Fu et al., 2013; Tai et al., 2016; Sarabakha et al., 2018). However, the FLC also requires a lot of time in order to tune the proper parameters to achieve a satisfied control performance. Recently, the FLC has been combined with an artificial neural network (ANN), i.e., fuzzy neural network (FNN), to overcome the aforementioned weakness of the FLC. In literature, the FNN has been successfully applied in identification and non-linear system controlling (Lin et al., 2015; Tang et al., 2017; He and Dong, 2018). At the same time, the sliding mode control (SMC) theory-based algorithm has been presented as a faster learning approach for tuning the FNN parameters, due to its faster convergence speed and higher robustness to uncertainty and disturbance (Lin et al., 2014). Moreover, the FNN, trained with the SMC theory-based algorithm, has been successfully used in controlling a spherical rolling robot (Kayacan et al., 2013), a robotic arm (Wai and Muthusamy, 2013), and a gyroscope (Yan et al., 2017).

In this work, an FNN is proposed to work in parallel with a conventional PD controller to achieve a PD+FNN controller. It not only overcomes the original defects of conventional PD control but also significantly enhances self-learning abilities and adaptabilities. In order to validate the proposed control strategy, simulation and experimental tests have been implemented. The qualitative and quantitative results show that the presented strategy is feasible and practical. In addition, it outperforms the conventional PD controller. The main contributions of this work are listed below:


The organization of this paper is as follows: In section 2, the dynamic model of the shipborne manipulator is introduced. In section 3, the PD+FNN control strategy is described. In section 4, different simulation tests are conducted in order to verify the proposed control strategy. In section 5, the real experimental tests on the swaying platform are performed to validate the proposed controller. Finally, conclusions are drawn in section 6.

#### 2. DYNAMIC MODEL

**Table 1** shows the simulation parameters. **Figure 2** shows the flow direction of the oil. The accumulator is linked with the rodless cavity of the cylinder. When the cylinder extends, the accumulator and the valve supply oil to the cylinder simultaneously, making the cylinder stretch out faster. The coupling dynamics model of the parallel accumulator of the valve-controlled cylinder system is established based on the flow continuity equation and the dynamic equation.

#### 2.1. Cylinder Dynamic Equation

The cylinder dynamic equation can be obtained by ignoring the cylinder cavity pressure, which is defined as:

$$pA = m\_l \ddot{\mathbf{x}} + B\_P \dot{\mathbf{x}} + k\_z \mathbf{x} \,, \tag{1}$$

where p denotes the working pressure, N/m<sup>2</sup> ; A denotes the action area of rodless cavity, m<sup>2</sup> ; m<sup>t</sup> denotes the mass of the piston, kg; x denotes the displacement of pistol, m; B<sup>P</sup> denotes the viscosity damping coefficient, N · s/m; k<sup>z</sup> denotes the spring stiffness, N/m.

#### 2.2. Cylinder Flow Equation

The cylinder flow continuity equation is:

$$Q\_L = A\dot{\mathbf{x}} + \mathbf{C}\_l \mathbf{p} \,, \tag{2}$$

where C<sup>i</sup> denotes the internal leakage coefficient, m<sup>5</sup> /N · s; Q<sup>L</sup> denotes the total flow into rodless cavity of cylinder, m<sup>3</sup> /s, which is defined as:

$$Q\_L = Q\_1 + Q\_X \,, \tag{3}$$

where Q<sup>1</sup> denotes the oil flow from valve, m<sup>3</sup> /s; Q<sup>X</sup> denotes the oil flow from the accumulator, m<sup>3</sup> /s. Accumulator energy release process can be regarded as interference according to Gaussian distribution, i.e.,:

$$Q\_{\overline{X}} = \mathcal{N}\left(\mu\_{\nu}, \sigma\_{\nu}^{2}\right), \tag{4}$$

where µ<sup>v</sup> and σ<sup>v</sup> are the mean value and the standard deviation of the accumulator output flow, respectively.

#### 2.3. Valve Flow Equation

The spool flow is a function of the working pressure and the displacement of the spool. The spool can be viewed as a zeroopen four-way spool valve. The valve flow equation is defined as follows:

$$Q\_1 = C\_d \omega \chi\_\nu \sqrt{\frac{2}{\rho} \left(p\_S - p\_1\right)} = k\_q \chi\_\nu \,, \tag{5}$$

where C<sup>d</sup> denotes the flow coefficient of valve; w denotes the valve area gradient, m<sup>2</sup> /m; x<sup>v</sup> denotes the displacement of valve core, m; x<sup>v</sup> = k<sup>v</sup> ·µ, k<sup>v</sup> denotes the spool scale factor; µ denotes current signal; ρ denotes the oil density, kg/m<sup>3</sup> ; ps denotes the system pressure, N/m<sup>2</sup> ; p<sup>1</sup> denotes the pressure of the rodless cavity, N/m<sup>2</sup> ; k<sup>q</sup> denotes the flow gain coefficient, m<sup>2</sup> /s, which is defined as:

$$k\_q = C\_d w \sqrt{\frac{2}{\rho} \left(p\_S - p\_1\right)}\,,\tag{6}$$

Assume that the state of the system is x<sup>1</sup> = x, x<sup>2</sup> = ˙x, the dynamic model of the valve-controlled cylinder system can be defined as follows:

$$\begin{cases}
\dot{\mathbf{x}}\_1 = \mathbf{x}\_2 \\
\dot{\mathbf{x}}\_2 = \theta\_1 \mathbf{x}\_1 + \theta\_2 \mathbf{x}\_2 + \mathbf{g}\mu + d \\
\mathbf{x} = \mathbf{x}\_1
\end{cases},\tag{7}$$

where

$$\begin{cases} \theta\_1 = -\frac{k\_t}{m\_l} \\ \theta\_2 = -\left(\frac{A^2 + C\_i B\_P}{m\_l C\_i}\right) \\ g = \frac{k\_q k\_\vartheta A}{m\_l C\_i} \\ d = \frac{\mathcal{N}(\mu\_\vartheta, \sigma\_\vartheta^2)A}{m\_l C\_i} \end{cases} \tag{8}$$

#### 3. PD+FNN CONTROL STRATEGY

#### 3.1. Overview of Control Strategy

**Figure 3** shows the presented PD+FNN control strategy, in which the conventional PD controller works in parallel with the fuzzyneuro controller, as the FNN block shows in **Figure 3**. The PD controller is utilized to not only trace the target value by system error, i.e., e = xpref − xp, but also to provide learning errors to train the FNN online. The FNN is supposed to improve the control accuracy and offset the effects of system interference.

#### 3.2. Fuzzy Neural Network Construction

The proposed FNN consists of two input signals, i.e., x<sup>1</sup> = e and x<sup>2</sup> = ˙e, and one output signal x<sup>f</sup> . Takagi-Sugeno-Kang (TSK) fuzzy model (Lin et al., 2015; Precup et al., 2015) is used in which the antecedent part is the fuzzy set and the consequent part consists of only crisp numbers. The rth rule of a zero-order TSK model with two input variables x<sup>1</sup> and x<sup>2</sup> can be defined as follows:

$$\text{IF } \mathfrak{x}\_1 \text{ is } M\_{1\dot{t}} \text{ and } \mathfrak{x}\_2 \text{ is } M\_{2\dot{\jmath}}, \text{ THEN } f\_{\ddot{\jmath}} = d\_{\ddot{\imath}\dot{\jmath}}\text{ }, \tag{9}$$

where fij is the time-varying parameter of the consequent part. dij is the coefficient of the output function for the rth rule, and M1<sup>i</sup> and M2<sup>j</sup> are fuzzy sets. Therefore, the inputs can be represented as µ1<sup>i</sup> and µ2<sup>j</sup> , respectively. The firing strength of the rth rule is computed as the T-norm (multiplication) of the member functions (MFs) in the antecedent part (Imanberdiyev and Kayacan, 2018):

$$W\_{ij} = \mu\_{1i}(\mathbf{x}\_1)\mu\_{2j}(\mathbf{x}\_2)\,,\tag{10}$$

The output signal of the system can be derived using the normalized values of the firing strength W˜ ij with the following form (Biglarbegian et al., 2010):

$$u\_f = \sum\_{i=1}^{I} \sum\_{j=1}^{J} f\_{ij} \tilde{W}\_{ij} \, , \tag{11}$$

where J and I represent the number of MFs for x<sup>2</sup> and x1, respectively. W˜ ij is expressed as follows:

$$
\tilde{W}\_{\vec{ij}} = \frac{W\_{\vec{ij}}}{\sum\_{i=1}^{I} \sum\_{j=1}^{J} W\_{\vec{ij}}} \,. \tag{12}
$$

Overall control input u to the system is defined as follows:

$$\mathbf{x}\_{\nu} = \mathbf{x}\_{\mathfrak{c}} - \mathbf{x}\_{\mathfrak{f}} \,, \tag{13}$$

where x<sup>c</sup> and x<sup>f</sup> are the control signals produced by the PD controller and the FNN controller, respectively.

#### 3.3. Triangular Fuzzy MFs

In the FNN, the fuzzy MFs play the important role of overcoming environmental interference. These MFs have already shown promising results for control (Khanesar et al., 2015) and identification (Khanesar et al., 2011) purposes. In this work, typical triangular fuzzy MFs are chosen in order to achieve a faster and robust control performance. The MF is defined as follows:

$$\mu(\mathbf{x}) = \begin{cases} \left(1 - \left|\frac{\mathbf{x} - \mathbf{c}}{d}\right|\right) \text{ if } c - d < \mathbf{x} < c + d\\ \mathbf{0} & \text{otherwise} \end{cases},\tag{14}$$

where x is the input, d and c are the width and the center of the MF. The stability proof can be found in Kayacan and Khanesar (2015) according to sliding mode control theory.

## 3.4. Sliding Mode Control Theory-Based Training Approach

In this paper, SMC based parameter update rules are proposed to guarantee the stability of the system and provide favorable robustness. By utilizing the principles of the SMC theory, the zero dynamics of the learning error coordinate xc(t) can be described as a time-varying sliding surface S<sup>c</sup> in the following form:

$$S\_{\mathfrak{c}}(\boldsymbol{\chi}\_{f}, \boldsymbol{\chi}\_{\boldsymbol{\nu}}) = \boldsymbol{\chi}\_{\mathfrak{c}}(t) = \boldsymbol{\chi}\_{f}(t) + \boldsymbol{\chi}\_{\boldsymbol{\nu}}(t) = \boldsymbol{0} \,. \tag{15}$$

If this condition is satisfied, the FNN structure is trained to become the non-linear regulator which assists the parallel controller (in our case PD controller), and the desired performance of the system can be obtained. Therefore, the sliding surface for the non-linear system under control is given by

$$\mathcal{S}\_{\mathbb{P}}(e,\dot{e}) = \dot{e} + \chi e \,, \tag{16}$$

with χ > 0 being a positive parameter which defines the desired trajectory of the error signal.

The time-varying parameter of the consequent part ˙ fij is updated based on the following adaptation law (Kayacan and Khanesar, 2015):

$$\dot{f}\_{\hat{i}\hat{j}} = -\frac{\tilde{W}\_{\hat{i}\hat{j}}}{\prod^{T}\prod^{T}}\alpha\text{sign}(\boldsymbol{\chi}\_{\mathcal{E}}),\tag{17}$$

where

$$\prod = \left(\sum\_{i=1}^{I} \sum\_{j=1}^{J} \tilde{W}\_{\vec{v}}\right),\tag{18}$$

the learning rate α > 0 is updated based on the following equation:

$$
\dot{\alpha} = \left| \chi\_{\mathfrak{c}} \right|, \tag{19}
$$

The adaptation law for the premise part is given as follows (Kayacan and Khanesar, 2015):

$$\dot{c}\_{1i} = -\gamma\_1 \left| d\_{1i} \right| \left( 1 - T\_{1i} \right) \text{sgn} \left( \mathbf{x}\_1 - c\_{1i} \right) \times H \left( \mathbf{x}\_1, c\_1 - d\_1, c\_1 + d\_1 \right),$$

$$\dot{d}\_{1i} = -\gamma\_1 \frac{(1 - T\_{1i}) \left| d\_{1i} \right|}{\left| \chi\_1 - c\_{1i} \right|} \text{sgn} \left( d\_{1i} \right) \times H \left( \chi\_1, c\_1 - d\_1, c\_1 + d\_1 \right), \tag{21}$$

$$\dot{c}\_{2i} = -\gamma\_1 \left| d\_{2j} \right| \left( 1 - T\_{2j} \right) \text{sgn} \left( \mathbf{x}\_2 - c\_{2i} \right) \times H \left( \mathbf{x}\_2, c\_2 - d\_2, c\_2 + d\_2 \right), \tag{22}$$

$$\dot{d}\_{2j} = -\gamma\_1 \frac{\left(1 - T\_{2j}\right) \left|d\_{2j}\right|}{\left|\mathbf{x}\_1 - c\_{2j}\right|} \text{sgn}\left(d\_{2j}\right) \times H\left(\mathbf{x}\_2, c\_2 - d\_2, c\_2 + d\_2\right),\tag{23}$$

where

$$H(\mathbf{x}, \mathbf{c}, d) = \begin{cases} \mathbf{x} \text{ if } \mathbf{c} - d < \mathbf{x} < \mathbf{c} + d\\ \mathbf{0} \text{ otherwise} \end{cases},\tag{24}$$

$$\begin{cases} T\_{1,i} = \begin{vmatrix} \frac{x\_1 - c\_i}{d\_i} \end{vmatrix} \\ T\_{2,i} = \begin{vmatrix} \frac{x\_2 - c\_i}{d\_i} \end{vmatrix} \end{cases} \tag{25}$$

For the γ1, it needs to be selected as positive (Kayacan and Khanesar, 2015).

#### 4. SIMULATION AND RESULTS ANALYSIS

#### 4.1. Simulation Parameter

The control gains for the PD controller are chosen as follows: k<sup>p</sup> = 10, k<sup>d</sup> = 5.

#### 4.2. Simulation Results

**Table 1** shows the simulation parameters. **Figures 4**–**7** shows the simulation results without noise. The adaptive learning capabilities of the PD+FNN structure can provide superior performance in different conditions. It is able to solve limitations such as the lack of modeling and existing uncertainties in the environment and is therefore more suitable for real-time applications.

As seen from **Figure 4**, the PD+FNN controller has a faster and more stable response performance. **Figure 5** shows that the controller has a better adaptive learning property to lessen the


error gradually. As shown in **Figure 6**, although the PD controller ensures the error signal is bounded in the neighborhood of zero, significant steady-state errors that occur from internal or external interferences cannot be eliminated. Compared to the PD controller, the PD+FNN controller eliminates the steady-state error.

**Figure 7** shows curves of the overall signal (which is defined as x<sup>v</sup> = x<sup>c</sup> − x<sup>f</sup> ), the output of FNN (x<sup>f</sup> ), and the output of the conventional PD controller (xc). As seen in **Figure 7**, the overall control signal is close to the conventional PD controller at the beginning of the simulation, then the FNN learns the dynamics of the system and takes responsibility for the system. Simultaneously, the output of the PD controller tends to go to zero.

As discussed in the section 2.2, the accumulator energy release process can be regarded as interference according to Gaussian distribution. In order to create different noise levels, four different mean values of the accumulator output flow are chosen, i.e., µ<sup>v</sup> = {100, 150, 200, 300} [L/min]. For the standard deviation, it is selected as σ<sup>v</sup> = 70 L/min.

**Figure 8** shows the manipulator position errors under PD controller with the noise level from 0.12 to 0.24 m.

**Figure 8** shows that the PD controller cannot handle steady-state errors that occur from internal or external interferences. However, the PD+FNN controller can eliminate steady-state errors through an adaptive learning algorithm, and can be used in a real time control to cope with noisy measurements and uncertainties in the system more effectively.

**Figure 9** shows the output signals of PD+FNN in different levels of noise. **Figure 9**, shows that the overall control signal is determined by the FNN controller, and the output signal from PD tends to be close to zero. Therefore, the FNN learns the

FIGURE 8 | Euclidean error in different levels of noise, (A) is µv = 100 L/min, σv = 70 L/min, (B) is µv = 150 L/min, σv = 70 L/min, (C) is µv = 200 L/min, σv = 70 L/min, (D) is µv = 300 L/min, σv = 70 L/min.

Xu et al. Robust Learning Control for Manipulator

dynamics of the system and ultimately takes responsibility for the system.

## 5. EXPERIMENTAL VERIFICATION

## 5.1. Introduction of Experimental Equipment

The crane experimental model was placed on a swaying table to simulate the sea state. The relationship between each part of manipulators is shown in **Figure 10**.

The sensor and controller parameters are shown in **Table 2**.

#### 5.2. Analysis of Step Signal Response

The test results are shown in **Figure 11** with the step extension signal.

The release process of the accumulator is shown in **Figure 11**. When the cylinder is extended, the accumulator releases the oil non-linearly. The cylinder protrudes quickly, and the extension time is about 0.25 s. The color block diagram on the right side of **Figure 11** shows that the energy release rate of the accumulator is significantly correlated to the rodless chamber pressure and the cylinder extension speed.

#### 5.3. Experimental Strategy and Data Analysis

The swaying table was controlled with a sinusoidal signal in the frequency of 0.11Hz and an amplitude of ±7 ◦ . The comparison experiment of automatic wave compensation was performed using a PD algorithm and PD+FNN to control the angular at 0◦ .

**Figure 12** shows the wave compensation results of the conventional PD controller. During the wave compensation process, the cylinder pressure is stable, and the fluctuation range is approximately 90–105 Bar. However, the expansion and contraction movement of the cylinder is irregular due to the vicious cycle caused by the accumulator intervening. TABLE 2 | Parameters of components.


The performance of the cylinder control varies greatly, and the delay is severe. The wave compensation angle error is ±3 ◦ . **Figure 13** shows that the accumulator flow output is irregular.

**Figure 14** shows the experimental results of the PD+FNN controller. The optimization strategy effectively solves the speed shock caused by the accumulator. The cylinder control signal is consistent with the motion of the swinging table, which largely reduces the vibration of the cylinder. The wave compensation effect has been improved, and the angle compensation deviation is stable at ±1 ◦ .

**Figure 15** shows the highest value of the accumulator flow output concentrated in the negative angle of the cylinder (cylinder extension).

## 6. CONCLUSION

In this work, a novel control strategy, i.e., PD+FNN approach, is designed to control a shipborne manipulator. Specifically, it is able to handle the high-order non-linearity, time-varying and hysteresis characteristics of the valve-controlled cylinder under the intervention of the accumulator. In addition, it can solve the overshoot generated by the wave compensation process when the accumulator releases energy and the cylinder reacts quickly in the extended stage. Moreover, the presented control strategy is capable of solving the problem of pressure fluctuation. The control precision is improved compared to using a conventional PD controller. Qualitative and quantitative tests on the simulation and real experiments have shown that the proposed controller is capable of significantly reducing

steady state-errors and in overcoming the disturbances caused by the accumulator and uncertainties. The deviation angle of compensation is ±1 ◦ instead of ±3 ◦ compared to the conventional PD controller. We believe that the results of this work will motivate a wider use of the proposed PD+FNN approach, for autonomous collaboration of marine vehicles with a shipborne manipulator.

## DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

## AUTHOR CONTRIBUTIONS

ZX and WL conceived and designed the experiments. YW performed the experiments. ZX and YW analyzed the data and wrote the paper.

## REFERENCES


Engineering (ICMSC), 2017 International Conference on, Saint Petersburg, 250–254.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Li and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Mobile Robot Path Planning Based on Ant Colony Algorithm With A<sup>∗</sup> Heuristic Method

#### Xiaolin Dai 1,2, Shuai Long<sup>1</sup> , Zhiwen Zhang<sup>1</sup> and Dawei Gong1,2 \*

*<sup>1</sup> School of Mechatronics Engineering, University of Electronic Science and Technology of China, Chengdu, China, <sup>2</sup> Center of Robot, University of Electronic Science and Technology of China, Chengdu, China*

This paper proposes an improved ant colony algorithm to achieve efficient searching capabilities of path planning in complicated maps for mobile robot. The improved ant colony algorithm uses the characteristics of A<sup>∗</sup> algorithm and MAX-MIN Ant system. Firstly, the grid environment model is constructed. The evaluation function of A<sup>∗</sup> algorithm and the bending suppression operator are introduced to improve the heuristic information of the Ant colony algorithm, which can accelerate the convergence speed and increase the smoothness of the global path. Secondly, the retraction mechanism is introduced to solve the deadlock problem. Then the MAX-MIN ant system is transformed into local diffusion pheromone and only the best solution from iteration trials can be added to pheromone update. And, strengths of the pheromone trails are effectively limited for avoiding premature convergence of search. This gives an effective improvement and high performance to ACO in complex tunnel, trough and baffle maps and gives a better result as compare to traditional versions of ACO. The simulation results show that the improved ant colony algorithm is more effective and faster.

#### Edited by:

*Caixia Cai, Agency for Science, Technology and Research (A*∗*STAR), Singapore*

#### Reviewed by:

*Wenchao Gao, Institute for Infocomm Research (A*∗*STAR), Singapore Ran Duan, Hong Kong Polytechnic University, Hong Kong Bonan Huang, Northeastern University, China*

#### \*Correspondence:

*Dawei Gong pzhzhx@126.com*

Received: *20 December 2018* Accepted: *25 March 2019* Published: *16 April 2019*

#### Citation:

*Dai X, Long S, Zhang Z and Gong D (2019) Mobile Robot Path Planning Based on Ant Colony Algorithm With A* <sup>∗</sup> *Heuristic Method. Front. Neurorobot. 13:15. doi: 10.3389/fnbot.2019.00015* Keywords: path planning, ant colony algorithm, A<sup>∗</sup> algorithm, bending suppression, retraction mechanism

## INTRODUCTION

Path planning is a key issue in the field of mobile robot research. Its main purpose is to find an optimal or suboptimal, safe and collision-free path from the starting point to the target point in the environment with obstacle (Cheng et al., 2010; Deepak et al., 2012; Zhou et al., 2013). According to the degree of intelligence in the process of path planning, mobile robot path planning can be divided into traditional path planning and intelligent path planning. The traditional path planning algorithm includes simulated annealing algorithm (Miao and Tian, 2013), potential function theory (Cetin and Yilmaz, 2014; Nair et al., 2015), fuzzy logic algorithm (Li et al., 2013; Jiang and Li, 2014; Bakdi et al., 2016) and so on. However, these traditional methods can't be further improved in path search efficiency and path optimization. Intelligent path planning algorithm includes Ant Colony Optimization (ACO) (Jovanovic et al., 2016; Wang et al., 2016), genetic algorithm (Arantes et al., 2017; Lin et al., 2017), neural network (He et al., 2016a, 2017a,b) and particle swarm algorithm (Das et al., 2016; Song et al., 2016) and so on. The ant colony algorithm has the advantages of strong robustness, good global optimization ability and inherent parallelism. Moreover, it easily combines with multiple heuristic algorithms to improve the performance of algorithms. So it is widely used in path planning.

However, due to the randomness of probabilistic transfer and the inappropriateness of pheromone intensity update, the traditional ACO will easily fall into the local optimum and tend to poor convergence. To this end, many scholars delivered a variety of improved methods to solve problems regarding pheromone update and path search strategy (Stützle and Hoos, 2000; Zeng et al., 2016; Zhao et al., 2016; Zhang et al., 2017). In Stützle and Hoos (2000), an Ant Colony System (ACS) algorithm was proposed to speed up the convergence rate of ACO by updating pheromones on the path of the optimal ant of each generation. In Zhao et al. (2016), by adaptively changing the volatilization rate and adjusting the pheromone updating formula, the search ability of the ant colony and the convergence rate of the algorithm were improved. In Zhao et al. (2016), some intelligent algorithm was proposed to generate an initial path, which can be transformed into the initial pheromone distribution to avoid blind search of ant colony. In Zhang et al. (2017), the path information (such as the crowded path and the steep path weight) was added into the initial pheromone matrix, which could affects the efficiency of the algorithm. In Zhao et al. (2016), the heuristic function was adjusted to improve the convergence rate of the algorithm according to the target point. In Zeng et al. (2016), it unlimited step length of finding optimal path so that the improved ACO could find a shorter path and its convergence was better. In addition, many scholars have combined the ant colony algorithm with other (intelligent) algorithms (He et al., 2016b; Liu et al., 2016; Yen and Cheng, 2016; He and Zhang, 2017) to improve the convergence rate and the smooth of path. In Liu et al. (2016), the geometric method was used to optimize path. Also in Liu et al. (2016), the force factor in the artificial potential field method is transformed into local diffusion pheromone to improve the ability of the ant colony algorithm to find the obstacle. In Yen and Cheng (2016), the fuzzy ant colony optimization method was proposed to minimize the iterative learning error.

In this paper, an effective version of ant colony algorithm is achieved. It utilizes the evaluation function of A<sup>∗</sup> algorithm to improve the heuristic information of Ant colony algorithm, which accelerates the convergence speed during the search. And MAX–MIN Ant System is used to make the global search ability better by updating the path pheromone of the optimal network. At the same time, the bending suppression operator is introduced to improve heuristic information, which aims to optimize the smoothness of the path. The problem of deadlock is solved by using the retraction mechanism. All these procedures not only give an effective improvement and better performance to ACO, but also give the best results as compare to traditional versions of the algorithm (Zhao et al., 2016) and ACO in complex tunnel, trough and baffle maps. The simulation results show that the proposed algorithm is effective and fast.

#### MATERIALS AND METHODS

#### Environment Model

The work environment is built by using the grid model, which divides the robot working space into N <sup>∗</sup>N squares. As shown in **Figure 1**, the gray grids are represented as obstacles (the grid with barriers) and the white grids are represented as free grid squares

(the robot can move). In order to identify obstacles, the white grid cell is represented by 0 and the gray grid unit is represented by 1. The grid method is simple and effective to create and maintain grid model. Moreover, the grid method have strong adaptability for obstacle. This method is convenient for computer storage and processing.

The grid model was placed into two-dimensional coordinate system. And then serial number method is adopted to mark each grid. In N <sup>∗</sup>N grid map, the starting node is named after Start and the target node is named after Goal. The position coordinates x, y corresponding to any grid whose grid number is R as follow:

$$\begin{cases} \chi = \begin{cases} \bmod(R,N) - 0.5 & \text{if } mod \,(R,N) \text{!} = 0\\ N + mod \,(R,N) - 0.5 & \text{otherwise} \end{cases} \end{cases} \tag{1}$$

Where mod is the surplus operation, ceil rounds the elements to the nearest integers toward infinity.

#### Ant Colony Algorithm

Heuristic Strategy With Direction Information

In the traditional ACO, the probability of the next node is selected by roulette wheel method as follows:

$$\begin{cases} P\_{\vec{\eta}}^{k}(t) = \begin{cases} \frac{\left(\tau\_{\vec{\eta}}(t)\right)^{a} \cdot \left(\eta\_{\vec{\eta}}(t)\right)^{\beta}}{\sum\_{i \in \mathit{allow}\_{k}} \left(\tau\_{\vec{\eta}}(t)\right)^{a} \cdot \left(\eta\_{\vec{\eta}}(t)\right)^{\beta}} & s \in \mathit{allow}\_{k} \\ 0 & s \notin \mathit{allow}\_{k} \end{cases} \tag{2}$$
 
$$\begin{aligned} \eta\_{\vec{\eta}}(t) &= \frac{1}{d\_{\vec{\eta}}} \\ d\_{\vec{\eta}} &= \sqrt{\left(\chi\_{\vec{\jmath}} - \chi\_{\vec{\imath}}\right)^{2} + \left(\chi\_{\vec{\jmath}} - \chi\_{\vec{\imath}}\right)^{2}} \end{aligned} $$

Where τij is the pheromone trail of the path grid i to grid j, and ηij is the heuristic information of the path grid i to grid j. α is the stimulating factor of pheromone concentration which determine the relative influence of the pheromone trail. β is the stimulating factor of visibility which determine the relative influence of the heuristic information. dij is the distance between node i and node j. (x<sup>i</sup> , yi) and (x<sup>j</sup> , yj) is the coordinates of grid i and grid j. allow<sup>k</sup> is the collection of grids which ants can choose when ants in the grid i (in other words, they are the grids around the grid i except the obstacle grid and taboo grid).

#### Coverage and Updating Strategy

According to the traditional ACO, the next node is decided by the roulette wheel method and it is repeated until the target point is obtained. After each iteration is completed, pheromone trails are updated in line with the length of path planning. For each trial during pheromone update, all imperfect pheromones evaporates and only the best pheromones are updated to trials history, because it enables ants to neglect all substandard pheromone trails and improve its coverage efficiency to find a shorter path. Formula (3) is used to update the pheromone quantity on each vertex at the end of each cycle:

$$\begin{cases} \mathsf{r}\_{\vec{l}\vec{j}} &= (1 - \rho)\,\mathsf{r}\_{\vec{l}\vec{j}} + \Delta\mathsf{r}\_{\vec{l}\vec{j}}\\ \Delta\mathsf{r}\_{\vec{l}\vec{j}} = \sum\_{k=1}^{m} \Delta\mathsf{r}\_{\vec{l}\vec{j}}^{k} & , \ 0 < \rho < 1 \end{cases} \tag{3}$$

where m is the number of ants. ρ is pheromone evaporation rate. 1τ <sup>k</sup> ij represents the value of pheromone that the ant k leaves in the path of grid i to grid j. This article uses the ant-cycle-system model, and 1τ <sup>k</sup> ij is defined as follows:

$$
\Delta \pi\_{ij}^k(k) = \begin{cases}
\begin{array}{l}
\text{Q}\_1/L\_k\left(t\right) \text{ if } \arcsin\left(i,j\right) \text{ is used by k in iteration t} \\
0 \text{ otherwise}
\end{array}
\tag{4}
$$

Where Q<sup>1</sup> is a constant. L<sup>K</sup> (t) is the length of the path that the ant k is looking for.

#### Improved Ant Colony Algorithm

The traditional ACO has the following shortcomings: Due to the lack of initial pheromone and the unapparent difference of the heuristic value between adjacent grids, it usually requires a longer search time, which leads to the slow convergence rate. When grid model is complex, the robot maybe fall into a deadlock state in which the robot cannot move to the surrounding grids. In the grid map, the path planning with traditional ACO may have more bending times and big cumulative bending angle. To solve the above problems, this paper makes the following improvements.

#### Heuristic Information Based on A<sup>∗</sup> Algorithm

A ∗ algorithm (Duchon et al., 2014) is the most effective direct search method for solving the shortest path in static road network. It is developed on the basis of Dijkstra algorithm, which can avoid blind search to improve search efficiency. In this paper, A<sup>∗</sup> algorithm is used as the heuristic information of path searching to improve the convergence speed of the algorithm and obtain the better path. The bending suppression operator is added to the heuristic information to reduce bending times and cumulative bending angle.

The heuristic cost of A<sup>∗</sup> algorithm is expressed by the estimated function, and the estimated function equation f (n) is as follows:

$$f\left(n\right) = g\left(n\right) + h\left(n\right)\tag{5}$$

$$h\left(n\right) = \left(\left(n\_{\rm x} - \mathrm{g}\_{\rm x}\right)^{2} + \left(n\_{\rm y} - \mathrm{g}\_{\rm y}\right)^{2}\right)^{1/2}$$

$$\mathrm{g}\left(n\right) = \left(\left(n\_{\rm x} - \mathrm{s}\_{\rm x}\right)^{2} + \left(n\_{\rm y} - \mathrm{s}\_{\rm y}\right)^{2}\right)^{1/2}$$

where g(n) is the minimum cost from the source node to the current node. h(n) is the minimum cost from the current node to the destination node. n<sup>x</sup> and n<sup>y</sup> are the coordinates of the current node n . g<sup>x</sup> and g<sup>y</sup> are the coordinates of the target node g, sx, and s<sup>y</sup> are the coordinates of the initial node s.

The estimated function of A<sup>∗</sup> algorithm is used as heuristic information to search for global optimal path in ant colony algorithm, and the bending suppression operator is added to the heuristic value of ant colony algorithm to reduce the number of bending times and the large cumulative turning angle. The improved heuristic information formula is as follows:

$$\eta\_{ij}(t) = \frac{Q\_2}{h(n) + \mathcal{g}(n) + cost(bend)} \tag{6}$$

$$cost \, (bend) = \boldsymbol{\varphi} \cdot turn + \boldsymbol{\psi} \cdot thita$$

where Q<sup>2</sup> is a constant more than 1. cost(bend) is a bending suppression operator. turn is the number of turns from node n − 1(previous node) to node n + 1 (next node). thita is the angle between the line segment of node n − 1 to node n (current node) and the line segment of node n to node n + 1. ϕ is the coefficient converting turning times into grid length. ψ is the coefficient converting angle into grid length.

#### Solve the Deadlock Problem

When the robot environment is more complex (especially the ants go into the environment of concave obstacles), due to the presence of the taboo table, the ants may fall into a deadlock state without the next grid to move. As shown in **Figure 2**, when the ant travels from the grid T to the grid S, the next optional grid is C. At this time, the ant is trapped in a deadlock state and it cannot move to its surrounding grid.

For the deadlock problem, Wang and Yu (2011) adopted the early death strategy, which deleted the ants trapped in a deadlock state from the ant colony and did not update the global pheromone. However, when more of the ants are trapped in the deadlock state, the number of ants that can reach the goal is significantly reduced, which results in a decrease in the diversity of solutions and is not conducive to the search of optimal path for ants. In this paper, the improvement measure is that the ants adopt retraction mechanism when they fall into the deadlock state. As shown in **Figure 2**, the ant, which has walked into the grid, is trapped in the deadlock state, and the improved strategy allows the ant to roll back one step and updates the taboo table information. If the ant is still trapped into a deadlock state, the ant will continue to rollback untill grid T. At this moment, the ant escapes the deadlock area. Since the deadlock

edge may be the part of global optimal path, no pheromone punishment is carried out on the deadlock edge. The retraction mechanism cannot prevent ants from entering a deadlock state, but it lets the deadlocked ants return back to the previous grid until there is a feasible grid around the ants, so the ACO with the retraction mechanism has higher efficiency and fewer iterations. The ACO with the retraction mechanism and without the retraction mechanism is compared in section The Retraction Mechanism Results Analysis below.

#### Max–Min Ant System

As the traditional ant colony algorithm may cause premature convergence and precocious phenomenon, it needs to improve algorithm to solve these problem. The MAX-MIN Ant System (MMAS) (Stützle and Hoos, 2000) can solve these problems well.

(1) Pheromone trail updating. After each iteration trial, the pheromone is submitted into update history in traditional ant colony algorithm. While in the MMAS, only the path pheromone of the optimal network is updated after the iteration is completed. Accordingly, the modified pheromone trail update rule is stated by:

$$\tau\_{\vec{ij}}(t+1) = (1 - \rho)\tau\_{\vec{ij}}(t) + \Delta\tau\_{\vec{ij}}^{best} \tag{7}$$

$$\Delta\tau\_{\vec{ij}}^k(t) = \frac{Q\_1}{L^{best}} + \frac{Q\_3}{C\_{turn}^{best}}$$

$$C\_{turn}^{best} = \omega\_1 \text{Cals}\left(l\right) + \omega\_2 \text{Turns}\left(l\right)$$

$$\begin{aligned} \omega\_1 &= \frac{V\_{robot}}{W\_{robot}}\\ \omega\_2 &= V\_{robot} \times t\_a \end{aligned}$$

where Q<sup>3</sup> is a constant more than 1.L best denotes to the shortest path currently found by the algorithm. Cals(l) represents the sum of all the angles of turning on the best optimized path. Turns(l) is the sum of the turns on the best optimized path. w<sup>1</sup> and w<sup>2</sup>

represent different weight coefficient and are set by analyzing the robot's structure and kinematics (Wu et al., 2013; Li et al., 2017). The w<sup>1</sup> and w<sup>2</sup> can convert turning angle and turning times into grid length, respectively. Vrobot represents the constant speed of a mobile robot. Wrobot represents the angular speed of a mobile robot as it turns. t<sup>a</sup> represents the time of acceleration and deceleration as the mobile robot turns once.

(2) Pheromone trail limits. In order to avoid the situation that the traditional ant colony algorithm may falls into local optimal solution and loses the further search space ability by pheromone accumulation, the pheromone trail of the MMAS is limited in the upper limits and lower limits - Taumin,Taumax, . The formula is:

$$\text{Tau} = \begin{cases} \text{Tau}\_{min}, & \text{Tau} \le \text{Tau}\_{min} \\ \text{Tau}, & \text{Tau}\_{min} < \text{Tau} \le \text{Tau}\_{max} \\ \text{Tau}\_{max}, & \text{Tau} > \text{Tau}\_{min} \end{cases} \tag{8}$$

#### Aco Procedure

To sum up, specific steps of mobile robot path planning based on the improved ant colony algorithm are as follows:

Step 1: The working environment is modeled by the grid method, and the starting point start and the target point goal of the mobile robot are given.

Step 2: Initialize the ant system. Set the number m of ants, parameter α which determines the relative influence of the pheromone trail, parameter β which determines the heuristic value, the global pheromone volatilization coefficient ρ, pheromone intensity Q<sup>1</sup> and other related parameters.

Step 3: Update taboo table. Place the ant k (k = 1, 2, · · · , m) on the current node and add the current node to the corresponding taboo table.

Step 4: Process deadlock. According to the taboo table, it will judge whether ants are trapped in a deadlock state. If the ants are in a deadlock state, the retraction mechanism will be adopted and the deadlock node will be added to the taboo table. Conversely, it will judge whether the ants reach the target point. If the ants reach the target point, it will turn to Step 6, otherwise it will turn to Step 5.

Step 5: Select the next grid. It will calculate the heuristic function according to formula (6), and calculate the probability function according to formula (2). Finally, it will use the roulette method to select the next feasible grid. If the ants reach the target grid, it will turn to Step 6, otherwise it will turn to Step 3.

Step 6: If the ants reach the target node, it will repeat Step 3 until each ant completes the search target during its iteration process and then turn to Step 7.

Step 7: Update pheromone. After each iteration, if the number of iterations satisfies inequality N ≤ Nmax, it will update the path pheromone and determine whether it meets the convergence conditions. If it meets the convergence conditions, it will withdraw. If it does not meet, it will turn to Step 3. If the number of iterations satisfies inequality N > Nmax, it will be not counted further. The final result is output as long as the end condition is satisfied.

The implementation process of improved ant colony algorithm is as in **Table 1**.

## RESULTS

In order to verify the effectiveness of the improved ant colony algorithm, this paper uses MATLAB software to simulate. It is more convincing to use comparative method to carry out experiments under the same experimental conditions. In the simulation, the main parameter values of the ACO should be determined firstly. The main parameters include number of ants, stimulating factor of pheromone concentration, stimulating factor of visibility and pheromone evaporation coefficient. Through parameter analysis method (Wu et al., 2010), the relationship between each parameter and simulation results (path length, number of iterations) can be obtained. According to the relationship between each parameter and simulation results (Shi et al., 2014), we can get the value of the main parameters in the ACO. In the simulation, value of each parameter in the ACO is as in **Table 2**:

## Comparative Analysis of Path Planning Algorithms

The experiment was divided into four parts according to four types of maps(the common map, the tunnel map, the trough map and the baffle map) and three algorithms (the traditional ant colony algorithm, the algorithm (Zhao et al., 2016) and

TABLE 1 | Description of ACO algorithm for solving path planning.

#### Algorithm A\*MMAS Begin

create grid environment initialize the ant colony system

#### Repeat

for each ant k do

if grid *i* ∈ *allowk* then

if grid *i* ∈ *taboodeadlock* then

fallback

end if

according to formula (2) and (6) select next grid j

#### Update taboo end if

Update pheromone on each iteration by improved MMAS method according to formula (7) and (8)

Until algorithm convergence

Return best grid serial number

END

TABLE 2 | Values of the main parameters in the ACO.


the improved ant colony algorithm proposed in this paper) are simulated on each map in turn**.** The convergence speed, shortest path length and bending suppression effect of those algorithms are compared.


As shown in **Figure 3**, the optimized path length of the improved ant colony algorithm is 29.2133 and the number of bending times is 6. The improved ant colony algorithm is basically as same as the path planning effect of the ant colony algorithm (Zhao et al., 2016) on the path length, but it is 25% lower on bending times than the ant colony algorithm (Zhao et al., 2016). Compared with the traditional ant colony algorithm, it is 73% reduction in the number of bending times.

As shown in **Figure 4**, the optimized length of the improved ant colony algorithm is 37.3849, and the number of bending times is 7. In the shortest path length, the improved ant colony algorithm is basically as same as the algorithm (Zhao et al., 2016). In the number of bending times, it is 50% decrease than the traditional ant colony algorithm and is 22% decrease than the algorithm (Zhao et al., 2016).

As shown in **Figure 5**, the optimized path length of the improved ant colony algorithm is 51.1128. In **Figure 6**, the optimized path length of the improved ant colony algorithm is 50.7280. But in **Figures 5**, **6**, both the traditional ant colony algorithm and the algorithm (Zhao et al., 2016) can't search the global optimized path. Even as the scale of the problem expands and the environment map becomes more and more complex, the improved algorithm can still perform very well.

The results of the three algorithms that run 100 times in same map environments are shown in **Table 3**. Compared with the traditional ant colony algorithm and the algorithm (Zhao et al., 2016), the improved algorithm has a good performance on the efficiency. At the same time, it has a good adaptability in a complicated area. The improved algorithm proposed in this paper can be used not only in the path planning of mobile robots, but also in the path planning of robot manipulators (Yang et al., 2017, 2018).

## The Retraction Mechanism Results Analysis

In order to show the function of retraction mechanism, the ACO with the retraction mechanism and ACO without the retraction mechanism are tested on the trough map and the baffle map, respectively.

As shown in **Figures 7A**, **8A**, ACO with the retraction mechanism has higher efficiency and fewer iteration than ACO

TABLE 3 | Test results for three algorithms under different maps.


➀*: The traditional ant colony algorithm.*

➁*: The algorithm [20].* ➂*: the improved ant colony algorithm.*

without the retraction mechanism. When ants fall into deadlock state, the retraction mechanism is used to replace the early death strategy, which avoids a large number of ant deaths in one iteration. Therefore, each ant can obtain a path by using the retraction mechanism, which increases the diversity of results and is beneficial to find the optimal path. As shown in **Figures 7B**, **8B**, the number of ant retracted in the initial stage of the algorithm is higher than in the middle and later stage of the algorithm and the retraction mechanism can effectively suppress the decline of the number of ants.

### DISCUSSION

This paper makes a valuable contribution to the improvement of ant colony algorithm in complicated maps for the mobile robot, especially the improvement on convergence speed, shortest path length and bending suppression effect. The estimated function of improved A<sup>∗</sup> algorithm is used as the heuristic function to improve search efficiency and smoothness of path. By employing the retraction mechanism and the improved MAX–MIN Ant System method, the problem of ant deadlock is solved and the global search ability of the algorithm is improved.

Three algorithms are researched on path planning in the common map, tunnel map, trough map and baffle map, respectively. Compared with the traditional ant colony algorithm and the algorithm (Zhao et al., 2016), the improved ant colony algorithm is better in the convergence rate and the bending suppression effect. Compared with the traditional ant colony algorithm, the improved ant colony algorithm has more than 65% reduction in number of iterations and 41% decrease in bending suppression. In addition, the improved ant colony algorithm is 54% lower than the algorithm (Zhao et al., 2016) in number of iterations. To sum up, this paper proves the effectiveness, rapidity and adaptability of the improved ant colony algorithm in the complex map environment.

## AUTHOR CONTRIBUTIONS

XD and DG proposed the innovation and designed the experiment in this study. ZZ and SL performed the simulation experiment and analyzed the experiment results. XD checked the

results. SL wrote the manuscript and DG provided writing advice of manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (61603076) (51305066), Fundamental Research Funds for the Central Universities (ZYGX2016J116),

#### REFERENCES


and Science and Technology Support Program of Sichuan Province (2016GZ0198).

## ACKNOWLEDGMENTS

We acknowledge the support of the School of Mechatronics Engineering and Center of Robot in University of Electronic Science and Technology of China.


algorithm for multi-robot path planning. Swarm Evol. Comput. 28, 14-28. doi: 10.1016/j.swevo.2015.10.011


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Dai, Long, Zhang and Gong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Neural Network Based Uncertainty Prediction for Autonomous Vehicle Application

Feihu Zhang<sup>1</sup> \*, Clara Marina Martinez <sup>2</sup> , Daniel Clarke<sup>3</sup> , Dongpu Cao<sup>4</sup> and Alois Knoll <sup>5</sup>

*<sup>1</sup> School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China, <sup>2</sup> Porsche Engineering Services GmbH, Bietigheim-Bissingen, Germany, <sup>3</sup> Cogsense Technologies Limited, London, United Kingdom, <sup>4</sup> Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON, Canada, <sup>5</sup> Department of Informatics, Technical University of Munich, Munich, Germany*

This paper proposes a framework for uncertainty prediction in complex fusion networks, where signals become available sporadically. Assuming there is no information of the sensor characteristics available, a surrogated model of the sensor uncertainty is yielded directly from data through artificial neural networks. The strategy developed is applied to autonomous vehicle localization through odometry sensors (speed and orientation), so as to determine the location uncertainty in the trajectory. The results obtained allow for fusion of autonomous vehicle location measurements, and effective correction of the accumulated odometry error in most scenarios. The neural networks applicability and generalization capacity are proven, evidencing the suitability of the presented methodology for uncertainty estimation in non-linear and intractable processes.

#### Edited by:

*Changhong Fu, Tongji University, China*

#### Reviewed by:

*Liang Lu, Polytechnic University of Madrid, Spain Chao Zhang, Shanghai Jiao Tong University, China Hui Xie, Western Sydney University, Australia*

#### \*Correspondence: *Feihu Zhang feihu.zhang@nwpu.edu.cn*

Received: *08 October 2018* Accepted: *18 March 2019* Published: *10 May 2019*

#### Citation:

*Zhang F, Martinez CM, Clarke D, Cao D and Knoll A (2019) Neural Network Based Uncertainty Prediction for Autonomous Vehicle Application. Front. Neurorobot. 13:12. doi: 10.3389/fnbot.2019.00012* Keywords: neural network, autonomous driving, uncertainty prediction, localization, odometry

## 1. INTRODUCTION

Mobility has become a serious challenge in a society with a gradually aging population and a perpetually increasing traffic. This situation is motivating the automotive industry and governments to invest heavily in highly automatized vehicles toward full autonomy. Nevertheless, the progress in autonomous vehicle integration is being hindered from an engineering perspective, due to limits in vehicle sensing and infrastructure modernization requirements (Ma et al., 2018; Taeihagh and Lim, 2019). Moreover, considerable advances in electronics and control theory, safety and robust autonomous driving can only be achieved in conditions that vehicles are fully aware of the driving scenario (Han et al., 2012; Li et al., 2014).

High quality measurements can be obtained using expensive sensors (Elfring et al., 2016), as installed in the well-known Google Car. This vehicle includes an advanced laser range finder, between other sensors, able to process the environment real time (Poczter and Jankovic, 2014). Nonetheless, these advanced devices are generally associated with elevated cost, and therefore are not feasible for serial production vehicles. An alternative solution is to compensate measurements quality with higher redundancy by installing larger number of low cost devices based on different technologies. As a consequence, the features perception becomes a complex problems where heterogeneous signals need to be registered, transformed into a common level and conveniently combined to guarantee safety (Jiang et al., 2011). This process is known as data fusion and usually involves noisy measurements and highly non-linear transformations.

Data fusion can be executed in either centralized or decentralized architectures. Whilst the first involves a common processor, and decentralized architectures consist of networks where each sensor has its own processing unit (Grime and Durrant-Whyte, 1994; Durrant-Whyte et al., 2001; Garcia-Ligero et al., 2012). On the one hand, centralized architectures need to be re-designed when changes in the sensing units take place, which implies costly and time-consuming development. On the other hand, decentralized solutions are particularly convenient in networks where sensors can be dynamically added and removed from the networks as a result of being sporadically available. Nevertheless, although a plurality of measurements might become accessible in decentralized architectures, fusion requires knowledge of the uncertainty associated to them. Furthermore, measurements are not used directly, but information extracted from them hereby referred to as features, which generally involves nonlinear transformations. Consequently, the conversion of sensor noise into feature noise is a complex task that usually involves arduous mathematical derivations and can be intractable in many applications.

Several attempts to uncertainty prediction for inertial measurement units (IMU) have been presented in the literature. These include methodologies to fuse odometric measurements with global positioning system measurements, geographical information systems and laser scanners, between others. Vision systems are used in Park et al. (2012) and the disparity between the image space and the Cartesian space is used to derive the uncertainty mathematical model. Vision based controlled is used by Fu et al. (2018) in a system based on an onboard camera and an IMU. The authors use a non-singleton fuzzy logic controller able to handle high uncertainties. The Kalman filter has been also widely used to deal with noisy measurement and models. The parameters estimation was performed using methods such as random walk, Gaussian-Markov and autoregressive processes (El-Diasty and Pagiatakis, 2008). Extended Kalman filters have been proposed by other authors such as Bry et al. (2012) and Fabrizi et al. (2000), where the noise assumption is taken using Gaussian white noise.

This paper presents a solution to facilitate data fusion in decentralized architectures. The proposed paper enhances our previous system for feature extraction (Martinez et al., 2017), where either the source of information, or the sensor noise is unknown. These networks require an appropriate estimation of the signals uncertainty so as to properly fuse them into an "improved measurement," rather than worsen the fusion output. The uncertainty allows evaluating the quality of the signals and provides a combined result of higher accuracy where the information retained is maximized.

Despite its importance, in the literature revised sensor noise is either assumed to be known, or fitted with simple Gaussian distributions. Hereby, a methodology directly for uncertainty prediction from raw data is proposed based on Artificial Neural Networks (ANN), assuming no information is prior available about the sensor characteristics. The applicability of this data-driven strategy extends to highly non-linear and even intractable feature transformation, avoiding tedious mathematical derivations. Proof of this is supported by its implementation for autonomous vehicle location through odometry data, obtaining satisfactory results in varied scenarios.

## 2. PROBLEM DEFINITION

Autonomous driving highly depends on the sensor measurement, uncertainty and fusion. Nonetheless, sensor models are not generality provided by sensor manufacturers. This lack of data significantly increases challenges in decentralized architectures, where new sensors can be "plugged & played" within the ad-hoc network.

## 2.1. Motivation

The operational limits of the sensor technology condition safety in autonomous driving, as a consequence of their strong dependence on the measurements quality (Zheng and McDonald, 2003; Michalke et al., 2011). The GPS precision limits are a well-known example of the noise effect in systems performance, experienced daily by the general public through of the shelf navigation devices (Schrader et al., 2012). This involves the fusion of pseudo-range GPS signals of vehicles, used to minimize the error produced by uncontrolled sources like satellite clock bias, atmospheric delay, and acquisition noise. Nevertheless, despite the complexity of the error origin, previous studies model noise using Gaussian distributions (Liu et al., 2013).

The main barriers for sensor fusion in application, such as vehicle localization, are found in the uncertainty of sensor technology integrated in each vehicle. This inevitably affects the uncertainty characteristics, and the different nature of the signals to fuse, involving highly non-linear feature transformations (Xu et al., 2014). Furthermore, incorrect uncertainty estimation could reduce the fusion accuracy and produce a security hazard by deteriorating the system performance. Regardless of its importance, most research in this area has limited the error prediction to single vehicle model-based approaches usually using Gaussian distributions, developed either theoretically or empirically. Therefore, more accurate uncertainty estimation would be of great benefit for these applications, and would solve issues that hinder the implementation of the autonomous technology nowadays.

## 2.2. Problem Statement

Odometry measurement for vehicle location is subjected to sensor noise applied to velocity d and orientation θ. This uncertainty is extended to the features x and y coordinates, which are calculated from the noisy signals through geometrical transformations (Choi and Huhtala, 2016). Consequently, the vehicle location error is accordingly described by non-linear mathematical equations and accumulates along the path with every sampling time. The absence of information about the features uncertainty, x and y covariance, prevents from fusing odometry data with additional measurements that might become available along the trajectory. By means for this, the estimation of the location uncertainty is of great interest to efficiently correct the accumulated error (Zhang et al., 2013). Hereby a feature noise estimator is obtained from data with independence of the sensor characteristics, and complexity of the feature transformation. This solution allows for sensor noise prediction, avoids the use of complex mathematical formulations and facilitates sensor fusion under any use case.

#### 2.3. Data-Driven Modeling

Feature transformations are often difficult to derive in mathematical terms, and calculations that are usually time consuming and occasionally intractable. Nevertheless, vehicle trajectory data collection under real-life conditions is generally possible using limited resources. These evidences support the use of data-driven algorithms, methods that can efficiently manage big data and yield insightful conclusions from unknown complex processes (McAfee et al., 2012; Hou and Wang, 2013). Various algorithms can be used to derive the so-called surrogate models without requiring actual understanding of the relationship between inputs and outputs. These models are compact, normally inexpensive to evaluate compared to their homologous strictly mathematically derived. Furthermore, they are mathematically tractable and can estimate the process with high-fidelity at least locally to the training set (Gorissen et al., 2010; Koziel et al., 2011). Some examples of surrogate modeling techniques include: polynomial regressions, kernel methods, kriging, support vector machine, Radial Basis Functions (RBF), and Neural Networks (NN) (Jin et al., 2001; Razavi et al., 2012). Each method has different characteristics in terms of operation, complexity, design flexibility and fidelity capability. For instance, support vectors perform particularly well with high dimensional spaces when only scarce training data is available (Forrester and Keane, 2009). Highly non-linear and complex process are better captured using RBF, kriging, and NN, which require determining a specific number of parameters by trial and error. From a high level analysis, the structural limits of RBF are relaxed with kriging, which assumes the model response has stochastic behavior and fits it with a statistical basis. In the kriging method the basis function variance is considered a parameter, providing larger flexibility and resultant increase in training time.

Artificial Neural Networks allow modeling the relationship between inputs and outputs from data. This characteristic, applied to sensor noise, is expected to be able to find the underlying relation between measurement and noise associated to them. Furthermore, NN accept multiple inputs, which can be used to determine additional features affecting the noise and their correlation. With these precedents, NN offer an exceptional framework for implementing and testing the suitability of models generated from data applied to noise magnitude prognosis of sensor measurements. Hereby, NN are selected within the above strategies to exploit their potential for sensor noise estimation, and explore their high level of flexibility owing to their substantial number of defining parameters: network structure, neuron function, number of hidden layers, and number of neurons per layer.

## 2.4. Design for Surrogate Model Development

In terms of number of hidden layers, the criterion applied focuses on the trade-off between accuracy and generalization capabilities. Sensor fusion algorithms will benefit from a guidance to assess the extent of the error covariance of new sensor measurements. This information will allow the system to identify the degree of information present in the new measurement and perform the data fusion accordingly, ensuring the output maximizes the information content. It is therefore acceptable to obtain a guidance value of this error covariance and not highly precise results, reason why the network structure selected for sensor noise estimation is formed by a single hidden layer. This simplified structure might prevent from learning accurate noise behavior as observed in deep learning, but would also facilitate training and avoid noise fitting when applied to noisy sensor signals. Generalization capability is hereby prioritize against results accuracy with the selection of the single hidden layer structure.

The network layers, named input, hidden and output, can be connected through either feedforward (FF) configuration or using a feedback (FB) connection. Layers in FFNNs only receive information from forward layers, whilst in FBNNs any neurons can connect with each other. Consequently, signals in FBNN are repeatedly transformed and lean toward steady state or vibration state. By introducing feedback delays, this structure is also able to capture the relationship between past inputs and current output, influence that is completely ignored in the FF configuration. In the following, both FF and FB configurations are examined as candidates of NN structure so as to determine the most suitable configuration with the support of the training and test results.

Once the network structure is defined, the size of the layers need to be determined. Input and output layers are constrained by the input and output signals selected set, but the hidden layer is a prior a free parameter, closely related with the process complexity. By reason of lack of known mathematical formulation of this particular case, this number has to be determined by trial and error. The optimal size criteria should consider a trade-off between complexity, accuracy, and generalization capability of the neural network candidates. Excessively complex networks not only raise training time, but also increase the risk of over fitting, which would return high accurate results over the training set and poor generalization capacity on new data (Hagan et al., 2014). The growing method is used in this application in order to prevent for over fitting by establishing an initial network with relative small size, and increasing it gradually with special attention over both the training and testing accuracy.

## 3. TRAINING DATA GENERATION AND ANALYSIS

To encourage acceptable performance under all possible scenarios, the amount and variability of the training data should ideally account for any conceivable use case. The data selected for training proceeds from six different trajectories that combine disparate direction, length, orientation, and speed as depicted in **Figure 1**.

## 3.1. Training Data Generation

The trajectories contain highly precise vehicle locations in Cartesian coordinates, and yaw signals, regarded as Ground

Truth (GT). The GT signals need to be processed to generate realistic data complying with a real case scenario. Training data is generated assuming the vehicle velocity and orientation, identified herewith with the symbols d and θ respectively, are collected with sensors characterized by white noise. For the purpose of the following investigations, the standard deviation values 0.1 and 0.001 are respectively selected for speed and orientation measurements. These values are based on experience and are considered representative of general noise measure in odometry sensors.

• Calculate d and θ GT, dGT and θGT, from xGT and yGT.

• Add white noise artificially to dGT and θGT, through a Monte Carlo (MC) simulation with 1000 iterations. These results in d and θ measured (M), d<sup>M</sup> and θM, and emulates sensor noisy data acquisition.

• Use the inverse equations to calculate x<sup>M</sup> and y<sup>M</sup> from d<sup>M</sup> and θM, which in essence is the signal to feature transformation.

• Use the 1000 MC noisy versions of the trajectories to calculate the location error standard deviation (std) in x and y and the location covariance (covxy).

**Figure 2** illustrates the detailed process in a flow diagram describing the steps and signals obtained from ground truth to measured feature data. As included in the respective steps for geometrical transformations, the signals d and θ could be obtained along the sampling steps in a cumulative manner as detailed in following equations (Zhang et al., 2013):

$$\mathbf{x}\_n = \sum\_{i=1}^n d\_i \cdot \sin\left(\sum\_{j=1}^i \theta\_j\right) \tag{1}$$

$$y\_n = \sum\_{i=1}^n d\_i \cdot \cos\left(\sum\_{j=1}^i \theta\_j\right) \tag{2}$$

where n refers to the current time step. By combining Equations (1) and (2), the variables of interest can be obtained for every sampling time:

$$\theta\_{i+1} = \arctan\left( (\mathbf{x}\_{i+1} - \mathbf{x}\_i) / \left( \mathbf{y}\_{i+1} - \mathbf{y}\_i \right) \right) - \sum\_{j=1}^{i} \theta\_j \tag{3}$$

$$d\_{i+1} = (\varkappa\_{i+1} - \varkappa\_i) \left/ \sin \left(\sum\_{j=1}^{i+1} \theta\_j \right) \right. \tag{4}$$

Noise can be artificially added to the GT results of Equations (3) and (4) by randomly generating numbers with the previously designated standard deviation. The results are regarded as sensor measurements and can be used to obtain the measured features following Equations (1) and (2), implemented as:

$$\varkappa\_{i+1} = \varkappa\_i + d\_{i+1} \cdot \sin\left(\sum\_{j=1}^{i+1} \theta\_j\right) \tag{5}$$

$$\gamma\_{i+1} = \gamma\_i + d\_{i+1} \cdot \cos\left(\sum\_{j=1}^{i+1} \theta\_j\right) \tag{6}$$

The ground truth original data and measured features can be compared to determine the uncertainty over the vehicle location at every point of the trajectories. This is defined by the standard deviation of the feature estimation error:

$$
\sigma\_{\mathfrak{x}} = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} (e x\_i - \mu\_{\mathfrak{x}})} \tag{7}
$$

where N corresponds to the number of MC iterations of the same trajectory. The estimation error and the mean estimation error can be calculated as follows:

$$e\mathbf{x} = \mathbf{x}\_{GT} - \mathbf{x}\_{M} \tag{8}$$

$$
\mu\_{\mathfrak{x}} = \frac{1}{N} \sum\_{i=1}^{N} e \mathbf{x}\_i \tag{9}
$$

Similarly, the equations can be applied to y coordinate to obtain ey,µy, σy. Finally, the covariance of the errors in x and y is obtained by:

$$\text{cov}\_{\mathbf{x}\mathbf{y}} = \frac{1}{N} \sum\_{i=1}^{N} \left( \mathbf{x}\_i - \boldsymbol{\mu}\_{\mathbf{x}} \right) \left( \mathbf{y}\_i - \boldsymbol{\mu}\_{\mathbf{y}} \right) \tag{10}$$

The uncertainty is defined as σx, σy, and covxy. Mean error in x and y bias the location measurement, but are not considered as estimation targets in this particular study. The previous transformation provides information of the uncertainty of the vehicle location, when measured through noisy velocity and orientation sensors subjected to a specific level of white noise. This measurement allows evaluating the quality of the current location through odometry, and therefore to which extent this measure should contribute into a sensor fusion framework when compared to other sources.

σ<sup>x</sup> obtained for each of the trajectories is illustrated in **Figures 3**, **4**, by assuming the vehicle is perfectly located at the initial point. **Figure 3** represents σ<sup>x</sup> growth along the entire trajectories until the end point of the longest one, whilst **Figure 4** illustrates a zoom in the σ<sup>x</sup> to better visualize the uncertainty accumulated in shorter paths. σ<sup>x</sup> always presents and increasing trend due to the cumulative characteristics of the error in the vehicle location. Nonetheless, this tendency of accumulation differs between trajectories, which suggests that the shape of the trajectory and the characteristics of

the displacement affect the uncertainty growth. The curves are therefore dissimilar between trajectories and presumably influenced by variables such as 1x, 1y, and 1yaw. Analogous behavior is observed when analyzing σ<sup>y</sup> and covxy, which agrees with the previous assumption.

The growth of the combined uncertainty in all directions σx, σy, and covxy is illustrated in **Figures 5**, **6**, which is represented via ellipses that increase in area as the error accumulates. A first visual examination allows identifying how larger increments in a specific

direction affect to the growth of the uncertainty differently, which also allows drawing a priori hypothesis of the variables closely related. In the following, the study concentrates on σx, although the results and conclusions are expected to be dimension agnostic and applicable to both σ <sup>y</sup> and cov xy.

#### 3.2. Training Data Analysis

As aforementioned, one of the remaining design parameters of the NN is the number of inputs. Ideally, the input variables should contain the maximum number influencing factors over the target to estimate, but its scope should be constraint to prevent from excessive training time and network complexity and overfitting. The optimal input selection should aim to gather maximum relevant information for the prediction and minimum non-relevant data. Non-useful information would increase the model complexity, and might introduce misleading data that deteriorate the generalization capability.

Inputs can be selected using common sense in easily interpretable applications, albeit there are alternative correlation analysis able to evaluate numerically their level of dependence with respect to the target (Sudheer and Ramasastri, 2002). In addition to a prior evaluation of inputs and outputs, NNs can be themselves used for signals selection. In simple network structures, the importance of each signal can be identified by looking into the magnitude of the weights that connect them to the successive layers. A formal analysis of the signals weights is included in Giordano et al. (2014), where a criteria for input selection is derived mathematically and tested.

As a result of the noisy characteristics of the variables used in this particular application, complex signal evaluation is not considered of interest. Instead, the procedure follows inputs selection, by consisting the combination of a correlation analysis with the training and testing results interpretation of various input candidates. First, the signals linear correlation is studied calculating the Pearson correlation coefficient with respect to the output. The high sampling rate used compared to the input candidates' variability, allows assuming no time lag exists between inputs and outputs, simplifying the evaluation.

Although immediate effect of inputs over the outputs is impracticable, the study of the signals' variability with respect to delayed ones suggests this assumption is acceptable (Maier and Dandy, 1997). The output of the correlation provides useful information to select several input candidates, which are later tested at a second stage to conclude into the most suitable option. The best candidate is assessed in terms of training and testing results in an iterative process.

#### 3.3. Inputs vs. Output Correlation

The signals available to use as inputs are: x, y, yaw, θ, d. The use of values such as real x and y directly impairs generalization, as the network would learn from training trajectories characterized by specific absolute location points. Moreover, the incremental tendency of the uncertainty suggests additive behavior happening in every trajectory with independence from the initial and relative vehicle location. This reasoning supports the use of signals increments between sampling steps, rather than absolute values with respect to a pre-defined reference system.

Another hypothesis that can be reasonably stated is the effect of the direction of the displacement over the uncertainty growth; that is, whether variables increments or absolute variables increments are suitable for the input set. This conjuncture would be resolved when determining the effect of the inputs increments sign on the uncertainty accumulation. It is sensible to assume that the features uncertainty is affected by the actual magnitude of the displacement with independence on the direction; the uncertainty should not be affected by the reference system. Consequently, it could be deduced that the sign omission would avoid needless information to be fed into the NN, and therefore would encourage the generalization capacity.

All previous hypotheses are considered to determine the signal candidates to evaluate in the correlation analysis. **Table 1** contains the Pearson correlation coefficients between input candidates and output in all training trajectories. The formula used is the base line Pearson equation, where r, cov, and σ represent respectively correlation coefficient, covariance and variance of the designated signals (Lee Rodgers and Nicewander, 1988).

$$r\_{\text{xy}} = \text{cov}(\nu ar\_1, \nu ar\_2) / \sigma\_{\text{var1}} \sigma\_{\text{var2}} \tag{11}$$

**Table 1** shows high correlation between σ<sup>x</sup> and the signals yaw, x, and y. d appears to have secondary importance, although it proves to be relevant in training trajectories 3 and 4, when compared to 1 and 5 for instance. These differences are associated with the mean value and mean absolute value of the speed, observed to be higher in trajectories 3 and 4. In contrast, θ correlation seems to be negligible.

The correlation coefficients change substantially when analyzing variables increments. The weight of 1x and 1y weights reduce, with respect to the original variables and 1yaw becomes practically independent to the output. 1y shows stronger relationship with 1σ<sup>x</sup> when compared to 1x. In contrast, 1yaw are only tangible in trajectories characterized by substantial direction changes, as happens in trajectories 4 and 5. 1d also loses relevance when compared to the absolute variable analysis, and θ influence is barely affected and kept negligible. When focusing on absolute values of the increments, the correlation results produce similar values compared to relative increments, supporting a priori hypothesis over the uncertainty independence with respect to the sign of the movement. A part from the magnitude of the correlation coefficients, the sign can be also interpreted. The broad variety of values and sign within trajectories prevents from selecting a single preferred combination of training signals, reason why several candidates are selected to further pinpoint the suitability.

As a final remark, it is worth highlighting that 1y presents larger correlation with σ<sup>x</sup> than 1x, and vice versa. That is, in trajectories with more movements in x direction, σ<sup>y</sup> grows quicker than σ<sup>x</sup> and similarly, in trajectories with larger displacement in y direction, σ<sup>x</sup> grows quicker than σy. This is observed in all trajectories with exception of trajectory 4, where both uncertainties are similar probably due to the followed direction in repeated circles. An explanation of this phenomenon might be found, in the relative amplitude of the actual displacement every sampling and the error magnitude. Whilst the error might be negligible after a large displacement, it could be of the same order of magnitude of short movements, causing larger distortion in the vehicle location. Consequently, large growth of σ<sup>x</sup> could be associated to low 1x instead of being related to 1y as it was initially deduced from the results of the analysis.

## 3.4. Delayed Signals Correlation

Inputs to output correlation analysis is complemented with the signals delay study, also evaluated using the Pearson coefficient. The aim of this test is to determine the possible relationship between old inputs and current outputs; that is, the influence of past changes in the vehicle movement and location on the accumulation of the current uncertainty in the vehicle positioning. The first correlation test of delayed signals analyses the relationship between delayed inputs and current output. The steps used are 0, 1, and 2 sampling times. Next, in order to draw a holistic understanding of the signals interrelation, a second correlation test between current input signals and the same ones delayed 1 and 2 sampling steps is also analyzed.

The results of the first test show similar correlation between input and outputs with independence of the delay implemented. Nonetheless, the second test also shows strong correlation between inputs and the respective delayed inputs. Although from the first results it could be considered that the output depends on past input signals, the second analysis discredits this assumption as they could also be due to the high similarity between current and past inputs. Consequently, no conclusive assumptions can be derived from this correlation test.

The results from the second correlation test effectively show that the inputs and delayed version of the inputs are practically identical, and consequently show similar correlation with the output. As previously states, this similarity might be due to the small sampling step implemented with respect to the input signals variability in time. Further investigations should be conducted to arise conclusive answers to the previous hypothesis. Accordingly, additional study with respect to the delay effect of the feedback NN states is considered during training.

## 4. TRAINING SETS CANDIDATES

Hereby, a training set is defined as the union of a specific combination of input signals, obtained from a selected number of training trajectories in an enclosed array used as training data. That is, the training set is defined by the signals used between the candidates previously analyzed in the correlation analysis and the trajectories from which signals are extracted. The training sets can contain data proceeding exclusively from a single trajectory or from the combination of more than one. Furthermore, the same trajectory can be repeated in each set in more than one occasion by implementing different noisy version from the 1000 MC simulations, practice that intents to encourage the response robustness to the presence of noise in the inputs.

## 4.1. Training Sets

The input training sets are designed in terms of number of signals, trajectory characteristics and amount of trajectories used, and always contain noisy data so as to simulate with



TABLE 2 | Input sets candidates proposed for training and testing including: signals selected, data used, and hypothesis to verify/reject in the training results.


maximum fidelity real case studies. **Table 2** includes the training set candidates carefully designed to determine: the most suitable combination of inputs, optimal network structure and size and data variability requirements.

The second column in **Table 2** specifies the input signals used in each set, where abs and 1 indicates absolute value and signals increment, respectively. The amount of data used in each training set is detailed in the third column, alluding to the variability of trajectories used and amount of MC noisy versions of each trajectory. For instance, set 1 considers all trajectories repeated three times each; including therefore three MC noisy versions of each. The use of larger number of trajectories or specific ones is thoroughly defined, so as to reflect changes in the generalization capability with respect to training data variability. Consequently, by repeating noisy version of the same trajectory, the data variability should be much less than noisy versions of different trajectories.

The incremental variables specified in **Table 2** are calculated, with respect to the constant sampling rate in training sets. As an attempt to encourage generalization, alternative incremental inputs are proposed by using random dynamic sampling within the boundary of 1 to 9 sampling steps. Nonetheless, the networks trained with this data were not able to estimate the target variable, reason why they are neither included in **Table 2** nor in the test results. The failure to capture the process could be excused in the data variability and complexity introduced through variable sampling. The networks trained with this data were presumably required to emulate a behavior more complicated than the one described with constant sampling. Consequently, it might be the case that the amount of training data and network size used were not suitable to capture efficiently the underlying process. **Figure 7** illustrates a flow diagram that clarifies the design process and characteristics that define a training set. The input selection is partitioned in three stages: selection of the key signals combination, format of the signals preferred (real value, increments, or absolute increments) and trajectories used to extract the data.

## 4.2. Training Candidates

Sets 1 and 2, similarly to 3 and 4, compare the effect of the data variability on the generalization capacity by implementing identical input variables, but using data from different trajectories. Sets 2 and 3 determine the effect of the inputs sign again with respect to generalization, and intend to provide numerical support to proof the independence between uncertainty accumulation and movement direction. Sets 5 to 8 implement identical data, but use different input candidates so as to obtain the optimal signals combination.

#### 5. NN DESIGN AND TRAINING

The NN candidates are compared in terms of: network size, structural complexity and results output quality. The accuracy of the estimation is evaluated using various error measurements applied to both network output, 1σx, and target variable σx.

FIGURE 7 | Training sets generation from signals proceeding from the available trajectories including signals selection (from correlation analysis), signal processing (true value, increment, or absolute increment), and training trajectories used (data variability).

TABLE 3 | Training results comparison and analysis in terms of structural and data complexity, training time and performance, and accuracy indexes as convey for error evaluation.


#### 5.1. Error Measurements

The results are evaluated by using Root-Mean-Square (RMS) error and relative error measures between the network output and reference data, GT variables. The error measurement include: RMS of σ<sup>x</sup> and relative error of the accumulated uncertainty obtained at the end of the trajectory, σx(end). The first assesses the actual performance of the network given the fact that the target variable is the increment of the uncertainty, 1σx, and not the cumulative one, σx. The second, σ<sup>x</sup> RMS, evaluates the variable of interest and determines the possible predicted error accumulation and the actual one. Finally, the relative error of the final cumulative uncertainty analyzes how well the NN would perform in case a new sensor becomes available at the end of the vehicle path and fusion is required.

These indicators are calculated for each of the trajectories separately, so that it is possible to compare the performance in both, data used for training and data not seen before. It is also worth mentioning that the data used for training in all cases consist of 70% of the total amount that defined the training set, whilst the rest is used for testing and validation during the training process. None of the sets or NN configurations converge during the training process, when using the LR architecture as a consequence of the noisy characteristics of the data used. Nonetheless, this behavior is not necessarily detrimental due to the fact that the generalization capability is preferred to the estimation accuracy of a specific trajectory; it is of importance to avoid noise fitting. Consequently, the networks are trained up to a certain performance value or number of iterations, epochs, and early stopping is used before the training gradient stabilizes. The output to estimate is the relative increment of the error in x std in all cases, 1σx.

#### 5.2. Training Algorithms

Two algorithms are used for training, Levenberg-Marquardt (LM) and Scaled Conjugate Gradient (SCG). These aim to compensate from deficiencies in terms of robustness, and convergence time of the well-known Error Backpropagation (EBP) and Gauss-Newton algorithms (Moller, 1993; Yu and Wilamowski, 2011). Both differ in the selection of the step size and direction during convergence. Ideally, longer steps should be implemented at early stages and gradually smaller ones should be considered to encourage the result finesse in later stages. Moreover, the error shape might also change, affecting simultaneously to the optimal step direction. SCG implements

optimized step size and direction, whilst LM alternates EBP and Gauss-Newton methods depending on the error shape. LM combines the advantages of both strategies taking advantage of the speed convergence of Gauss-Newton with quadratic error, and the robustness of EBP convergence behavior under conditions of non-advantageous for Gauss-Newton.

The training results usually benefit from LM when compared to SCG throughout the test cases. Furthermore, LM presents low µ values, variable that determines the alternation between methods, but it converges neither into Gauss-Newton nor into the steepest descent method.

## 5.3. Training Results: Input Signals Selection

**Table 3** summarizes the results obtained after training specific network structures, second column, with the training sets enumerated in the first column. Training set 2 is used to compare FF and LR networks. In all cases, the training with identical set and structure is repeated in more than one occasion, typically up to six times. This practice is recommended due to the possible effect that the random weights initialization can cause over the end solution, which can potentially be trapped into local optima. The results included in **Table 3** are taken from the best network obtained after training several candidates. These figures are compared between equal amounts of iterations.

Sets 1 and 2 are omitted for brevity, as the conclusion coincides with the analysis of sets 3 and 4. Set 2 is used to implement FF and FB configurations as included in the first two rows of **Table 3**. The noise filtering capability of FBNN improves the estimation accuracy notably, reason why the LR structure is concluded as the most suitable and successively used in the subsequent training sets.

FIGURE 10 | (Top) x and y coordinates of trajectory 1; (Bottom-Left) 1σ*x* in trajectory 1 with respect to the sensor sampling; (Bottom-Right) 1σ*x* in trajectory 1 with respect to the sensor sampling.

Six networks trained with set 2 are compared to six networks trained set 3, to determine the effect of sign in the generalization capacity. The best candidate from set 3 shows that the error is 30% lower compared to the best candidate provided by set 2. The results evidence the benefits of reducing non-relevant data in the input set, and corroborate the independence of the location error accumulation with respect to the movement direction.

Sets 3 and 4 evaluate the relationship between generalization and variability of the training sets. The candidates are trained for similar number of epochs using the same structure and size. Networks trained with set 4 are expected to have improved generalization capacity, when compared to the ones trained with set 3, oppositely to the results detailed in rows four and five. These results are not conclusive due to the deficient amount of training time allowed for set 4 network candidates, but show evident differences between networks complexity when trained with each set. Larger data variability would presumably imply also higher network complexity, and therefore longer training time and number of epochs.

Sets 5 to 8 implement the same data variability, but use different combination of input signals. The training is programmed for similar number of epochs in all cases, and therefore the results are expected to benefit simpler training sets. By comparing sets 5 and 7 it can be deduced that the yaw is preferred to θ, and therefore set 6 does not benefit from the additional information. Nevertheless, the result of set 8 do Zhang et al. Neural Network Based Uncertainty Prediction

not corroborate this hypothesis, reason why extra training is programmed with sets 9 and 10. These last sets incorporate maximum data variability, and are trained for a larger number of iterations up to a satisfactory performance. As expected from the preliminary results obtained in sets 5, 6, and 7, the combination of inputs used in sets 5 and 9, 1x, 1y, and 1yaw, is superior to the other candidates. These results were already anticipated by looking into the relationship between coordinates x and y and odometry signals d and θ, which geometrically dependent through (1) and (2). Consequently, it is reasonable to assume that the combination of any of the previous pairs would not provide extra information to the training set. In contrast, yaw proceeds from a three dimensional displacement incorporating new data that seems to the valuable for uncertainty prognosis.

## 5.4. Delay Effect on the Training Results

The effect of the feedback delay over the estimation accuracy is studied in the LR configuration by implementing: 1 sample delay, 2 samples delay and the combination of both. The results obtained do not vary substantially with incremental delays, which supports the hypothesis of independence between delayed inputs and current outputs. Nevertheless, the output precision benefit in all cases from the noise filtering effect of feedback structures, reason why 1 step delayed is selected.

**Figure 8** presents the structure used to find the optimal hidden layer size, where n corresponds to the number of neurons in the hidden layer.

#### 5.5. Hidden Layer Optimal Size

Further training with set 9 using 20, 30, 40, and 50 neurons was programmed to determine the optimal hidden layer size. Networks with 20 and 30 neurons presented unacceptable error measures as they were not able to capture the process complexity. Fifty neurons networks were able to accurately estimate the cumulative uncertainty in most of the trajectories, but presented inconsistent behavior in some cases. The estimation results obtained with 40 and 50 neurons networks are visually compared in **Figures 9**, **10**. These illustrations are formed by three graphs, the trajectory shape at the top level and the uncertainty estimation at the low level, including the uncertainty increment of the left and the accumulation on the right.

**Figure 9** illustrates trajectory 0, in which the 50-neurons network returns improved results when compared to the 40 neurons networks. Higher number of neurons are able to filter the noisy inputs more effectively, as illustrated in the left graph, and seem to follow better the uncertainty increment, almost matching the cumulative value at the end of the trajectory. The estimation results of the 40-neurons network also match the increments in uncertainty and the shape of the accumulated error, but it is not able to effectively filter the noise. It could be deduced from the previous results that the more the noise is filtered, the better the estimation accuracy obtained. Nevertheless, this is not the case observed when analyzing trajectory 1 as illustrated in **Figure 10**. Although again 50 neurons networks filter the noise in the uncertainty increments in the left graph, the tendency of the cumulative uncertainty diverges from the target variable causing inconsistent behavior. TABLE 4 | Comparison of training sets 9 and 11 in terms of set characteristics and testing results in trajectories not used for training.


Oppositely, 40-neurons networks are able to both filter the noise and follow the cumulative uncertainty tendency, returning reasonably accurate results at the end point of the trajectory.

Although 50-neurons network are able to return very accurate results in most of the training trajectories, they show inconsistent behavior at times, which notably diverge from the target. As previously stated, the generalization capability primes in front of the estimation accuracy in the specific application of sensor fusion. Consequently, 40-neurons networks are considered to be the best candidate to model the uncertainty increment, and are therefore considered as reference size in the following tests.

These results agree with the so-called Ockham's Razor principle, which prefers simpler networks structures able to provide acceptable level of accuracy, rather than complex and more accurate ones. The 50-neurons network is able to capture higher non-linear process than the smaller versions. This extra modeling capacity could be either trained to fit the process more accurately, or to capture other processes such as inputs noise, excusing the divergent results depicted in **Figure 10**. Nonetheless, it is not a suitable network size for the characteristics of the available training data.

## 6. TEST RESULTS

Given the fact that the NN has seen 70% of the data used for testing, although different version of the noisy trajectories are used, these still share similar characteristics which might

prevent from yielding final conclusions. Furthermore, NNs can be considered as blackbox models whose robustness cannot be tested with conventional methods. In order to further analyze the candidate performance, 28 new trajectories combining straight lines, sharp turning and winding routes along urban areas are used for testing.

#### 6.1. Set Candidate 9

Likewise to the methodology followed to process the training trajectories, the test trajectories are converted into noisy features emulating data collection through noisy sensors. Again 1000 MC noisy versions are generated to model the uncertainty accumulation along the path. The estimation performance of the 40-neurons network is tested in all 1000 noisy version of each of the 28 trajectories so as to obtain the average error: RMS of 1σx, RMS of σx, and end error of the cumulative uncertainty. Although the estimation results are in average satisfactory, the network candidate is not able to fit error accumulation with appropriate accuracy in all test cases. Due to the characteristics of blackbox model of the NN, it is difficult to predict in which case scenarios the network will be able to capture the uncertainty growth.

**Figures 11**–**13** illustrate the results obtained in test trajectories 4, 24, and 12, respectively. These contain three graphs including the trajectory coordinates, the cumulative uncertainty and the yaw signal respectively from top to bottom. The middle graph illustrates in red the targeted cumulative uncertainty and 1000 estimation outputs of the network in blue. Although the cumulated uncertainty is well-captured in trajectories 24 and

12, this is not the case of trajectory 4. These results can be explained through the values of the yaw signal in this test scenarios. The first 200 sampling steps are characterized by close-to-zero yaw, followed by a large increment in yaw and a close-to-constant value until the end of the route. When looking into the tendency of the accumulated uncertainty, the shape seems to match the growth only in the last 200 sampling steps. This behavior is observed in some of the testing trajectories, suggesting a consistent response. Furthermore, when assessing the training trajectories, no scenario with large yaw changes is observed. As a consequence of this deficient training data, the network is not able to capture the uncertainty when these conditions take place in the testing trajectories.

In contrast to testing trajectory 4, the trajectories 24 and 12 do not present zero yaw values and maintain it mostly constant along the whole path. As a consequence, the network is able to accurately fit the uncertainty accumulation, with accuracy acceptable to allow for sensor fusion at any point of the route. It can be concluded from the previous that the generalization capacity of the network could be potentially improved provided that the training set includes the test cases missing.

## 6.2. Set Candidate 11

The deficiencies of training set 9 data variability are corrected in candidate 11. This implements the same input signals, absolute value of 1x, 1y, and 1yaw, but includes a larger amount of trajectories and therefore a larger number of case scenarios that include possible changes in the yaw signal, not previously captured. Whilst set 9 only considers data from the training trajectories 0 to 5, set 11 also includes some of the trajectories previously used for testing; test trajectories 1, 2, 3, 4, 9, 14, 22, and 24. **Table 4** contains the details of the sets characteristics, data used, network structure, training and testing results. This table includes the results corresponding to the best network trained with sets 9 and 11. Again, new networks are trained with identical structure and data, and the best is selected to avoid deceiving results caused by random weights initialization. The results included in **Table 4** evaluate the networks performance in all testing trajectories, including the ones also used for training in set 11.

In average, the prediction accuracy of the increment in the uncertainty, 1σx, is identical in both cases when analyzing the average accuracy in all test trajectories. This result suggest that the 40 neurons LRNN has reached a performance limit with set 9 and does not admit the further complexity provided in set 11. Moreover, set 9 presents better accuracy when analyzing the average RMS error in σ<sup>x</sup> and worst results when comparing the end error.

When looking into the trajectories used to train set 11, as expected the results provided by the network trained with set 9 are worse. In this case set 11 has the advantage of having implemented 70% of those trajectories during training. Nevertheless, the accuracy of the network trained with set 11 is worse when looking exclusively to the trajectories not used in any of the sets. This result could be explained considering the loss of generalization capability, when the training data complexity overcomes the non-linear capacity of the network structure.

The networks results are visually compared in **Figures 14**, **15**, where test trajectories 6 and 12 are illustrated. None of these trajectories have been used to train any of the networks and therefore, the results can be interpreted as pure testing. The uncertainty estimation in the 1000 MC using the network trained with set 9 is represented in green, whilst the respective results of the network trained with set 11 are illustrated in blue. The target curve is illustrated in red in both cases. **Figure 14** shows how set 11 outperforms set 9 prediction results, whilst **Figure 15** illustrates the opposite case. Although the results of both networks are rather similar in terms of accuracy, set 11 prediction in the 1000 MC noisy versions of each trajectory are less spread than the equivalent ones from set 9. The larger variability of data used in set 11 seems to have the effect to improve the prediction robustness to noise, and therefore reduce the variability of the prediction when noisy versions of the same trajectory are used. It can be deduced that the consistency of the results is improved when implementing sets with larger variability due to improved noise robustness.

After identifying the most suitable set for training, the optimized number of neurons was investigated. The previous results suggested that the optimal number of neurons is close to 40, probably situated between 40 and 50 neurons. Therefore, further tests were performed using 38, 42, 45, and 47 neurons. The training was maintained until stabilization of the networks performance and allowing generally higher number of epochs for larger sizes. The results verify the a priori hypothesis and situate the best candidates within 38 and 42 neurons. In particular the 42-neurons candidate improves the RMS error in 14 and 18% the 40 and 45-neurons candidates respectively. The cumulative RMS and end errors are also noticeable improved. The RMS error is improved by 0.0296 and 0.0184 and the end error by 0.161 and 0.457, when compared to the 40 and 45-neurons networks respectively. It can be therefore concluded that the optimal network size is 42-neurons hidden layer.

## 7. CONCLUSIONS

This paper proposes a strategy for feature uncertainty estimation directly from data without prior knowledge of the sensors characteristics. NNs learning capability of nonlinear processes is tested in the particular application of vehicle location through odometry measurements. Both input set and network structure design are based on training and testing results obtained with various neural network candidates. The final results confirm NNs as suitable surrogate modeling technique robust to changes in the testing data, inputs noise and variable case scenarios, provided that the training data captures enough data variability and the network size and structure complexity is able to resemble the process non-linear characteristics.

## AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### REFERENCES


#### ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61703335, the Fundamental Research Funds for the Central Universities under Grants G2017KY0302, and the 111 Project under Grant B18041.

Struct. Multidiscipl. Optim. 23, 1–13. doi: 10.1007/s00158-001-0 160-4


**Conflict of Interest Statement:** CM was employed by company Porsche Engineering Services GmbH, DC was employed by company Cogsense Technologies Limited. All other authors declare no competing interests.

Copyright © 2019 Zhang, Martinez, Clarke, Cao and Knoll. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# SVM-Based Classification of sEMG Signals for Upper-Limb Self-Rehabilitation Training

Siqi Cai <sup>1</sup> , Yan Chen<sup>1</sup> , Shuangyuan Huang<sup>1</sup> , Yan Wu<sup>2</sup> , Haiqing Zheng<sup>3</sup> , Xin Li <sup>3</sup> and Longhan Xie<sup>1</sup> \*

*<sup>1</sup> Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China, <sup>2</sup> A* <sup>∗</sup>*STAR Institute for Infocomm Research, Singapore, Singapore, <sup>3</sup> The Third Affiliated Hospital, Sun Yat-sen University, Guangzhou, China*

Robot-assisted rehabilitation is a growing field that can provide an intensity, quality, and quantity of treatment that exceed therapist-mediated rehabilitation. Several control algorithms have been implemented in rehabilitation robots to develop a patient-cooperative strategy with the capacity to understand the intention of the user and provide suitable rehabilitation training. In this paper, we present an upper-limb motion pattern recognition method using surface electromyography (sEMG) signals with a support vector machine (SVM) to control a rehabilitation robot, ReRobot, which was built to conduct upper-limb rehabilitation training for post-stroke patients. For poststroke rehabilitation training using the ReRobot, the upper-limb motion of the patient's healthy side is first recognized by detecting and processing the sEMG signals; then, the ReRobot assists the impaired arm in conducting mirror rehabilitation therapy. To train and test the SVM model, five healthy subjects participated in the experiments and performed five standard upper-limb motions, including shoulder flexion, abduction, internal rotation, external rotation, and elbow joint flexion. Good accuracy was demonstrated in experimental results from the five healthy subjects. By recognizing the model motion of the healthy side, the rehabilitation robot can provide mirror therapy to the affected side. This method can be used as a control strategy of upper-limb rehabilitation robots for self-rehabilitation training with stroke patients.

#### Edited by:

*Changhong Fu, Tongji University, China*

#### Reviewed by:

*Zhaohui Xia, Rensselaer Polytechnic Institute, United States Dongdong Zheng, National University of Singapore, Singapore Qinyuan Ren, Zhejiang University, China*

#### \*Correspondence: *Longhan Xie*

*melhxie@scut.edu.cn*

Received: *12 February 2019* Accepted: *09 May 2019* Published: *04 June 2019*

#### Citation:

*Cai S, Chen Y, Huang S, Wu Y, Zheng H, Li X and Xie L (2019) SVM-Based Classification of sEMG Signals for Upper-Limb Self-Rehabilitation Training. Front. Neurorobot. 13:31. doi: 10.3389/fnbot.2019.00031* Keywords: surface electromyography, support vector machine, rehabilitation robot, upper limb, motion pattern recognition

#### INTRODUCTION

Stroke is the leading cause of adult disability around the world (Burton et al., 2017), with upper-limb motor impairments being the main factor influencing the quality of life in stroke survivors (Stinear et al., 2017). Repetitive motor training on movement has a notable curative effect on the restoration of arm function in stroke patients, and the patients' degree of recovery is positively influenced by treatment intensity (Steven et al., 2006; Gittler and Davis, 2018). Conventionally, stroke patients usually rehabilitate with the assistance of therapists. However, the involvement of therapists is challenging because rehabilitation training is a time-consuming and labor-intensive process. Many stroke survivors experience upper-limb impairment with few rehabilitation opportunities due to a lack of rehabilitation therapists. Robotassisted therapy devices, which can provide the affected arm with high intensity and repetitive treatment, have been increasingly used in rehabilitation training and can potentially enhance upper-limb functional recovery in stroke survivors (Yoo and Kim, 2015; Veerbeek et al., 2017).

Various rehabilitation robotic devices have been developed for upper-limb training in stroke patients. Among them, MIT-Manus (Krebs et al., 1999) was one of the first systems to be developed and can provide stroke survivors with plane movements. Furthermore, MIME (Lum et al., 2006), GENTLE/s (Coote et al., 2008), T-WREX (Domien et al., 2011), and NEREBOT (Stefano et al., 2014) were proposed to permit threedimensional exercise training for patients with impaired arms.

Different control strategies have been developed and applied to rehabilitation robots for the recovery of the affected arm. Motion parameters of the patient's arm are one of the major inputs in the rehabilitation robot's control system. Many types of mechanical inputs, such as switches (e.g., Aubin et al., 2013; Artz et al., 2015), force sensors (e.g., Diftler et al., 2014), and computer vision (e.g., Taati et al., 2012), have been used as feedback in the controllers of rehabilitation robots. The surface electromyography (sEMG) signal, which is composed of the action potentials from groups of muscle fibers, is one of the major sources of information about neural control and can reflect the degree of activity of the muscles (Yang et al., 2016). During the rehabilitation training of the upper limb, sEMG signals can be captured, interpreted, and used as input for the control algorithms of rehabilitation robots (Rosen et al., 2001; Kiguchi and Hayashi, 2012; Peternel et al., 2016). Considering that rich motor control information and the user's intention can be detected from sEMG signals, the sEMG-based control scheme is one of the most appropriately suited approaches for upper-limb rehabilitation robots (Singh et al., 2014).

However, the sEMG signal is affected by many factors and is not stable, which can lead to low accuracy in recognizing patient motion intentions. Some studies have shown that machine learning techniques can be employed for classifying different tasks and improving the robustness and accuracy of the identification and classification of arm movements through the exploitation of sEMG signals (Lucas et al., 2008; Young et al., 2013; Suberbiola et al., 2015). The SVM algorithm is a well-established technique to learn how to classify new data starting from a collection of classified events and has been widely applied in machine learning problems (Vapnik, 1995; Suykens et al., 2015) and sEMG processing (Song et al., 2007) because of its simplicity and robustness. With the determination of a few additional tuning parameters, SVM solutions are characterized by a convex quadratic optimization problem (Platt, 1999). Considering that the availability and quality of sEMG signals can vary from patient to patient, it is difficult to obtain a large number of training samples. SVM is suitable for solving learning tasks where the number of attributes is large relative to the number of training examples (Suykens et al., 2002).

Aimed at developing a control strategy for upper-limb rehabilitation robots with the capacity to understand the intention of the patients and provide the corresponding rehabilitation training, this paper proposed an sEMG-based control framework based on SVM classifiers for intention identification of the upper limb. The control strategy was applied to the upper-limb rehabilitation robot, ReRobot, to perform the rehabilitative exercise training. The motion of the patient's healthy side is first recognized from the measured and processed sEMG signals, and then the ReRobot assists the affected side in conducting the corresponding rehabilitation therapy. Based on the developed sEMG-based control strategy, self-rehabilitation training in stroke patients can be conducted.

## METHODS

#### Data Collection

Many stroke patients have trouble moving their upper limbs on the affected side, and they must receive much rehabilitation training to recover motion ability (Merletti et al., 1999). Stroke patients often show abnormal shoulder motor ability, so shoulder rehabilitation actions, including shoulder forward flexion, shoulder level adduction, and shoulder level abduction (Chen and Zhou, 2015), should be carried out.

The coordinates were defined where the coronal axis of the patient is the X-axis, the sagittal axis is the Y-axis, the vertical

axis is the Z-axis, and the acromioclavicular joint is the origin of the coordinates. Five motions were used to test the performance of this model, including shoulder flexion, abduction, internal rotation, external rotation, and elbow joint flexion, as shown in **Figure 1**.

Five healthy subjects (N = 5, age 25 ± 4 years, body mass 70 ± 5 kg, height 174 ± 6 cm, all male and all right-handed) participated in the experiments. All subjects gave their informed consent before participation. The experimental procedures were conducted in accordance with the Declaration of Helsinki and approved by the Ethic Board of Medical School, South China University of Technology. Each subject performed five repetitions in accordance with five standard motions. Since stroke patients are mostly elderly people with lower motion abilities on their healthy side compared to young adults, the designed motions were imitated as movements of elderly stroke patients on their healthy side.

For each test, the test subject did not carry weight, and the movement lasted for 1–2 s. After the end of each movement, the subject took at least 1 min to rest to prevent muscle fatigue. There are many muscles involved in the movement of the shoulder and elbow joints. Muscles play two roles in the movement process: proximal stability and distal activity. In this paper, eight superficial muscles involved in the distal shoulder and elbow movements were selected as monitoring objects. To ensure safety, patients controlled the emergency stop of the equipment according to their own comfort level. In this paper, the patient's clenched fist is used as the emergency stop action, and the flexor radialis is used as the detection channel for the emergency stop action. Therefore, a total of nine muscles were selected as test objects. The sEMG signals of nine muscles in the upper limb were acquired, including the middle deltoid, anterior deltoid, pectoralis major, biceps brachii, brachioradial muscle, ulnar flexor carpal, trapezius, posterior deltoid, and triceps brachii. The first eight muscles were used to evaluate how the muscle works while the signal changes. The ulnar flexor carpal played a role in the subject's self-initiation of the safety protection mechanism. When the subject felt uncomfortable during the test, he or she could stop the robot by clenching his or her fist and activating the ulnar flexor.

**Figure 2** shows the experimental setup. A 16-channel sEMG acquisition instrument with 1-kHz sampling frequency in each channel was used. Each channel was related to a three-channel differential electrode. After the muscle was disinfected by alcohol, the electrodes were placed along the direction of the muscle abdomen with an interval of 2 cm. **Figure 3** shows the raw sEMG signals of the eight muscles without preprocessing from one of the five healthy individuals.

## Data Processing

#### Preprocessing

sEMG signals are easily disturbed by the external environment in the acquisition process. Motion artifacts, baseline offset, and power frequency interference may all lead to distortion of the sEMG signals, which leads to poor classification accuracy (Chen et al., 2015). Data preprocessing methods of baseline correction, 20–500 Hz bandpass filter, power frequency filter (50 Hz notch), full-wave rectification, and amplitude normalization were carried out to improve the signal-to-noise ratio (SNR) of the sEMG signal. All filters used in this paper are fourth-order Butterworth filters.

#### Signal Segmentation

Although the intensity of the sEMG signal detected in each channel is different in the different movements, the signals show good synchronization (Zhang and Zhou, 2012): if the related muscles did not contract, then the sEMG signal showed a stable low-amplitude signal before the test; in contrast, the signal changed dramatically in the course of executing the action. The characteristic of this type of signal was that the signal could be segmented by a sample entropy algorithm (Liu and Zhou, 2013). The sample entropy algorithm is an efficient and time-consuming algorithm that can avoid the signal deviation caused by self-matching. Therefore, we adopted the sample entropy algorithm for data segmentation.

In the experiments, the action signals of eight channels were collected, and the muscle signal for the sample entropy analysis was from the sum of the eight-channel signals, which is:

$$\text{sEMG} \left( \text{t} \right) = \sum\_{i=1}^{M} \text{sEMG}\_i \text{(t)} \tag{1}$$

where M is the total number of channels, sEMGi(t) is the tth value of channel i, and sEMG is the sum of all channel signals.

The sample entropy can be calculated as follows:

$$\text{SampEn}\,(m, r, L) = -\ln\left[\frac{B^{m+1}(r)}{B^m(r)}\right] \tag{2}$$

where m is the dimension of the sEMG signal, r is the similar tolerance, L is the length of the muscle signal, and B <sup>m</sup> (r) is the probability of the two signal sequences

matching m points. In this study, we set m = 2 and r = 0.25∗σ, where σ is the standard deviation of the sEMG signal.

$$s^\*(n) = \begin{cases} 0, \text{ |SampEn < d|} \\ 1, \text{ |SampEn \ge d|} \end{cases} \tag{3}$$

where d (d = 0.6) is the threshold and s(n) is the judgment function of the EMG signal. When s(n) = 1, it is the effective part of the action; when s(n) = 0, it is the invalid part of the action, as shown in **Figure 4**.

#### Feature Extraction and Classification

Because of the short-term stationarity of sEMG signals, the signals need to be divided into frames. To prevent spectrum

leakage, window functions should be used for interception. Compared with window functions such as the rectangular window and triangular window, the Hanning window has the characteristics of fast side lobe attenuation and is suitable for non-stationary signals. Therefore, this paper adopts a Hanning window for framing. Hanning windows with window lengths ranging from 30 to 300 ms (Chowdhury et al., 2013) were used to extract the characteristics of the sEMG signals. To ensure the implementation of the system and the stability of the classification, 128 ms was selected as the window length, and the sliding step size was 64 ms. The root mean square (RMS), fourth-order autocorrelation factor, wavelength, variance, absolute mean, and short-term energy of each window in each channel are calculated as follows:

$$\text{RMS}\_{kj} = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} \text{(sEMG}\_{ij}\text{)}^2} \tag{4}$$

$$\text{VAR}\_{k\bar{\jmath}} = \frac{1}{N} \sum\_{i=1}^{N} \left( \text{sEMG}\_{\bar{\jmath}\bar{\jmath}} - \overline{\text{sEMG}\_{\bar{\jmath}}} \right) \tag{5}$$

$$\text{MAV}\_{k\dot{\jmath}} = \frac{1}{N} \sum\_{i=1}^{N} \left| sEMG\_{\dot{\jmath}} \right| \tag{6}$$

TABLE 1 | Classification performance—five types of motions.


$$\text{SSI}\_{kj} = \sum\_{i=1}^{N} \text{(sEMG}\_{ij}\text{)}^2 \tag{7}$$

where k represents the kth window and j is the jth channel.

#### Support Vector Machine

Support vector machine (SVM) is a machine learning method based on statistical learning theory. The characteristic behavior of SVM is to construct a high-dimensional hyperplane for small samples and non-linear models and to classify samples by calculating the maximum distance of training data points on the hyperplane (Ma et al., 2004). Due to the physical limitations of stroke patients, the sample size of the data that can be collected is small. In small-sample model training, SVM has advantages of higher stability and fewer training parameters (Raczko and Zagajewski, 2017). Therefore, SVM is a better choice than a neural network. The equation solved by the SVM algorithm after the Lagrange operator can be expressed as:

$$\begin{cases} \min \frac{1}{2} \|\mathbf{w}\|^2 + C \sum\_{i=1}^N \xi\_i \\ s.t. \mathcal{y}\_k(\mathbf{w} \bullet \mathbf{x}\_k + b) \ge 1 - \xi\_i, k = 1...N \end{cases} \tag{8}$$

where (**xk**, y<sup>k</sup> ) represents the training data of the kth window. ξi is a slack variable, which represents the magnitude of the classification error.

The radial basis kernel function can be expressed as (Chung et al., 2003):

$$\kappa(\mathbf{x\_i}, \mathbf{x\_j}) = \exp\left\{ \frac{\left\| \mathbf{x\_i} - \mathbf{x\_j} \right\|^2}{2\delta^2} \right\} \tag{9}$$

The penalty factor C and kernel function parameter δ are the main parameters that affect the performance of the model. Therefore, while training the model, the penalty factor C and kernel function parameter δ should be optimized. To optimize parameters C and δ of the model, the genetic optimization algorithm is used. The accuracy of the time series prediction is selected as the fitness function. The optimization steps of SVM parameters based on the genetic algorithm are shown in **Figure 5**.

To test the feasibility of the genetic optimization algorithm, simulations were carried out using MATLAB. Five healthy subjects (N = 5, age 25 ± 4 years, body mass 70 ± 5 kg, height 174 ± 6 cm, all male and all right-handed) were selected for shoulder flexion, abduction, pronation, and elbow flexion. Each action was performed five times for data classification and recognition.

The default value of MATLAB is C = 1; σ = 1/numfeatures, where numfeatures is the number of features. In this paper,

numfeatures = 40. The value after optimization is C = 0.4579 and σ = 371.6339. The influence of the parameter optimization is significant. Compared with the default value, the optimized value has higher classification accuracy. When the default parameters are used, the classification accuracy is 78.53%. After parameter optimization, the classification accuracy reached 94.18%.

## UPPER-LIMB REHABILITATION ROBOT PLATFORM

#### Robot System

The ReRobot is a rehabilitation robot platform developed to permit training of the upper limb in three-dimensional space, as shown in **Figure 6**. The platform is set up to support and guide the movement of the affected arm using a UR5 robot arm. UR robotic arms are lightweight, fast, easy to program, flexible, and safe robotic arms with 6 DOF (Kebria et al., 2017). The configuration can provide positioning and orientation to a patient's upper limb in the training tasks. The transmission control protocol/internet protocol (TCP/IP) was used to communicate with the robot and the MATLAB user interface. After attaching to the forearm of a stroke patient, ReRobot focuses on the rehabilitation exercise of the shoulder and elbow joints in accordance with the range of movement of the human arm, including shoulder flexion/extension, shoulder abduction/adduction, shoulder internal/external rotation, elbow joint flexion/extension, and forearm supination/pronation. The rehabilitation data, e.g., data that include the sEMG, forces, velocities, and positions, are used for analysis and ensure the safety of subject is collected in real time during the exercises. To ensure a comfortable fixation to the patient's arm, a gas bag is used as the buffer device at the attachment point with the human arm.

Safety is an important issue in the ReRobot therapy system. The safety system consists of several components, including an emergency stop switch, force sensor stops, a hand-controlled switch, and an sEMG signal stop. The emergency stop button, which is held by the experimenter, cuts power to the robot and shuts down all systems. The six-axis force sensor located at the robot arm measures interaction forces generated during the tasks. If the interaction force is abnormal, the power of the system would be automatically cut off. The hand-controlled switch, which is held by the subject, stops the movement of the robot. In the event of any abnormality, the robot stops and the patient's arm can be easily removed from the end of the ReRobot by a freely actuating mechanism and removable ends.

## Control Scheme Based on the SVM Classification of sEMG Signals

The control system of the ReRobot system is intended to develop a human–machine interface (HMI) that is able to activate the device as soon as the patient's motion intention is detected. By using the SVM system, the upper-limb movements of the healthy side can be detected and classified automatically through sEMG signals, and the ReRobot will then assist the impaired arm with that movement. The control scheme based on the SVM classification of sEMG signals was thus established, as shown in **Figure 7**.

This platform acquires the test subject's sEMG signals of the upper limb on the healthy side, and the preprocessed EMG signals from one to eight channels are used for upper limb motion recognition using the SVM classification method. The recognized action label signals are sent to the robot for motion calculation, and then the robot actuates the subject's affected arm with the position control method to perform the corresponding actions based on a presupposed trajectory.

When the patient feels discomfort in his or her arm or muscle abnormalities during the test, the whole system can be safely stopped by patient through the first clenching motion, and the sEMG signal from channel 9 is sent to the manipulator. At the same time, there is an emergency stop button in the test subject's hand throughout the test process to ensure safety.

## Experiment and Results

To test the feasibility of the upper-limb motion recognition method based on the SVM classification, simulations were carried out using MATLAB. Five movements of five normal people were collected 20 times. After processing feature extractions and label recognition of the collected data, 10 fivefold cross-validations were performed. The results are shown in **Figure 8**. The classification accuracy of each action is as follows: average recognition rate, 93.34 ± 0.59%; shoulder flexion, 92.95 ± 1.78%; shoulder external rotation, 91.44 ± 0.91%; shoulder internal rotation, 86.67 ± 1.98%; shoulder abduction, 95.98 ± 0.70%; and elbow flexion, 98.89 ± 0.42%. The chaotic matrix of one of the classifications is shown in **Figure 9**. The misclassification rate of shoulder internal rotation and shoulder external rotation is high. The main reason is that the muscles involved in the two movements have a high coincidence, so they are easily confused. However, the overall recognition rate is high, so the system can be used in the actual operation of the experimental platform.

Based on the chaotic matrix, three accuracy metrics (precision, recall, and F1-score) can be obtained (Sokolova and Lapalme, 2009). Precision describes the accuracy of the detection. Recall is the detection rate, which refers to how well the target objects are detected without being missed. The F1 score combines the precision and recall and provides a single measure of quality that is easy for end-users to understand. The precision, recall, and F1-score of SVM algorithm in classifying five different motions, including shoulder flexion, shoulder external rotation, shoulder internal rotation, shoulder abduction, and elbow joint flexion, were calculated to evaluate the performance, as shown in **Table 1**. The elbow joint flexion was detected with excellent performance (F1-score = 0.991), followed by shoulder abduction (F1-score = 0.961), shoulder flexion (F1-score = 0.921), and shoulder external rotation (F1 score = 0.881). The SVM-based classifier generally classified well and the average F1-score of five types of motions was 0.9368.

Five subjects (N = 5, age 25 ± 4 years, body mass 70 ± 5 kg, height 174 ± 6 cm, all male and all right-handed) participated in the rehabilitation training experiments, and all five subjects were able to complete robot-assisted voluntary exercises, as shown in **Figure 10**. Since these five subjects are all right-handed people, the sEMG signals of their right arms were acquired and processed. Based on the SVM classification method, their motion intentions were analyzed and used as input into the control schema of ReRobot. Thus, ReRobot can understand the desired movement of the subjects and facilitate its execution, thus providing active support to their left arms. A total of 100 actions were performed in the actual test, and 92 actions were correctly identified using the SVM-based method. The robot arm successfully assisted the subject's arm with recognized movements. Multiple security guarantees ensure the safety of the subjects in this process. The experimental results showed that SVM-based classification achieved good accuracy in rehabilitation training.

To verify the effectiveness of the safety switch based on the sEMG signal, a total of 25 tests of safety stops based on sEMG signals were performed. ReRobot had its cut power during each test, which proved the feasibility of using sEMG signals as an emergency stop in the rehabilitation system.

The experimental results showed that SVM-based classification can provide good accuracy in upper-limb motion pattern recognition and enabled patients to choose actions actively for rehabilitation. It is possible to use sEMG signals as an emergency stop button in the upper-limb rehabilitation system to ensure safety.

## CONCLUSIONS

In this paper, we investigated the feasibility of SVM classifiers for intention identification of the upper limb from sEMG signals. A new human–machine interface for self-rehabilitation training with stroke patients was developed. The upper-limb rehabilitation robot, ReRobot, could adequately understand the desired upper-limb movement and facilitate its execution, thus

## REFERENCES


providing active support to the impaired arm. Experiments with the ReRobot showed that the SVM classification based on sEMG signals can provide good accuracy in upper-limb motion pattern recognition when a time-dependent multifeature set was used.

In future research, this method to extract upper-limb intention from sEMG signals will be tested by experiments with stroke patients. The application of this classifier to upper-limb rehabilitation robots will be implemented to achieve successful clinical verification.

## ETHICS STATEMENT

This study was carried out in accordance with the recommendations of SCUT Research Ethics Guidelines and Researcher's Handbook, Ethic Board of Medical school, South China University of Technology with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethic Board of Medical school, South China University of Technology.

## AUTHOR CONTRIBUTIONS

SC, YC, and LX contributed conception and design of the study. SH carried out the experiments. SC and YC performed the formal analysis and methodology part. SC wrote the first draft of the manuscript. YC, SH, YW, HZ, XL, and LX wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

## FUNDING

This work was supported in part by the National Natural Science Foundation of China (Grant No. 51575188), National Key R&D Program of China (Grant No. 2018YFB1306201), Research Foundation of Guangdong Province (Grant No. 2016A030313492 and 2019A050505001), and Guangzhou Research Foundation (Grant No. 201903010028).


Gittler, M., and Davis, A. M. (2018). Guidelines for adult stroke rehabilitation and recovery. JAMA 319, 820–821. doi: 10.1001/jama.2017.22036


inpatient stroke rehabilitation. Neurorehabil. Neural Repair 28, 377–386. doi: 10.1177/1545968313513073


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Cai, Chen, Huang, Wu, Zheng, Li and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Toward a Brain-Inspired System: Deep Recurrent Reinforcement Learning for a Simulated Self-Driving Agent

#### Jieneng Chen<sup>1</sup> \*, Jingye Chen<sup>2</sup> , Ruiming Zhang<sup>1</sup> and Xiaobin Hu<sup>3</sup> \*

*<sup>1</sup> Department of Computer Science, College of Electronics and Information Engineering, Tongji University, Shanghai, China, <sup>2</sup> School of Computer Science, Fudan University, Shanghai, China, <sup>3</sup> Department of Computer Science, Technical University of Munich, Munich, Germany*

An effective way to achieve intelligence is to simulate various intelligent behaviors in the human brain. In recent years, bio-inspired learning methods have emerged, and they are different from the classical mathematical programming principle. From the perspective of brain inspiration, reinforcement learning has gained additional interest in solving decision-making tasks as increasing neuroscientific research demonstrates that significant links exist between reinforcement learning and specific neural substrates. Because of the tremendous research that focuses on human brains and reinforcement learning, scientists have investigated how robots can autonomously tackle complex tasks in the form of making a self-driving agent control in a human-like way. In this study, we propose an end-to-end architecture using novel deep-Q-network architecture in conjunction with a recurrence to resolve the problem in the field of simulated self-driving. The main contribution of this study is that we trained the driving agent using a brain-inspired trial-and-error technique, which was in line with the real world situation. Besides, there are three innovations in the proposed learning network: raw screen outputs are the only information which the driving agent can rely on, a weighted layer that enhances the differences of the lengthy episode, and a modified replay mechanism that overcomes the problem of sparsity and accelerates learning. The proposed network was trained and tested under a third-party OpenAI Gym environment. After training for several episodes, the resulting driving agent performed advanced behaviors in the given scene. We hope that in the future, the proposed brain-inspired learning system would inspire practicable self-driving control solutions.

Keywords: self-driving agent, brain-inspired learning, reinforcement learning, end-to-end architecture, recurrence

## 1. INTRODUCTION

Recently, research in brain science has gradually received the public's attention. Given the rapid progress in brain imaging technologies and in molecular and cell biology, much progress has been made in understanding the brain at the macroscopic and microscopic levels. Currently, the human brain is the only truly general intelligent system that can cope with different cognitive functions with extremely low energy consumption. Learning from the information processing mechanisms of the brain is clearly the key to building stronger and more efficient machine intelligence

#### Edited by:

*Caixia Cai, Agency for Science, Technology and Research (A*∗*STAR), Singapore*

#### Reviewed by:

*Keyu Wu, Nanyang Technological University, Singapore Jacques Kaiser, Research Center for Information Technology, Germany*

#### \*Correspondence:

*Jieneng Chen chenjn@tongji.edu.cn Xiaobin Hu xiaobin.hu@tum.de*

Received: *25 February 2019* Accepted: *28 May 2019* Published: *28 June 2019*

#### Citation:

*Chen J, Chen J, Zhang R and Hu X (2019) Toward a Brain-Inspired System: Deep Recurrent Reinforcement Learning for a Simulated Self-Driving Agent. Front. Neurorobot. 13:40. doi: 10.3389/fnbot.2019.00040* (Poo et al., 2016). In recent years, some bio-inspired intelligent methods have emerged (Marblestone et al., 2016; Gershman and Daw, 2017; Hassabis et al., 2017; Botvinick et al., 2019), and they are clearly different from the classical mathematical programming principle. Bio-inspired intelligence has the advantages of strong robustness and an efficient, well distributed computing mechanism. It is also easy to combine with other methods.

The mammalian brain has multiple learning subsystems. Niv (2009) categorized major learning components into four classes: the neocortex, the hippocampal formation (explicit memory storage system), the cerebellum (adaptive control system), and the basal ganglia (reinforcement learning). Among these learning components, reinforcement learning is particularly attractive to research. Nowadays, converging evidence links reinforcement learning to specific neural substrates, thus assigning them to precise computational roles. Most notably, much evidence suggests that the neuromodulator known as dopamine provides basal ganglia target structures with phasic signals that convey a reward prediction error which can influence learning and action selection, particularly in stimulus-driven habitual instrumental behaviors (Rivest et al., 2005). Hence, many efforts have been made to investigate the capability of bio-inspired reinforcement learning by applying them to artificial intelligence-related tasks (Peters and Schaal, 2008; Mnih et al., 2015; Zhu et al., 2017; Gu et al., 2017).

In recent years, deep reinforcement learning has contributed to many of the spectacular success stories of artificial intelligence (Kober et al., 2013; Henderson et al., 2018). After the initial success of the deep Q network (DQN) (Mnih et al., 2013), a variety of improved models have been published successively. Later on and based on the former discoveries, Mnih et al. (2015) proposed the Nature DQN in 2015 and introduced the replay memory mechanism to break the strong correlations between the samples. Mnih et al. (2016) proposed a deep reinforcement learning approach, in which the parameters of the deep network are updated by multiple asynchronous copies of the agent in the environment. Van Hasselt et al. (2016) suggested the Double DQN to eliminate overestimation; they added a target Q network independent from the current Q network. It was shown to apply to large-scale function approximation (Van Hasselt et al., 2016). Wolf et al. (2017) applied a deep Q network to a driving scenario in a physics simulation based track. Newer techniques included deep deterministic policy gradients and mapping an observation directly to action, both of which could operate over continuous action spaces (Lillicrap et al., 2016). Schaul et al. (2016) suggested prioritized replay, adding priority to replay memory to relieve the sparse reward and slowly converge on the problem (Schaul et al., 2016). Reviewed in Hassabis et al. (2017), experience replay was inspired by theories that seek to understand how the memory system in the mammalian brain might interact, and thus has biological plausibility. In the case of partially observable states, the recurrent neural network (RNN) and long short-term memory (LSTM) have been proven to be effective in processing sequence data (Hochreiter and Schmidhuber, 1997). Hausknecht and Stone (2015) replaced the last fully connected layer in the network with an LSTM layer. They integrated information through time and replicated DQN's performance on standard Atari games and partially observed equivalents featuring flickering game screens. Also, Foerster et al. (2016) proposed to use multi-agent to describe a state distributively. A recent work (Kahn et al., 2018) adopted double Q learning with recurrency and computation graphs to tackle a robotics navigation task. Nevertheless, some previous studies, such as those with Atari games, focused on the simple environment and action space. Moreover, the previous studies do not provide comprehensive comparisons with supervised learning in a specific scenario. Because of these limitations, there is an urgent need to further improve the capability of deep reinforcement learning in a more challenging and complex scenario such as the simulated driving control problem.

To clarify the biological plausibility, Lake et al. (2017) state that there is indeed substantial evidence that the brain uses similar model-free RL learning algorithms in simple associative learning or discrimination learning tasks. In particular, the phasic firing of midbrain dopaminergic neurons is qualitatively and quantitatively consistent with the reward prediction error that drives updating of value estimates. In the process of reinforcement learning, the agent's attempt in each state was like the regulation process of dopamine in the brain (Dolan and Dayan, 2013). To the best of our knowledge, there is rare work studying the behavior difference between supervised learning and RL in a specific scenario. In the kart driving case in this work, the proposed learned agent shows stronger biological plausible learning capability than the supervised learned agent, in respect to dealing with specific situations and its adaptability.

One supervised learning-based study looked at the simulated self-driving game (Ho et al., 2017). However, three problems existed in their implementation. First, they created a handcrafted dataset. Obviously, one can never create this ideal benchmark dataset that includes all the bad situations encountered by the driving agent during training. At best, one can include the best behavior that the driving agent should implement in each step. The driving agent was reported to perform well when it had a good position in the driveway. However, the behavior deteriorated rapidly when the driving agent deviated from the driveway. Such behaviors indicated the dissimilar distribution and instability even though correctional measures were taken on the dataset. Second, they trained and supervised their network in a supervised way. As there are many possible scenarios, manually tackling all possible cases using supervised learning methods will likely yield a more simplistic policy (Shalev-Shwartz et al., 2016). Third, their experiments were built on ideal conditions; for example, they assumed that the brakes were ignored. In our experiments, we take the brakes into consideration. Moreover, to support autonomous capabilities, a robotic driven agent should adopt human driving negotiation skills when braking, taking left and right turns, and pushing ahead in unstructured roadways. It comes naturally that a trial-and-error way of study is more suitable for this simulated self-driving game. Hence, the bioinspired reinforcement learning method in the study is a more suitable way for the driving agent to learn how to make decisions.

In our study, we proposed a deep recurrent reinforcement learning network to solve simulated self-driving problems. Rather than creating a handcrafted dataset and training in a supervised way, we adopted a bio-inspired trail-and-error technique for the driving agent to learn how to make decisions. Furthermore, this paper provides three innovations. First, intermediate game parameters were completely abandoned, and the driving agent relied on only raw screen outputs. Second, a weighting layer was introduced in the network architecture to strengthen the intermediate effect. Third, a simple but effective experience recall mechanism was applied to deal with the sparse and lengthy episode.

The rest of this study is organized as follows: section 2 describes deep Q-learning, recurrent reinforcement learning, network architecture, and implementation details. Section 3 verifies experimental results. The conclusion of this study is drawn in section 4.

#### 2. METHODOLOGY

Deep Q-learning is used to help AI agents operate in environments with discrete actions spaces. Based on the knowledge of Deep Q-learning, we proposed a modified DRQN model in order to infer the full state in partially observable environments.

#### 2.1. Deep Q-Learning

Reinforcement learning manages learning policies for an agent interacting in an unknown environment. In each step, an agent observes the current states of the environment, makes decisions according to a policy π, and observes a reward signal r<sup>t</sup> (Lample and Chaplot, 2017). Given the current states and a set of available actions, the main aim of the DQN is to approximate the maximum sum of discounted rewards. According to the Bellman equation, it gives the approximating form of Q-values by combining the reward obtained with the current state-action pair and the highest Q-value at the next state st+1, and the best action a′:

$$Q(s\_t, a\_t) \leftarrow r\_t + \mathcal{y} \ast \arg\max\_{a'} Q\left(s\_{t+1}, a'\right) \tag{1}$$

We often use the form involving an iterative process:

$$Q(\mathbf{s}\_t, a\_t) \leftarrow Q(\mathbf{s}\_t, a\_t) + \alpha \left( r\_t + \boldsymbol{\nu} \ast \arg \max\_{a'} Q\left(\mathbf{s}\_{t+1}, a'\right) - Q(\mathbf{s}\_t, a\_t) \right) \tag{2}$$

In the assignments above, α stands for the learning rate and γ stands for the discounted factor.

The agent chooses the action following a ε-greedy exploration policy. The value of ε ranges from 0.0 to 1.0. In order to encourage the agent to explore the environment, the ε was set to 1.0 at first. During the training process, the value decreased gradually as the experience accumulated. Then, the agent could use experience to complete the task.

When we sample a sequence (s<sup>t</sup> , a<sup>t</sup> ,rt ,st+1) from the replay memory unit, the target value y<sup>t</sup> is calculated as:

$$\nu\_t = \begin{cases} r\_t & \text{for terminal } s\_{t+1} \\ r\_t + \nu \text{ arg}\max\_{a'} Q\left(s\_{t+1}, a'|\theta\right) & \text{for non-terminal } s\_{t+1} \end{cases} \tag{3}$$

The network was trained to approximate the expected Q-value, which led to the loss function, with parameters θ in the model:

$$\text{Loss}(\theta) = \sum \left( \mathbf{y}\_t - Q(\mathbf{s}\_t, a\_t | \theta) \right)^2 \tag{4}$$

#### 2.2. Recurrent Reinforcement Learning

For some special games which are three-dimensional and partially observable, the DQN lacks the ability to solve the problem. In partially observable environments, the agent only receives an observation o<sup>t</sup> of the current environment, which is usually insufficient to infer the full state of the system. The real state s<sup>t</sup> is the combination of the current observation o<sup>t</sup> and an unfixed length of history states. Hence, we adopted the DRQN model on top of the DQN to deal with such conditions (see **Figure 1**). The last fully connected layer was replaced by the LSTM in the DRQN model in order to record former information. **Figure 2** shows the sequential updates in the recurrent network. When updating the DRQN model, a sequence S was randomly sampled from the replay memory unit, and the beginning time step t was also randomly chosen according to the maximum length l. Then the cut sequence St,t+1,...,l−1,<sup>l</sup> was sent to the DRQN model. An additional input ht−<sup>1</sup> standing for the previous information was added to the recurrent model. At the zero time step, ht−<sup>1</sup> was set to zero. The output of the LSTM z ot , ht−<sup>1</sup> , which combined the current observation o<sup>t</sup> and the history information ht−1, was used to approximate the Q-value Q ot , ht−1, a<sup>t</sup> . The history information was updated and passed through the hidden state to the network in the next time step:

$$h\_t = LSTM\left(h\_{t-1}, o\_t\right) \tag{5}$$

#### 2.3. Network Architecture

In the beginning, we used the baseline DRQN model (Lample and Chaplot, 2017) to make an agent perform self-driving. However, we obtained unsatisfied results with the same model. Hence, we have made three improvements in our modified model to make things work better. The whole architecture is shown in **Figure 3**. First, the network was built on top of the NVIDIA's autopilot model (Krizhevsky et al., 2012). To reduce overfitting, the original model was modified by adding several batch normalization layers. We used a four-layer stronger CNN for feature extraction. The input size was resized to 320∗240. The first convolutional layer contained 32 kernels, with a size size of 8∗8 and a stride of 4. The second convolutional layer contained 64 kernels, with a size of 4∗4 and a stride of 4. The third layer contained 128 kernels, with a size of 3∗3 and a stride of 1. The last convolutional layer contained 256 kernels with a size of 7∗7 and a stride of 1. Relu was used as the activated function in the network, and the sizes of the pooling layers were all 2∗2. Second, we abandoned the fully connected layers before the LSTM layer in the original DRQN model and fed the LSTM directly with the

FIGURE 1 | The modified DRQN model. The value function was divided into two categories: the current value function *Q* and target value function *Q*′. The parameters in *<sup>Q</sup>* were assigned to *<sup>Q</sup>*′ per N episodes. The state contained two elements: *<sup>o</sup><sup>t</sup>* gained from the current environment and *<sup>h</sup>t*−<sup>1</sup> gained from former information. The agent performed action *a* using a specific policy, and the sequence (*ot* , *a*, *r*, *ot* ′) was stored in the replay memory unit. We used a prioritized experience replay memory unit here. During training, the sequence was randomly chosen from the replay memory unit. We trained the network using gradient descent to make the current value function *Q* approach *Q*′ given a specific sequence. The loss function was shown in Equation (4).

high-level feature. The number of units in LSTM was set to 900. Third, the subsequent structure was divided into two groups for different purposes. One was mapped to the set of possible actions, and the other was a set of scalar values. The final action value function was value function was acquired using both of them. We will introduce their functions respectively.

We used two collateral layers rather than a single DRQN network to approximate the value function. The auxiliary layer also had a five-dimensional output as the action space layer. The original target was to balance the impact of the current and history. The agent could make a suitable action with a fully observing perspective. Nevertheless, we wanted to focus on the precise instantaneous changes of the current scene. The output of the auxiliary layer was mapped to [0, 1.0] using the softmax function, thus suggesting the correction for the raw approximations using the DRQN model. Because of this intervention, the network would not only learn the best action for a state but also understand the significance of taking actions. If an agent drove in a straight line and had an obstacle far ahead, an original DRQN model could learn that it was time to make

the driving agent move a bit to avoid hitting the obstacle. The modified model would also have insight into when it was the best time to move, with the knowledge that the danger of the obstacle increases as it gets closer. V(s, a) was used to represent the original output of the DRQN, and I(s, a) was used to represent the importance provided by the auxiliary layer. We used the formula to express the final strengthen of the Q-value (see **Figure 3**):

$$Q(s, a) = V(s, a)^T \* I(s, a) \tag{6}$$

The result was stored in the prioritized experience memory unit. During training, the sequence in the memory unit was removed, and we used Equation (4) to calculate the loss.

## 2.4. Implementation Details

Reinforcement learning consists of two basic concepts: action and reward. Action is what an agent can do in each state. Given that the screen is the input, a robot can take steps within a certain distance. An agent can take finite actions. When a robot takes an action in a state, it receives a reward. Here, the term reward is an abstract concept that describes the feedback from the environment. A reward can be positive or negative. When the reward is positive, it corresponds to our normal meaning of reward. However, when the reward is negative, it corresponds to what we usually call punishment. We also describe the training details such as the hyperparameters, input size selection, frameskip, and prioritized replay.

#### 2.4.1. Action Space

The agent could perform five actions, including Left, Right, Straight, Brake, and Backwards. The range of the joystick reflects the numerical value of the speed control, which is mapped into a region of –80 to 80. The speed section was discretized into a set of [0, 20, 40, 80] for speed control. We considered that only the turning control had a high requirement for precision to keep the model simple. Another three actions, including forward flag, backward flag and brake, were represented by 1/0 flag. Thus, we used 5-dimensional vectors to represent each action as shown below:

$$\begin{aligned} actions &= [40, 0, 1, 0, 0], \text{ } & \text{left} \\ &= [-40, 0, 1, 0, 0], \text{ } & \text{right} \\ &= [0, -80, 0, 1, 0], \text{ } & \text{\$go\$} \text{ } & \text{\$mark\$} \text{wards} \\ &= [0, 0, 1, 0, 0], \text{ } & \text{\$go\$} \text{ } & \text{\$mark\$} \\ &= [0, 0, 0, 0, 1], \text{ } & \text{\$brake} \end{aligned}$$

The meaning of each dimension in the vector represented forward speed, backward speed, backward flag, forward flag, and brake. For forward speed, a positive value indicates to the left and a negative value indicates to the right. When the backward flag was set to 1, the agent would move backwards at a specific speed.

#### 2.4.2. Reward

Mnih et al. (2015) states that end-to-end human-level RL control draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within the primate visual cortex. The AI trained in a supervised way would only respond to visual information. If the kart picks the wrong direction, it would likely drive straight since the scenes of the correct and wrong direction are mostly the same. In other words, the supervised learned agent does not understand risky situations that are likely to lead into error states during real-time play. In contrast, an agent receiving reward and punishment signals can avoid the noted situation efficiently. Under most circumstances, the driving agent cannot explore a path with big rewards initially. The driving agent often gets stuck somewhere in halfway through and waits for the time to elapse before resetting. We have to make the rewards of these cases variant in order to make these experiences meaningful. Hence, we set a series of checkpoints along the track. A periodical reward was given to the driving agent when each checkpoint was reached. The closer

the wall element in the original graph. That means the agent could pay attention to the wall and then take suitable actions. Panel (A) shows the direct view visualization of the first layer of CNN. Panel (B) shows the visualization of the last layer of CNN using deconvolution.

the distance between the checkpoint and the destination, the bigger the phased reward was given. We have established a more precise reward system to increase density, such as giving the driving agent a slight punishment when it moves backwards. At each step, the agent will also get a little punishment. The detailed component of the reward will be introduced in section 2.4.4.

#### 2.4.3. Prioritized Replay

Hassabis et al. (2017) states that experience replay was directly inspired by theories that seek to understand how the memory systems in the mammalian brain might interact. According to the biological plausibility mentioned in Hassabis et al. (2017) and Schaul et al. (2016), we modified the original replay mechanism. In most cases, there was little replay memory with high rewards, which would be timeconsuming with a huge replay table and many sparse rewards. As prioritized replay was a method that can make learning from experience replay more efficient, we simply chose the important experiences in proportion to their rewards and stored them into the replay memory. That way, memories with high rewards would have a greater opportunity to be recalled.

#### 2.4.4. Hyperparameters

The original screen outputs were three channel RGB images. They were first transformed into gray-scale images and then fed to the network to train. The network was trained using the RMSProp algorithm. The size of minibatch was set to 40. The size of replay memory was set to contain 10,000 recent frames. The learning rate α was set to 1.0 in the beginning and followed a linear degradation and finally was fixed at 0.1. The exploration rate ε was set to 0 when we evaluated the model. When the agent finished the game, it would get 1,000 scores as reward. Whenever it crossed each lap, it would get 100 scores. The agent would get 0.5 scores as it got to a checkpoint. If the agent moved backwards, it would get –0.5 scores as punishment. At each time step, it would get –0.1 scores as punishment because we wished the agent to finish the game as quickly as possible.

#### 2.4.5. Other Details

To accelerate training and save running memory, the original 640∗480 screen resolution was resized to 160∗120 in the beginning. After several hours of trials, the driving agent still got stuck in most cases and could not complete one lap. The resulting rewards oscillated for not finishing the game in the limited number of steps, thus indicating the resolution was too low for the model to recognize. To enrich the visual information and address the above problems, the input size was set to 320∗240. We also attempted to reduce the punishments to encourage positive rewards. After the observation of the same length of time, the distribution of the resulting rewards became steady and started to turn positive over the baseline. Hence, the input size of the resolution was finally set to 320∗240 in the experiment in spite of the memory consumption. The system started to learn successfully within the acceptable limit of time.

Since slight change occurs between adjacent frames, we utilized the frame-skip technique (Bellemare et al., 2013). We took out one frame as the network input by every k + 1 frame, and the same action was repeated over the skipped frames. When k became higher, the training speed became high as well. However, the information the agent got became imprecise as well. In order to achieve the balance of low computing resource consumption and smooth control, we finally choose a frame skip of k = 3 by relying on our experience.

## 3. EXPERIMENTS

The model was trained using three different tracks, which cover all the track that Stanford used for comparison. An individual set of weights was trained separately for each model because each track has different terrain textures. The rewards were low initially because it is equivalent to a random exploration at the beginning of training and because the driving agent would get stuck somewhere without making any significant progress. After about 1,400 episodes of training, the driving agent finished the race. Under most circumstances, the driving agent did not finish the race in given steps so the reward was positive but not as high as receiving the final big reward. We set this step limit because of a lack of a reset mechanism for dead situations, which was very useful in the early stage of training. In Stanford's report, they created a DNF flag to represent the driving agent getting stuck. In our experiment, the driving agent had learned better policy and displayed better behaviors, proving better robustness of the system. We also visualized the CNN in order to validate the ability of the model.

#### 3.1. Experimental Environment

We chose the car racing game to carry out our simulated self-driving experiment. In order to play the car racing game autonomously, we used a third-party OpenAI Gym environment wrapper for the car racing game developed by Bzier<sup>1</sup> . With the assistance of the API, we accessed the game engine directly and ran our code while playing the game frame-by-frame. By means of the API, we can easily get the game information, whether it is screen output intermediate game parameters, such as its location in the small map. Our models proved to efficiently handle the observable gaming environments. To demonstrate the agent's driving status, we include a Youtube link at https://youtu.be/KV-hh8N5x3M.

#### 3.2. Rewards Analysis

For comparison, the model was trained and tested using the same tracks like those used by Stanford in their supervised learning method. **Figure 4** shows the rewards trends for two different maps. From the rewards trends we can observe that at about 400 episodes, the stability of the driving agent began to increase. The reward stabilized at a high level after 1400 episodes, where the agent performed a good driving behavior and finished the tracks well. Each track has different terrain textures and difficulty routes. Therefore, an individual set of weights was

<sup>1</sup>https://github.com/bzier/gym-mupen64plus

TABLE 1 | Performance comparison.


*For each track, we run 10 races in real time and calculate the mean race times as the final result. The Stanford results were borrowed from the Stanford report. Human results are obtained by real human participants playing each track twice: The first time is to get used to the track, and the second time is to record the time they finish the game. DNF*\* *signifies that the autopilot got stuck and was unable to finish some number of races.*

TABLE 2 | System comparison between our work and Stanford's supervised model.


trained separately for each model. The experiments were carried out on a common configured portable laptop, and all models converged after spending over 80 h each.

#### 3.3. CNN Visualization

CNN layers were used to extract abundant information in the scene, and the result of the high-level feature was the critical measurement of the training process. In the traditional supervised learning, the network could be evaluated from many methods such as the validation accuracy and the loss function. However, in the reinforcement learning, we did not have this kind of method to provide a quantitative assessment of the model. Hence, we visualized the output of the first layer and the last layer (see **Figure 5**), with the aim of ascertaining the quantitative features that are captured by the network. The highlevel low-level layer output seemed quite abstract through direct observation. Thus, we visualized the high-level features through deconvolution. Through the visualization procedure, we were sure that the network could capture the important element in the scene.

#### 3.4. Result and Discussion

The test results are shown in **Table 1**. The driving agent was tested on several tracks in line with those of Stanford's experiment such as Farm, Raceway, and Mountain. We evaluated our model based on the time. For each track, we ran 10 races in real-time and calculated the mean race times as the final result. The human results (Ho et al., 2017) and the Stanford results were borrowed from the Stanford report. Human results are obtained by real human participants playing each track two times: The first time is to get used to the track, and the second time is to record the time they finish the game.

The proposed model shows some advantages via a comprehensive comparison in **Table 2**. Firstly, some of the actions such as braking and going backwards are important in the driving kart scenario. Stanford's paper reported that the agent is unable to handle situations where the agent may have to turn around or drive backward, and thus would lead to getting stuck. Secondly, their AI is more sensitive to positive visual information. If the kart picks the wrong direction, it would likely drive straight since the visual scenes of the true and wrong direction are mostly the same. In other words, their AI does not understand risky situations that are likely to lead to error states during real-time play. Thirdly, the Stanford model was trained in a handcrafted dataset collected from 18,658 training examples across four tracks, three of which were also used for testing. If they want to generalize their model to other tracks, they need to collect new data and annotations, which might be expensive and unfeasible. Intuitively, our proposed model attempts to learn actions by trial and error without a huge amount of labels and handcrafted datasets.

By analyzing the route the agent runs, we found that the route was not as smooth as that in Stanford's experiment, for which there are two reasons. On one hand, ǫ decayed too fast. As shown in **Figure 5**, the value of ǫ rapidly decreased to 0.1 while the rewards increased. Then, during the latter phase of training, the agent depended mainly on experiences even though there were still many better state-action sets to explore. On the other hand, the actions was discretized roughly. We used a set of [0; 20; 40; 80] as the choices for speed. Through observation, speeds of 20 and 40 both produce a tiny effect while a speed of 80 would make a radical change. More efforts should put on the selection of numerical value of the joystick parameter. However, compared with the experiment done by the Stanford group, our experiments performed well even if the driving agent deviated from the driveway. We considered the brake and trained the driving agent using a trialand-error method, which was more in line with the real situation. Hence, the bio-inspired reinforcement learning method in the study was a more suitable approach for the driving agent to make decisions.

## 4. CONCLUSION

Brain-inspired learning has recently gained additional interest in solving control and decision-making tasks. In this paper, we propose an effective brain-inspired end-to-end learning method with the aim of controlling the simulated self-driving agent. Our modified DRQN model has proven to manage plenty of error states effectively, thus indicating that our trial-and-error method using deep recurrent reinforcement learning could achieve better performance and stability. By using the screen pixels as the only input of the system, our method highly resembles the experience of human beings solving a navigation task from the first-person perspective. This resemblance makes this research inspirational for realworld robotics applications. Hopefully, the proposed braininspired learning system will inspire real-world self-driving control solutions.

## DATA AVAILABILITY

No datasets were generated or analyzed for this study.

#### REFERENCES


## AUTHOR CONTRIBUTIONS

JieC, JinC, RZ, and XH carried out the conception and design of the study, the analysis and interpretation of the data, and drafted and revised the article.

## FUNDING

This work was financially supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program. This research was also funded by the Chinese Ministry of Education's National University Student Innovation and Entrepreneurship Training Program (2018).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chen, Chen, Zhang and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Deep Recurrent Neural Networks Based Obstacle Avoidance Control for Redundant Manipulators

Zhihao Xu<sup>1</sup> , Xuefeng Zhou<sup>1</sup> and Shuai Li <sup>2</sup> \*

*<sup>1</sup> Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligence Manufacturing, Guangzhou, China, <sup>2</sup> School of Engineering, Swansea University, Swansea, United Kingdom*

Obstacle avoidance is an important subject in the control of robot manipulators, but is remains challenging for robots with redundant degrees of freedom, especially when there exist complex physical constraints. In this paper, we propose a novel controller based on deep recurrent neural networks. By abstracting robots and obstacles into critical point sets respectively, the distance between the robot and obstacles can be described in a simpler way, then the obstacle avoidance strategy is established in form of inequality constraints by general class-K functions. Using minimal-velocity-norm (MVN) scheme, the control problem is formulated as a quadratic-programming case under multiple constraints. Then a deep recurrent neural network considering system models is established to solve the QP problem online. Theoretical conduction and numerical simulations show that the controller is capable of avoiding static or dynamic obstacles, while tracking the predefined trajectories under physical constraints.

#### Edited by:

*Changhong Fu, Tongji University, China*

#### Reviewed by:

*Haifei Zhu, Guangdong University of Technology, China Yongping Pan, National University of Singapore, Singapore*

#### \*Correspondence:

*Shuai Li shuaili@ieee.org*

Received: *15 April 2019* Accepted: *17 June 2019* Published: *04 July 2019*

#### Citation:

*Xu Z, Zhou X and Li S (2019) Deep Recurrent Neural Networks Based Obstacle Avoidance Control for Redundant Manipulators. Front. Neurorobot. 13:47. doi: 10.3389/fnbot.2019.00047* Keywords: recurrent neural network, redundant manipulator, obstacle avoidance, zeroing neural network, motion plan

## 1. INTRODUCTION

As industrial automation develops, robot manipulators have been used in a wide range of applications such as painting, welding, assembly, etc., (Cheng et al., 2009; Yang et al., 2018a). With the evolution of intelligent manufacturing, the way robot works is also changing. In order to fulfill more difficult tasks in complex environment, the robot is required to have better execution capabilities (Pan et al., 2018). Therefore, robots with redundant DOFs have attracted much attention in the field of robotic control since its wonderful flexibility (Chan and Dubey, 1995; Zhang, 2015).

Obstacle avoidance is a core problem in the control of redundant manipulators, in order to realize human-machine collaboration and integration, robots no longer work in a separate environment that is completely isolated (Ge and Cui, 2000; Sugie et al., 2003; Lee and Buss, 2007). Instead, collaboration is required between human or other robots, as a result, the obstacle avoidance control is becoming a matter of urgency: robots need to achieve real-time avoidance of static or dynamic obstacles while completing given motion tasks.

Many obstacle avoidance methods for robot manipulators haven been reported, which are designed online or off-line. Based on stochastic sampling algorithm, a series of obstacle avoidance methods are proposed, these methods could obtain effective solutions even in ultra-redundant systems. In Wei and Ren (2018), Wei et al. propose a modified RRT based method, namely Smoothly RRT, in which a maximum curvature constraint is built to obtain a smooth curve when avoiding obstacles, simulation results also show that the method achieves faster convergence than traditional RRT based ones. In Hsu et al. (2006), Hsu discusses the probabilistic foundations of PRM based methods, a conclusion is drew that the visibility properties rather than dimensionality of C has a greater impact on the probability, and the convergence would be faster if extract partial knowledge could be introduced. However, due to the large computational costs, those methods can be hardly used online.

Different from stochastic results obtained by above mentioned methods, artificial potential field methods plan the same path each time in the same environment, which is important in industrial applications (Khatib, 1986). The basic idea of artificial potential field methods is that the target bears as an attractive pole while the obstacle creates repulsion on the robot, then the robot will be controlled to converge to the target without colliding with obstacles. At the same time, artificial potential field methods have shown great ability in tracking dynamic targets as well as avoiding dynamic obstacles. In Csiszar et al. (2011), a modified method is proposed, which describes the obstacles by different geometrical forms, both theoretical conduction and experimental tests validate the proposed method. Considering the local minimum problem that may be caused by multi-link structures, in Badawy (2016), a two minima is introduced to construct potential field, such that a dual attraction between links enables faster maneuvers comparing with traditional methods. Other improvements to artificial potential field method can be found in Tsai et al. (2001); Tsuji et al. (2002); Wen et al. (2017). Taking advantage of redundant DOFs, obstacles can be avoided by the self-motion in the null space, by calculating pseudoinverse of Jacobian matrix, the solution can be formulated as the sum of a minimum-norm particular solution and homogeneous solutions (Cao et al., 1999; Moosavian and Papadopoulos, 2007; Krzysztof and Joanna, 2016).

The application of artificial intelligence algorithms based on neural networks provide a new idea for robotic control, these methods are considered to be very promising since its excellent learning ability (Jung and Kim, 2007). For instance, in Pan et al. (2017), the authors propose a command-filtered back-stepping method, in which a neural network based learning scheme is introduced to deal with functional uncertainties. In Pan and Yu (2017), a biomimetic hybrid controller is established, in which the control strategy consist of a feed-forward predictive machine based on a RBF Neural Network and a feedback servo machine based on a proportional-derivative controller. In Fu et al. (2018), a fuzzy logic controller is proposed for long-term navigation of quad-rotor UAV systems with input uncertainties. Experiment results show that the controller can achieve better control performance when compared to their singleton counterparts. In Fu et al. (2019), an online learning mechanism is built for visual tracking systems. The controller uses both positive and negative sample importances as input, and it is shown that the proposed weighted multiple instance learning scheme achieves wonderful tracking performance in challenging environments. Typically, the structure of a neural network may be complex in order to achieve better performance. Although the model of robot manipulator is highly nonlinear, by introducing the priori information of the system model, the neural network can be optimized, i.e., the number of nodes in neural networks can be reduced effectively while maintaining the learning efficiency (Fontaine and Germain, 2001). Inspired by this, a series dynamic neural networks are proposed to realize robotic control in realtime (Zhang et al., 2004; Li et al., 2017; Yang et al., 2018b). Based on the idea of constraint-optimization, quadraticprogramming approaches haven been introduced for kinematic control of redundant manipulators. The designed outer-loop controller is described as equality constraints, and objective functions are established to describe certain performance of the system. Using the learning and parallel calculation ability, dynamic neural networks are established to solve the quadraticprogramming problem online. The kinematic control is thus achieved by ensuring the equality constraints, and the flexibility is used by optimizing the objective functions. On the other hand, these methods is capable of handling inequality constraints and model uncertainties (Zhang et al., 2018; Li et al., 2019; Xu et al., 2019b). In Cheng et al. (1993), the obstacle avoidance strategy is described as equality constraints, but the parameters of escape velocity is difficult to obtain. In Zhang and Wang (2004), Zhang et al. propose an inequality based method, in which the distance between the robot and obstacles are formulated as a group of distances from critical points and robot links. On this basis, an improved method is proposed by Guo et al. in Guo and Zhang (2012), which is capable of suppressing undesirable discontinuity in the original solutions.

Motivated by the above observations, in this paper, we proposed a novel obstacle avoidance strategy based on deep recurrent neural networks. By abstracting robot and obstacles as a set of critical points, the distances between the robot and obstacles are approximately described by a group of pointto-point distances. And the obstacle avoidance is realized by inequality constraint described by class-K functions. Then the obstacle avoidance problem is reformulated as a QP problem in the speed level, and a deep recurrent neural network is designed to solve the QP online. Numerical results show that the robot is capable of avoiding the obstacles while tracking the predefined trajectories. Before ending this section, the main contributions of this paper are summarized as below


## 2. PROBLEM FORMULATION

## 2.1. Basic Description

When a redundant robot is controlled to track a particular trajectory in the cartesian space, the positional description of the

$$
\alpha = f(\theta),
\tag{1}
$$

where x ∈ R <sup>m</sup> and θ ∈ R n are the end-effector′ s positional vector and joint angles, respectively. In the velocity level, the kinematic mapping between x˙ and θ˙ can be described as:

$$
\dot{\mathfrak{x}} = J(\theta)\dot{\theta}, \tag{2}
$$

where J(θ) ∈ R m×n is the Jacobian matrix from the end-effector to joint space.

In engineering applications, obstacles are inevitable in the workspace of a robot manipulator. For example, robot manipulators usually work in a limited workspace restricted by fences, which are used to isolated robots from humans or other robots. This problem could be even more acute in tasks which requires collaboration of multiple robots. Let C<sup>1</sup> be the set of all the points on a robot body, and C<sup>2</sup> be the set of all the points on the obstacles, then the purpose of obstacle avoidance of a robot manipulator is to ensure C<sup>1</sup> ∪C<sup>2</sup> = ∅ at all times. By introducing d as a safety distance between the robot and obstacles, the obstacle avoidance is reformulated as

$$|O\_j A\_i| \ge d, \qquad \forall A\_i \in C\_1, \forall O\_i \in C\_2. \tag{3}$$

where |OjA<sup>i</sup> | = q (A<sup>i</sup> − Oj) <sup>T</sup>(A<sup>i</sup> − Oj) is the Euclidean norm of the vector AiO<sup>j</sup> .

Equation (3) gives a basic description of obstacle avoidance problem in form of inequalities. However, there are too many elements in sets C<sup>1</sup> and C2, the vast majority of which are actually unnecessary. Therefore, by uniformly selecting points of representative significance from C<sup>1</sup> and C2, and increasing d properly, Equation (3) can be approximately described as below:

$$|O\_{j}A\_{i}| \geq d,\tag{4}$$

with A<sup>i</sup> , i = 1, . . . , a and O<sup>j</sup> , j = 1, . . . , b being the representative points of the robot and obstacles, respectively. The schematic diagram of Equation (4) in shown in **Figure 1**.

Remark. 1 In real implementations, there are many ways to measure |OjA<sup>i</sup> |. For instance, since physical structure of

the a manipulator is known, the key points A<sup>i</sup> are available in advance, both positions and velocities of those points can be calculated directly using the feedback of robot joints. The real-time measurement of obstacles can be achieved through industrial cameras. Therefore, the information of A<sup>i</sup> and B<sup>j</sup> are all available. As to measurement noise, by introducing a bigger safety distance d, e.g., d = 1.5(d<sup>1</sup> +d2), the safety can be ensured.

#### 2.2. Reformulation of Inequality in Speed Level

In order to guarantee the inequality (4), by defining D = |OjA<sup>i</sup> |− d, an inequality is rebuilt in speed level as:

$$\text{d}(|O\_j A\_i|)/\text{dt} \succeq -\text{sgn}(D)\text{g}(|D|),\tag{5}$$

in which g(•) belongs to class-K. Remarkable that the velocities of critical points A<sup>i</sup> can be described by joint velocities:

$$
\dot{A}\_i = J\_{ai}(\theta)\dot{\theta},\tag{6}
$$

where Jai ∈ R m×n is the Jacobian matrix from the critical point A<sup>i</sup> to joint space. If O<sup>j</sup> is prior known, the left-side of Equation (5) can be reformulated as:

$$\begin{split} \frac{d}{dt} \langle |O\_{j}A\_{i}| \rangle &= \frac{d}{dt} (\sqrt{(A\_{i} - O\_{j})^{\mathrm{T}}(A\_{i} - O\_{j})}) \\ &= \frac{1}{|O\_{j}A\_{i}|} (A\_{i} - O\_{j})^{\mathrm{T}} \langle \dot{A}\_{i} - \dot{O}\_{j} \rangle \\ &= |\overrightarrow{O\_{j}A\_{i}}|^{\mathrm{T}} J\_{ai}(\theta)\dot{\theta} - |\overrightarrow{O\_{j}A\_{i}}|^{\mathrm{T}} \dot{O}\_{j}, \end{split} \tag{7}$$

where −−−→ <sup>|</sup>OjA<sup>i</sup> | = (A<sup>i</sup> − Oj) T /|OjA<sup>i</sup> | ∈ R <sup>1</sup>×<sup>m</sup> is the unit vector of −−−−→ <sup>A</sup><sup>i</sup> <sup>−</sup> <sup>O</sup><sup>j</sup> . Therefore, the collision between critical point A<sup>i</sup> and object O<sup>j</sup> can be obtained by ensuring the following inequality:

$$J\_{ol}\dot{\theta} \le \text{sgn}(D)\text{g}(|D|) - |\overrightarrow{O\_jA\_i}|^T \dot{O}\_j,\tag{8}$$

where Joi = − −−−→ <sup>|</sup>OjA<sup>i</sup> | T Jai ∈ R 1×n . Based on the inequality description (8), the collision between robot and obstacle can be avoided by ensuring:

$$J\_o \dot{\theta} \le B,\tag{9}$$

where J<sup>o</sup> = [J T o1 , · · · , J T o1 | {z } b , · · · , J T oa, · · · , J T oa | {z } b ] <sup>T</sup> ∈ R ab×n is the concatenated form of Joi considering all pairs between A<sup>i</sup> and Oj , B = [B11, · · · , B1<sup>b</sup> , · · · , Ba1, · · · , Bab] <sup>T</sup> ∈ R ab is the vector −−−→ <sup>|</sup>OjA<sup>i</sup> TO˙

of upper-bounds, in which Bij = sgn(D)g(|D|) − | j . Remark 2: According to 5 the definition of class-K functions, the original escape velocity based obstacle avoidance methods in Zhang and Wang (2004); Guo and Zhang (2012) can be regarded as a special case of 5, in which G(|D|) is selected as G(|D|) = k|D|. Therefore, in this paper, the proposed obstacle avoidance strategy is more general than traditional methods.

#### 2.3. QP Type Problem Description

As to redundant manipulators, in order to take full advantage of the redundant DOFs, the robot can be designed to fulfill a secondary task when tracking a desired trajectory. In this paper, the secondary task is set to minimize joint velocity while avoiding obstacles. In real implementations, both joint angles and velocities are limited because of physical limitations such as mechanical constraints and actuator saturation. Because of the fact that rank (J) < n, there might be infinity solutions to achieve kinematic control. In this paper, we aim to design a kinematic controller which is capable of avoiding obstacles while tracking a pre-defined trajectory in the cartesian space. For safety′ s sake, the robot is wished to move at a low speed, on the other hand, lower energy consumption is guaranteed. By defining an objective function scaling joint velocity as θ˙<sup>T</sup> θ/˙ 2, the tracking control of a redundant manipulator while avoiding obstacles can be described as:

$$\min \quad \theta^{\mathrm{T}} \theta / 2,\tag{10a}$$

$$\text{s.t.} \quad \mathfrak{x} = \mathfrak{x}\_{\mathsf{d}}, \tag{10b}$$

$$
\theta^- \le \theta \le \theta^+,\tag{10c}
$$

θ˙<sup>−</sup> ≤ θ˙ ≤ θ˙+, (10d)

$$J\_o \dot{\theta} \le B.\tag{10e}$$

It is remarkable that the constraints 10b–10e and the objective function 10a to be optimized are built in different levels, which is very difficult to solve directly. Therefore, we will transform the original QP problem (10) in the velocity level. In order to realize precise tracking control to the desired trajectory xd, by introducing a negative feedback in the outer-loop, the equality constraint can be ensured by letting the end-effector moves at a velocity of x˙ = ˙x<sup>d</sup> + k(x<sup>d</sup> − x). In terms with (10c), according to escape velocity method, it can be obtained by limiting joint speed as α(θ <sup>−</sup> − θ) ≤ θ˙ ≤ α(θ <sup>+</sup> − θ), where α is a positive constant. Combing the kinematic property (2), the reformulated QP problem is:

$$\min \quad \dot{\theta}^{\mathrm{T}} \dot{\theta} / 2,\tag{11a}$$

$$\text{s.t.}\quad J(\theta)\dot{\theta} = \dot{\mathbf{x}}\_{\text{d}} + k(\mathbf{x}\_{\text{d}} - \mathbf{x}),\tag{11b}$$

$$\max(\alpha(\theta^- - \theta), \dot{\theta}^-) \le \dot{\theta} \le \min(\dot{\theta}^+, \alpha(\theta^+ - \theta)), \tag{11c}$$

$$J\_o \dot{\theta} \le B.\tag{11d}$$

It is noteworthy that both the formula (11a) and (11d) are nonlinear. The QP problem cannot be solved directly by traditional methods. Using the parallel computing and learning ability, a deep RNN will be established later.

#### 3. DEEP RNN BASED SOLVER DESIGN

In this section, a deep RNN is proposed to solve the obstacle avoidance problem (11) online. To ensure the constraints (11b), (11c), and (11d), a group of state variables are introduced in the deep RNN. The stability is also proved by Lyapunov theory.

#### 3.1. Deep RNN Design

Firstly, by defining a group of state variables λ<sup>1</sup> ∈ R <sup>m</sup>, λ<sup>2</sup> ∈ R ab , a Lagrange function is selected as:

$$L = \dot{\theta}^{\text{T}} \dot{\theta} / 2 + \lambda\_1^{\text{T}} (\dot{\mathbf{x}}\_{\text{d}} + k(\mathbf{x}\_{\text{d}} - \mathbf{x}) - J(\theta)\dot{\theta}) + \lambda\_2^{\text{T}} (f\_o \dot{\theta} - B), \tag{12}$$

λ<sup>1</sup> and λ<sup>2</sup> are the dual variables corresponding to the constraints (11b) and (11d). According to Karush-Kuhn-Tucker conditions, the optimal solution of the optimization problem (12) can be equivalently formulated as:

$$
\dot{\theta} = P\_{\Omega} (\dot{\theta} - \frac{\partial L}{\partial \dot{\theta}}),
\tag{13a}
$$

$$J(\theta)\dot{\theta} = \dot{\mathbf{x}}\_{\mathrm{d}} + k(\mathbf{x}\_{\mathrm{d}} - \mathbf{x}),\tag{13b}$$

$$\begin{cases} \lambda\_2 > 0 & \text{if } \ J\_o \dot{\theta} = B, \\ \lambda\_2 = 0 & \text{if } \ J\_o \dot{\theta} \le B, \end{cases} \tag{13c}$$

where P(x) = argminy∈||y − x|| is a projection operator to a restricted interval , which is defined as = {θ˙ ∈ R n |max(α(θ <sup>−</sup> − θ), θ˙−) ≤ θ˙ ≤ min(θ˙+, α(θ <sup>+</sup> − θ))}. Notable that Equation (13c) can be simply described as:

$$
\lambda\_2 = (\lambda\_2 + J\_o \dot{\theta} - B)^+, \tag{14}
$$

where (•) <sup>+</sup> is a projection operation to the non-negative space, in the sense that y <sup>+</sup> = max(y, 0).

Although the solution of (13) is exact the optimal solution of the constrained-optimization problem (11), it is still a challenging issue to solve (13) online since its inherent nonlinearity. In this paper, in order to solve (13), a deep recurrent neural network is designed as:

$$\epsilon \dot{\theta} = -\dot{\theta} + P\_{\Omega} \langle \mathbf{J}^{\mathrm{T}} \lambda\_1 - \mathbf{J}\_o^{\mathrm{T}} \lambda\_2 \rangle,\tag{15a}$$

$$
\epsilon \dot{\lambda}\_1 = \dot{\varkappa}\_\mathrm{d} + k(\varkappa\_\mathrm{d} - \varkappa) - f(\theta)\dot{\theta}, \tag{15b}
$$

$$
\epsilon \dot{\lambda}\_2 = -\lambda\_2 + (\lambda\_2 + J\_o \dot{\theta} - B)^+, \tag{15c}
$$

with ǫ is a positive constant scaling the convergence of (15).

Remark. 3 As to the established deep RNN (15), the first dynamic equation is also the output dynamics, which combines the dynamics of state variables λ<sup>1</sup> and λ2, as well as the physical limitations including joint angles and velocities. State variable λ<sup>1</sup> is used to ensure the equality constraint (11b), as shown in (15b), λ<sup>1</sup> is updated according to the difference between reference speed x˙<sup>d</sup> + k(x<sup>d</sup> − x) and actually speed J(θ)θ˙. Similarly, λ<sup>2</sup> is used to ensure the inequality constraint 11d, which will be further discussed later. It is remarkable that ǫ plays an important role in the convergence of the deep RNN. The smaller ǫ, the faster the deep RNN converges.

Remark. 4 By introducing the model information such as J, Jo, etc., the established deep RNN consists of three class of nodes, namely θ˙, λ<sup>1</sup> and λ2, and the total number of nodes is n+m+ab. Comparing to traditional neural networks in Jung and Kim (2007), the complexity of neural networks is greatly reduced.

#### 3.2. Stability Analysis

In this subsection, we offer stability analysis of the obstacle avoidance method based on deep RNN based. First of all, some basic Lemmas are given as below.

Definition 1: A continuously differentiable function F(•) is said to be monotone, if ∇F+∇F T is positive semi-definite, where ∇F is the gradient of F(•).

Lemma 1: A dynamic neural network is said to converge to the equilibrium point if it satisfies:

$$\kappa \dot{\mathfrak{x}} = -\mathfrak{x} + P\_{\mathbb{S}}(\mathfrak{x} - \varrho F(\mathfrak{x})),\tag{16}$$

where κ > 0 and ̺ > 0 are constant parameters, and P<sup>S</sup> = argminy∈<sup>S</sup> ||y − x|| is a projection operator to closed set S.

Lemma 2: (Slotine and Li, 2004) Let V :[0,∞) × B<sup>d</sup> → R be a C 1 function, α1, α<sup>2</sup> be class-K functions defined on [0, d) which satisfy α1(||x||) ≤ V(t, x) ≤ α2(||x||), ∀(t, x) ∈ [0, d) × Bd, then x = 0 is a uniformly asymptotically stable equilibrium point if there exist some class-K function g defined on [0, d), to make the following inequality hold:

$$\frac{\partial V}{\partial t} + \frac{\partial V}{\partial \mathbf{x}} f(t, \mathbf{x}) \le -\alpha (||\mathbf{x}||), \forall (t, \mathbf{x}) \in [0, \infty) \times B\_d,\tag{17}$$

About the stability of the closed-loop system, we offer the following theorem.

Theorem 1: Given the obstacle avoidance problem for a redundant manipulator in kinematic control tasks, the proposed deep recurrent neural network is stable and will globally converge to the optimal solution of (10).

Proof: The stability analysis consists of two parts: firstly, we will show that the equilibrium of the deep RNN is exactly the optimal solution of the control objective described in (11), which prove that the output of deep RNN will realize obstacle avoidance while tracking a given trajectory, and then we will prove that the deep recurrent neural network is stable in sense of Lyapunov.

Part I. As to the deep recurrent neural network (15), let (θ˙<sup>∗</sup> , λ ∗ 1 , λ ∗ 2 ) be the equilibrium of the deep RNN, then (θ˙<sup>∗</sup> , λ ∗ 1 , λ ∗ 2 ) satisfies:

$$-\dot{\theta}^\* + P\_\Omega \mathbf{J}^\mathrm{T} \lambda\_1^\* - J\_o^\mathrm{T} \lambda\_2^\* \mathbf{j} = \mathbf{0},\tag{18a}$$

$$
\dot{x}\_{\rm d} + k(x\_{\rm d} - x) - f(\theta)\dot{\theta}^\* = 0,\tag{18b}
$$

$$-\lambda\_2^\* + (\lambda\_2^\* + J\_o \dot{\theta}^\* - B)^+ = 0,\tag{18c}$$

with θ˙<sup>∗</sup> be the output of deep RNN. By comparing Equation (18) and (13), we can readily obtain that they are identical, i.e., the equilibrium point satisfies the KKT condition of (10).

Then we will show that the equilibrium point(output of the proposed deep RNN) is capable of dealing with kinematic tracking as well as obstacle avoidance problem. Define a Lyapunov function V about the tracking error e = x<sup>d</sup> − x as V = e T e/2, by differentiating V with respect to time and combining (11b), we have:

$$\dot{V} = e^{\mathcal{T}} \dot{e} = e^{\mathcal{T}} (\dot{\mathbf{x}}\_{\mathrm{d}} - J(\theta) \dot{\theta}^{\*})$$

$$= -k \boldsymbol{\varepsilon}^{\mathrm{T}} e \le 0,\tag{19}$$

in which the dynamic Equation 18b is also used. It can readily obtained that the tracking error would eventually converge to 0.

It is notable that the dynamic (Equation 18c) satisfies:

$$
\lambda\_2^\* + J\_o \dot{\theta}^\* - B - (\lambda\_2^\* + J\_o \dot{\theta}^\* - B)^+ = J\_o \dot{\theta}^\* - B.\tag{20}
$$

According to the property of projection operator (•) <sup>+</sup>, y−(y) <sup>+</sup> ≤ 0 holds for any y, then we have Joθ˙<sup>∗</sup> − B ≤ 0, together with (5), the inequality (5) is satisfied. Notable that (5) can be rewritten as:

$$
\dot{D} \ge -\text{sgn}(D)\emptyset(|D|). \tag{21}
$$

As to (21), we first consider the situation when equality holds. Since g(|D|) is a function belonging to class K, it can be easily obtained that D = 0 is the only equilibrium of D˙ = −sgn(D)g(|D|). Define a Lyapunov function as V2(t, D) = D 2 /2, and select two functions as α1(|D|) = α2(|D|) = D 2 /2. It is obvious that α1(|D|) = α2(|D|) belong to class-K, and the following inequality will always hold:

$$
\alpha\_1(|D|) \le V\_2(t, D) \le \alpha\_2(|D|). \tag{22}
$$

Taking the time derivative of V2(t, D), we have:

$$\frac{\partial V\_2}{\partial \mathbf{t}} + \frac{\partial V}{\partial D} \dot{D} = -|D| \mathbf{g}(|D|) \le 0. \tag{23}$$

According to Lemma 2, the equilibrium x = 0 is uniformly asymptotically stable. Then we arrive at the conclusion that if the equality d(|OjA<sup>i</sup> |)/dt = −sgn(D)g(|D|) holds, |D| = 0 will be guaranteed, i.e., |OjA<sup>i</sup> | − d for all i = 1 · · · a, = 1 · · · b. Based on comparison principle, we can readily obtain that |OjA<sup>i</sup> | ≥ d when d(|OjA<sup>i</sup> |)/dt ≥ −sgn(D)g(|D|).

Part II. Then we will show the stability of the deep RNN (15). Let ξ = [θ˙<sup>T</sup> , λ T 1 , λ T 2 ] <sup>T</sup> be the a concatenated vector of state variables of the proposed deep RNN, then (15) can be rewritten as:

$$
\epsilon \dot{\xi} = -\xi + P\_{\bar{\Omega}}[\xi - F(\xi)], \tag{24}
$$

where PS(•) is a projection operator to a set S, and F(ξ ) = [F1(ξ ), F2(ξ ), F3(ξ )]<sup>T</sup> ∈ R <sup>n</sup>+m+ab, in which:

$$
\begin{bmatrix} F\_1 \\ F\_2 \\ F\_3 \end{bmatrix} = \begin{bmatrix} \dot{\theta} - J^\mathrm{T} \lambda\_1 + J\_o^\mathrm{T} \lambda\_2 \\ J\dot{\theta} - \dot{\varkappa}\_\mathrm{d} - k(\varkappa\_\mathrm{d} - \varkappa) \\ -J\_o \dot{\theta}^\* - B \end{bmatrix}.
$$

Let ∇F = ∂F/∂ξ , we have:

$$\nabla F(\xi) = \begin{bmatrix} I & -J^T & J\_o^T \\ J & 0 & 0 \\ -J\_o^T & 0 & 0 \end{bmatrix}. \tag{25}$$

According to the definition of monotone function, we can readily obtain that F(ξ ) is monotone. From the description of (24), the projection operator P<sup>S</sup> can be formulated as P<sup>S</sup> = [P; PR; P3], in which P is defined in (13), P<sup>R</sup> can be regarded as a projection operator of λ<sup>1</sup> to R, with the upper and lower bounds being ±∞, and P<sup>3</sup> = (•) <sup>+</sup> is a special projection operator to closed set R ab + . Therefore, P<sup>S</sup> is a projection operator to closed set [; R <sup>m</sup>; R ab <sup>+</sup> ]. Based on Lemma 1, the proposed neural network (15) is stable and will globally converge to the optimal solution of (10). The proof is completed.

#### 4. NUMERICAL RESULTS

In this section, the proposed deep RNN based controller is applied on a planar 4-DOF robot. Firstly, a basic case where the obstacle is described as a single point is discussed, and then the controller is expanded to multiple obstacles and dynamic ones. Comparisons with existing methods are also listed to indicate the superiority of the deep RNN based scheme.

#### 4.1. Simulation Setup

The physical structure of the 4-link planar robot to be simulated in shown in **Figure 2**, in which the critical points of the robot are also marked. As shown in **Figure 2A**, critical points A2, A4, A<sup>6</sup> are selected at the joint centers, and A1, A3, A5, A<sup>7</sup> are selected at the center of robot links. The D-H parameters are given in **Figure 2B**. It is notable that A<sup>i</sup> and the Jacobian matrix Joi are essential in the proposed control scheme. Based on the above description of A<sup>i</sup> , the D-H parameters of A<sup>1</sup> is a<sup>1</sup> = 0.15, a<sup>2</sup> = a<sup>3</sup> = 0, α<sup>1</sup> = α<sup>2</sup> = α<sup>3</sup> = 0, d<sup>1</sup> = d<sup>2</sup> = d<sup>3</sup> = 0, then both the position and Jacobian matrix Ja<sup>1</sup> of A<sup>1</sup> can be calculated readily. Based on the definition in Equation 8, Jo<sup>1</sup> can be obtained. A<sup>i</sup> and Joi can be calculated similarly. The control parameters are set as ǫ = 0.001, α = 8, k = 8. As to the physical constraints, the limits of joint angles and velocities are selected as θ − <sup>i</sup> = −3rad, θ + <sup>i</sup> = 3rad, θ˙− <sup>i</sup> = −1rad/s, <sup>θ</sup>˙<sup>+</sup> <sup>i</sup> = 1rad/s for i = 1 . . . 4. The safety distance d is set to be 0.1m.

#### 4.2. Single Obstacle Avoidance

In this simulation, the obstacle is assumed to be centered at [−0.1, 0.2]Tm, the desired path is set as x<sup>d</sup> = [0.4 + 0.1cos(0.5t), 0.4 + 0.1sin(0.5t)]Tm, and the initial joint angles are set to be θ<sup>0</sup> = [π/2, −π/3, −π/4, 0]<sup>T</sup> rad. The class-K function is selected as G(|D|) = K1|D| with K<sup>1</sup> = 200. In order to show the effectiveness of the proposed obstacle avoidance method, contrast simulations with and without inequality constraint (10e) are conducted. Simulation results are shown in **Figure 3**. When ignoring the obstacle, the end-effector trajectories and the corresponding incremental configurations are shown in **Figure 3A**, although the robot achieves task space tracking to xd, obviously the first link of the robot would collide with the obstacle. After introducing obstacle avoidance scheme, the robot moves other joints rather than the first joint, and then avoids the obstacle effectively (**Figure 3B**). Simultaneously, the tracking errors when tracking the given circle are shown in **Figure 3C**. From the initial state, the end-effector moves toward the circle quickly and smoothly, after that, the tracking error in stable state keeps < 1 × 10−4m, showing that the robot could achieve kinematic control as well as obstacle avoidance tasks. To show more details of the proposed deep RNN based method, some important process data is given. As the obstacle is close to the first joint, critical points A<sup>1</sup> and A<sup>2</sup> are more likely to collide with obstacle, therefore, as a result, the distances between the obstacle O<sup>1</sup> and A1, A<sup>2</sup> are shown in **Figure 3D**, from t = 2s to t = 6.5s, ||A1O1|| remains at the minimum value d = 0.1, that is to say, using the proposed obstacle avoidance method, the robot maintains minimum distance from the obstacle. The profile of joint velocities are shown in **Figure 3E**, at the beginning of simulation, the robot moves at maximum speed, which leads to the fast convergence of tracking errors. The curve of joint angles change over time is shown in **Figure 3F**.

#### 4.3. Discussion on Class-K Functions

In this part, we will discuss the influence of different class-K functions in the avoidance scheme (5). Four functions are selected as G1(|D|) = K|D| 2 , G2(|D|) = K|D|, G3(|D|) = Ktanh(5|D|), G4(|D|) = Ktanh(10|D|), **Figure 4A** shows the comparative curves the these functions. Other simulation settings are the same as the previous one. Simulation results are shown in **Figure 4B**. When selecting the same positive gain K, the minimum distance is about 0.08m, which shows the robot can avoid colliding with the obstacle using the avoidance scheme (5). The close-up graph of the tracking error is also shown, it is remarkable that the minimum distance deceases, as the gradient of the class-K function increases near 0.

Therefore, one conclusion can be drawn that the function can be more similar with Sign function, to achieve better obstacle avoidance.

## 4.4. Multiple Obstacles Avoidance

In this part, we consider the case where there are two obstacles in the workspace. The obstacles are set at [0.1, 0.25]Tm and [0, 0.4]Tm, respectively. Simulation results are shown in **Figure 5**. The desired path is defined as x<sup>d</sup> = [0.45 + 0.1cos(0.5t), 0.4 + 0.1sin(0.5t)]<sup>T</sup> . The initial joint angle of the robot is selected as θ<sup>0</sup> = [1.5, −1 − 1, 0]<sup>T</sup> . To further show the effectiveness of the proposed obstacle avoidance strategy 5, g|D| is selected as g|D| = K1/(1 + e −|D| ) − K1/2 with K<sup>1</sup> = 200. When λ<sup>2</sup> is set to 0, as shown in **Figure 5A**, the inequality constraint (11d) will not work, in other words, only kinematic tracking problem in considered rather than obstacle avoidance, in this case, the robot would collide with the obstacles. After introducing online training of λ2, the simulation results are given in **Figures 5B–H**. The tracking errors are shown in **Figure 5C**, with the transient time being about 4s, and steady state error < 1 × 10−3m. Correspondingly, the robot moves fast in the transient stage, ensuring the quick convergence of the tracking errors. It is remarkable that the distances between the critical points and obstacle points are kept larger than 0.1m at all times, showing the effectiveness of the proposed method. At t = 14s, from **Figures 5D,G**, when the distance between A<sup>3</sup> and O<sup>1</sup> is close to 0.1m, the corresponding dual variable λ<sup>2</sup> becomes positive, making the inequality constraint (11d) hold, and the boundedness between the robot and obstacle is thus guaranteed. After t = 18s, ||A3O1|| becomes greater, then λ<sup>2</sup> converges to 0. Notable that although λ<sup>1</sup> and λ<sup>2</sup> do not converge to certain values, the dynamic change of λ<sup>1</sup> and λ<sup>2</sup> ensures the regulation of the proposed deep RNN.

## 4.5. Enveloping Shape Obstacles

In this part, we consider obstacles of general significance. Suppose that there is a rectangular obstacle in the workspace, with the vertices being [0, 0.5]<sup>T</sup> , [0.4, 0.5]<sup>T</sup> , [0.4, 0.6]<sup>T</sup> and [0.5, 0.6]<sup>T</sup> , respectively. By selecting the safety distance d = 0.1m, and obstacle points as O<sup>1</sup> = [0.05, 0.55]<sup>T</sup> , O<sup>2</sup> = [0.15, 0.55]<sup>T</sup> , O<sup>3</sup> = [0.25, 0.55]<sup>T</sup> and O<sup>4</sup> = [0.35, 0.55]<sup>T</sup> . It can be readily obtained that the rectangular obstacle is totally within the envelope defined by O<sup>i</sup> and d. The incremental configurations when tracking the path while avoiding the obstacle are shown in **Figure 6B**, in which a local amplification diagram is also given. showing that the critical points A<sup>3</sup> is capable of avoiding O<sup>2</sup> and O3. It is worth noting that by selecting proper point group and safety distance, the obstacle can be described by the envelope shape effectively. While in **Figure 6A**, when obstacle avoidance is ignored, the collision emerges. **Figures 6C–H** also give important process data of the system under the proposed controller, including tracking errors, joint angles, angular velocities, and state variables of deep RNNs. We can observe that the physical constraints as well as kinematic control task are realized using the controller.

FIGURE 4 | Discussions on different obstacle avoidance functions. (A) is the comparative curves of different obstacle avoidance functions. (B) is the profile of minimum distance of the robot and obstacle using different obstacle avoidance functions.

FIGURE 5 | Numerical results of multiple obstacle avoidance. (A) is the motion trajectories when ignoring obstacle avoidance scheme, (B) is the motion trajectories when considering obstacle avoidance scheme, (C) is the profile of tracking errors when considering obstacle avoidance scheme, (D) is the profile of distances between critical points and obstacles, (E) is the profile of joint velocities, (F) is the profile of λ2, (G) is the profile of joint angles, (H) is the profile of λ1.

## 4.6. Dynamic Obstacles

In this part, we consider dynamic obstacles moving in the workspace. In real applications, pedestrian or other obstacles always tend to be mobile. Obstacle avoidance for dynamic obstacles is of more general significance. In real time, static obstacles can be considered a special case. In this simulation, the simulation duration is selected as 20s, and the trajectory of a dynamic obstacle is defined as x<sup>d</sup> = [−0.1 + 0.01t, 0.3]<sup>T</sup> . The snapshots in the control process are shown in **Figure 8**. While ensuring effective tracking of the defined path, the robot is able to use its self-motion to avoid the dynamic obstacle effectively, and maintain a safe distance. The distances between critical points and the dynamic O is shown in **Figure 7B**. At the beginning of simulation, the tracking error is big, in order to ensure the convergence of tracking error, the joints move a big range except J1. It is worth noting that since the critical point A<sup>2</sup> is next to O, joint 1 rotates in the direction which conforms to the movement of O. In the stable state, tracking error is < 5 × 10−4m (**Figure 7A**). At about t = 14s, A2O decreases to 0.1m, accordingly, the joint velocities change obviously (as shown the significant change of joint velocities in **Figure 7C**, the tracking error changes to 10−3m, and then converges quickly. At t = 18s, although A<sup>2</sup> and A<sup>3</sup> are near O, the robot is still capable of avoiding the dynamic obstacle. During the control process, joint angles are ensured not to exceed the limits, as shown in **Figure 7D**.

distances between critical points and obstacles, (C) is the profile of joint velocities, (D) is the profile of joint angles.

## 4.7. Obstacle Performance on 7-DOF Manipulator in 3-Dimensional Space

To further verify the effectiveness of the control scheme, another simulation on a 7DOF manipulator iiwa is carried out. The desired path to be tracked is also a planar circular, which is centered at [0, −0.6, 0.1]Tm with radius being 0.15m. The physical parameters can be found in Xu et al. (2019a). Suppose that there exist a cylinder obstacle in the workspace, the obstacle is centered as [−0.13, −0.3, 0]Tm, with the radius and height being 0.15m and 2m, respectively. Simulation results are shown in **Figure 9**. It can be readily found that the proposed schemes can obtain satisfying performance in 3-dimensional spaces.

#### 4.8. Comparisons

To illustrate the priority of the proposed scheme, a group of comparisons are carried out. As shown in **Table 1**, all the controllers in Zhang and Wang (2004); Csiszar et al. (2011); Guo and Zhang (2012); Krzysztof and Joanna (2016) achieve the avoidance of obstacles. Comparing to APF method in Csiszar et al. (2011); Krzysztof and Joanna (2016) of JP based method in Csiszar et al. (2011); Krzysztof and Joanna (2016), the proposed controller can realize a secondary task, at the same time, we present a more general formulation of the obstacle avoidance strategy, which is helpful to gain a deeper understanding of the mechanism for avoidance of obstacles. Moreover, in this paper, both dynamic trajectories and obstacles are considered. The comparisons above also highlight the main contributions of this paper.

## 5. CONCLUSIONS

In this paper, a novel obstacle avoidance strategy is proposed based on a deep recurrent neural network. The robots are obstacles are presented by sets of critical points, then the distance between the robot and obstacle can be approximately describes as point-to-points distances. By understanding the nature escape velocity methods, a more general description TABLE 1 | Comparisons among different obstacle avoidance controllers on manipulators.


\**In Zhang and Wang (2004); Guo and Zhang (2012); Krzysztof and Joanna (2016), dynamic obstacles are not considered.*

\*\**Regular escape velocity method is used, which is only a special case of 5.*

of obstacle avoidance strategy is proposed. Using minimumvelocity-norm (MVN) scheme, the obstacle avoidance together with path tracking problem is formulated as a QP problem, in which physical limits are also considered. By introducing model information, a deep RNN with simple structure is established to solve the QP problem online. Simulation results show that the proposed method can realize the avoidance of static and dynamic obstacles.

## DATA AVAILABILITY

All datasets analyzed for this study are included in the manuscript and the supplementary files.

## AUTHOR CONTRIBUTIONS

Theoretical derivation was done by ZX. Simulation research was performed by ZX, with important advice from XZ. The paper was revised by XZ and SL. All authors approved the manuscript.

#### FUNDING

This work is supported by the GDAS′ Foundation of National and Provincial Science and Technology Talent (Grant No. 2018GDASCX-0603), Guangdong Province Applied Science and Technology Research Project (Grant No.2015B090922010), Guangzhou Science and Technology Planning Project(Grant NO.201803010106), Guangdong Province Science and Technology Major Projects (Grant No. 2016B090910003),

### REFERENCES


Guangdong Province Science and Technology Major Projects (Grant No. 2017B010116005), Guangdong Province Key Areas R&D Program (Grant No. 2019B090919002).

## ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their valuable comments and suggestions.


unknown physical parameters. IEEE Trans. Indust. Electr. 65, 4909–4920. doi: 10.1109/TIE.2017.2774720


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Zhou and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Novel Model for Arbitration Between Planning and Habitual Control Systems

#### Farzaneh Sheikhnezhad Fard and Thomas P. Trappenberg\*

Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada

It is well-established that human decision making and instrumental control uses multiple systems, some which use habitual action selection and some which require deliberate planning. Deliberate planning systems use predictions of action-outcomes using an internal model of the agent's environment, while habitual action selection systems learn to automate by repeating previously rewarded actions. Habitual control is computationally efficient but are not very flexible in changing environments. Conversely, deliberate planning may be computationally expensive, but flexible in dynamic environments. This paper proposes a general architecture comprising both control paradigms by introducing an arbitrator that controls which subsystem is used at any time. This system is implemented for a target-reaching task with a simulated two-joint robotic arm that comprises a supervised internal model and deep reinforcement learning. Through permutation of target-reaching conditions, we demonstrate that the proposed is capable of rapidly learning kinematics of the system without a priori knowledge, and is robust to (A) changing environmental reward and kinematics, and (B) occluded vision. The arbitrator model is compared to exclusive deliberate planning with the internal model and exclusive habitual control instances of the model. The results show how such a model can harness the benefits of both systems, using fast decisions in reliable circumstances while optimizing performance in changing environments. In addition, the proposed model learns very fast. Finally, the system which includes internal models is able to reach the target under the visual occlusion, while the pure habitual system is unable to operate sufficiently under such conditions.

Keywords: machine learning, reinforcement learning, supervised learning, habitual controller, planning, internal models, decision making

## 1. INTRODUCTION

Much of the current reinforcement learning (RL) literature implements model-free control. Such a learning agent learns a value function from interacting with the environment, usually updating a proposed value function from a temporal difference between the previous expectation and a new experience (Mnih et al., 2013, 2015). The value function is like a big lookup-table that can quickly supply evaluations for possible actions and hence provide guidance for actions in a fast and somewhat automated way. Such a decision system can be characterized as habitual. While habitual action selection takes time to learn and requires that similar previous situations have been encountered sufficiently, the advantage is that decisions and corresponding actions can be generated in a timely manner.

#### Edited by:

Florian Röhrbein, Technical University of Munich, Germany

#### Reviewed by:

Eiji Uchibe, Advanced Telecommunications Research Institute International (ATR), Japan Yangwei You, Institute for Infocomm Research (A\*STAR), Singapore

#### \*Correspondence:

Thomas P. Trappenberg tt@cs.dal.ca

Received: 27 February 2019 Accepted: 28 June 2019 Published: 11 July 2019

#### Citation:

Sheikhnezhad Fard F and Trappenberg TP (2019) A Novel Model for Arbitration Between Planning and Habitual Control Systems. Front. Neurorobot. 13:52. doi: 10.3389/fnbot.2019.00052

In contrast, a system that has some internal models of the environment can be used to derive a value function on demand for a specific situation. A prime example is a Markov decision problem where the reward function and transition function of the agent are known so that the Bellman equations can be used to calculate the optimal value function for every state action pair without the need to explore the environment. Of course, this system requires learning of the internal models, which requires previous interactions with the environment. The learning of internal models can be achieved through some form of supervised learning. Once the models have been learned, the model-based system is able to calculate a value function on the fly. This resembles some form of internal deliberation. The advantage of such a system is its flexibility to new situations. However, deliberations take time so that a habitual system is preferable when it comes to situations that benefit from a higher degree of automation.

In this paper, we introduce a learning system that we call the Arbitrated Predictive Actor-Critic (APAC) that combines a habitual reinforcement learning system with a supervised learning system of internal models. Most importantly, we introduce an arbitration system that mediates between their usage. We specifically discuss a situation in which both systems alone can solve an exemplary task so that we can study the consequences of their direct interactions in relation to their exclusive use. We show that this system is responsive to changes in the environment and that it can learn the reward function very fast. Our results demonstrate how the learning paradigm tend to rely on habits after learning the reward function. Our results are in line with evidence of human behavior mentioned above.

## 2. THEORETICAL PREMISES

There is a lot of behavioral and neurophysicological evidence for different types of control systems in the brain that are usually termed habitual or model-free and goal-directed or model-based (Balleine and Dickinson, 1998; Gläscher et al., 2010; Daw et al., 2011). In particular, one control system associated with the prefrontal cortex (Miller and Cohen, 2001) predicts action-outcomes using an internal model of the agent's environment and hence can be associated with a control system that uses deliberative planning. We will use in this paper the term deliberative planning instead of goal directed model-based control to minimize the possible confusion between the models of the environment from the models of the value function. Another control pathway in the brain is associated with the dorsolateral basal ganglia (Houk and Barto, 1995) learns to repeat previously rewarded actions that resemble a habitual system.

Some research showed that the two different control systems are used in different situations and can be simultaneously active (Poldrack et al., 2001; Lengyel and Dayan, 2008). For example, in the brain, the cortical system represents a generalized mapping of input distributions while hippocampal learning is an instancebased system (Lengyel and Dayan, 2008). Moreover, when the model of the environment is known and there is sufficient time to plan, the best strategy is deliberate planning (Daw et al., 2005), but when the decision should be taken very fast the habitual controller is used (Blundell et al., 2016). Other work shows that cooperation and competition between different control systems in the brain happens especially when outcomes of each control system disagree, that is, if a deliberated planning task does not align with a habitual skill (Daw et al., 2005, 2011; Daw and O'Doherty, 2013; Lee et al., 2014).

Moreover, feedback to learning systems can differ in different situations and can be provided from different modalities such as vision or auditory input. In machine learning, it is common to distinguish different learning paradigms. One is supervised learning where a teacher gives feedback from the knowledge of a desired behavior. The system can be trained by comparing the actual output of a leaner to the desired output provided by the teacher. The teacher is basically a critic who can quantify an objective function that a leaner needs to optimize. Another slightly more general learning paradigm is reinforcement learning where the environment only provides some indication of value, often only after a series of actions have been chosen by a learning agent. Reinforcement learning is thus more general in that it can be applied to a lot of more typical learning situations of an agent in an environment. The subsystems in our model align in our implementation with a supervised paradigm to learn internal models and a reinforcement learning paradigm to learn habitual control.

Habitual reinforcement learning which is based on TD learning (Sutton, 1985) has been very successful in explaining experimental evidence from the animal learning literature and dopamine-based learning in the brain (Barto, 1995; Schultz et al., 1997). Such models which have originally been formulated with tubular methods based on discrete state action spaces are now commonly combined with neural networks as a function approximator that broadens the range of practical applications to be solved using RL, especially for control problems with continuous states/actions spaces (Waltz and Fu, 1965; Barto et al., 1990). Barto, Sutton and Anderson introduced the Actor-Critic architecture that was implemented by neural networks (Barto et al., 1983). Later, Barto (1995) represented an adaptive critic which has similar behavior to the dopamine neurons projection to the Striatum and frontal cortex. The adaptive critic uses the internal sensory information to learn an effective reinforcement signal.

It has long been hypothesized that the brain builds an internal representation of the world and its body (Miall et al., 1993; Miall and Wolpert, 1996; Wolpert et al., 1998; Kawato et al., 2003), and evidence shows that "forward" and "inverse models" exist in the brain (Miall et al., 1993; Kawato et al., 2003). The internal model is used to perform in the environment and learn a new task. Flanagan and Wing (1997) showed that the internal model can predict the load force and the kinematics of a hand movement that depends on the load. Moreover, when learning how to use a new tool, humans make a transient change in the internal model of the arm as well as making an internal model of the tool (Kluzik et al., 2008). Furthermore, imitation experiments show that a direct mapping develops between observation and the internal model (Iacoboni et al., 1999). Another advantage of having an internal representation

is obtaining a reliable source of information for the agent to perform accurately even if there are no other sources of information (e.g., visual information) available (Wolpert et al., 1998; Kawato et al., 2003).

In this paper, we propose a model to study the cooperation and competition between a habitual and planning-based control components with an arbitrator component. The general architecture of the proposed Arbitrated Predictive Actor-Critic (APAC) is shown in **Figure 1**. In this model, each control paradigm implies a specific type of teaching feedback. The deliberative planning controller incorporates internal models that are usually trained with supervised errors so that we consider here an explicit state predictions error. In contrast, the habitual action selection system is a common deep reinforcement learner which learns from reward prediction errors. The new component here is an arbitrator that mediates between these systems that can select the command given to the controlled system, the agent or in the plant in the common language of control theory. Of course, it is possible that both decision systems are trained with a combination of supervised and reinforcement learning, but this is not the crucial point in this paper. The model is designed to study how a combined control system behaves in different environmental situations. In the following section we apply this general model to a specific motor control task in which both systems can be trained on the same feedback signal, but in which the execution would follow a habitual or planning implementation.

## 3. APAC FOR TARGET REACHING

In this section, we apply the APAC model to the motor control task of target reaching. We choose this task as it is a good example of a minimal control task while being complex enough to highlight the advantages and disadvantages of the two principle control architectures discussed in this paper. Target reaching lives in a continuous state and action space with 6 degrees of freedom when considering a shoulder and elbow yaw, pitch and roll, although we simplify this here even more to a 2-dimensional system with only one angle for the elbow and one the shoulder. Learning the reaching task in this 2D environment is learning a non-linear mapping function that

maps joint angles of the robot arm onto a location of the end-effector in the environment. An example image of our simulated robot arm is shown in **Figure 2** with the black line. The contour plot shows the distance to the target while the dotted green line shows an internal model of the robot arm early in learning.

The refined control architecture of our APAC model for the reaching tasks is shown in **Figure 3**. For this application, the state is defined as the position of the elbow, the end-effector, and the target. The planning component is now implemented as a combination of deep forward and inverse models, while the habitual system is implemented as a deep actor-critic model. An integrator is used to derive the training signals that are used for the feedback. In the following, we specify each subsystem in detail.

## 3.1. Habit Learning Control System

The habitual controller is implemented as a deep deterministic policy gradient model (DDPG) following the work of Lillicrap et al. (2015). The arm position is given by the vector X with two vector components, the position of the end-effector X end and the position of the elbow X elbow. The arm position together with the target location Xtarget defines the current state s<sup>t</sup> = [X end , X elbow, Xtarget] of the agent.

The critic Q(s<sup>t</sup> , at; θ <sup>Q</sup>) is implemented as a deep neural network, where s<sup>t</sup> is the current state at time t, a<sup>t</sup> is the action taken at time t, and θ <sup>Q</sup> are the parameters of the critic network. The goal of the critic is to approximate the accumulation of the environmental reward (sometimes called return) that can be expected from a certain state action

and the actor goes to the arbitrator. Then the arbitrator selects one of these actions. The output of the arbitrator along with the current state is the input to the critic from which the critic predicts a reward. The forward model receives the selected action and the current position of the agent and predicts the future state of the agent. The agent takes the selected actions and transfers to the next state. The predicted future state from the forward model is integrated with the estimation of the actual state of the plant after taking the action and obtains a new current state for the system.

combination. The critic is learned through temporal difference (TD) learning (Sutton, 1988; Schultz et al., 1997).

$$\mathcal{Q}(s\_t, a\_t; \theta^{\mathcal{Q}}) \leftarrow \mathcal{Q}(s\_t, a\_t; \theta^{\mathcal{Q}}) + l\_1 \delta,\tag{1}$$

$$\delta = \underbrace{r\_l + \nu \max\_{a'} Q'(s\_{l+1}, a'; \theta^{\triangle})}\_{\text{estimated reward}} - \underbrace{\mathcal{Q}(s\_l, a\_l; \theta^{\triangle})}\_{\text{actual reward}},\tag{2}$$

where l<sup>1</sup> is the learning rate of the critic network, r<sup>t</sup> is the actual immediate reward received from the environment at time t, γ is a discount factor, and Q′ represents the estimation of the value of a state-action pair. As in DDPG, we use the main network for training but we use a target network for predicting, which is a less frequently updated copy of the main network to avoid oscillation. More precisely, DDPG actually has two target networks, one for the critic network and one for the actor network. We follow directly the smooth update for the target networks as in DDPG (Lillicrap et al., 2015),

$$
\boldsymbol{\theta}' = \boldsymbol{\theta}' \times (1 - \mathbf{r}) + \boldsymbol{\theta} \times \mathbf{r}, \tag{3}
$$

with change parameter τ ≪ 1. The parameters θ ′ represents the target network parameters, and θ is the parameter of the main network, either the actor or the critic.

The computation in Equation (2) is done in the TD component. To train the critic using the TD rule, the error needs to be back-propagated through the critic. The error between estimated value and the actual value is used to compute the loss function of the critic (Equation 4),

$$L^{\mathcal{Q}} = 1/N \sum \text{(\(\delta\)}^2. \tag{4}$$

DDPG takes advantage of the experience memory replay which is a memory to store and reuse past experiences. The memory replay is in form of R(s<sup>t</sup> , a<sup>t</sup> ,rt ,st+1, Tt), where s<sup>t</sup> is the current state at time t, a<sup>t</sup> is the action taken at time t, st+<sup>1</sup> is the next state, r<sup>t</sup> is the reward received at time t, and T<sup>t</sup> indicates whether the state at time t + 1 is a terminal or not. The replay memory is a queue-like buffer with a finite size. The agent will forget older experiences and it will update its parameters based on its recent experiences. At each time step, a random batch of N samples is selected from the experience memory replay, and this batch is used to train both the actor and the critic.

The actor (π) receives the current state (st) and predicts future actions to be taken (at).

$$
\pi(s\_t; \theta^\pi) = a\_t,\tag{5}
$$

where a<sup>t</sup> = [α1, β1] and α<sup>1</sup> and β<sup>1</sup> are motor commands sent to the shoulder and elbow, respectively. The actor is implemented as a deep network where θ π indicates the parameter of the actor network and is trained using the deterministic policy gradient method (Silver et al., 2014). Note that the main actor network is used for training, however, the target network of the actor is used for the action prediction.

The changes of the weights of the actor corresponded to the changes in expected reward with respect to the actor's parameters,

$$\theta\_t^{\pi} \leftarrow \theta\_t^{\pi} + l\_2 \frac{\partial \mathbb{Q}(\mathbf{s}\_t, a\_t; \theta^{\triangle})}{\partial \pi \{\mathbf{s}\_t; \theta\_t^{\pi}\}} \frac{\partial \pi \{\mathbf{s}\_t; \theta\_t^{\pi}\}}{\partial \theta\_t^{\pi}},\tag{6}$$

where l<sup>2</sup> here is the learning rate of the actor. The plant, which is the simulated arm in our example, takes the action and transitions to its new position [Z end t+1 , Z elbow t+1 ], which forms the new state st+<sup>1</sup> when combined with the target location. Like DDPG we apply noise to the environment using an Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930) that results in new samples.

#### 3.2. Internal Models for Planning

For the planning controller, we need to learn the transition function of the plant to build the model of the environment. Here we use supervised learning to learn the internal representation of the agent. More specifically, we used a supervised learning controller that uses past experiences to generalize an inverse model of the arm and a forward model of the arm. The training examples used in our implementation are obtained from the same experience replay memory that is used for the habitual controller.

A combination of a forward and an inverse model is used for planning the next actions. The forward model f<sup>F</sup> is a neural network that receives the current position of the arm [X end t , X elbow t ] and the action a<sup>t</sup> and predicts the future position of the arm [X ′end t+1 , X ′elbow t+1 ]. We can train the network from the discrepancy between the predicted future position [Z end t+1 , Z ellbow t+1 ] and the actual position from visual feedback. For training we use the loss function

$$L^{f\_F} = \frac{1}{N} \sum\_{t} ([X\_{t+1}^{\prime end}, X\_{t+1}^{\prime \prime \prime \text{llow}}] - [Z\_{t+1}^{\prime end}, Z\_{t+1}^{\prime \prime \text{llow}}])^2,\tag{7}$$

where N is the number of selected samples in a batch of experiences stored in the replay memory. The size of the batch to train the forward model and the inverse model is the same as the one used for the actor and the critic.

An inverse model is another deep network, fI(st; θ <sup>f</sup>I). The aim of the inverse model is to provide a proper action to reach the target by minimizing the error between predicted action (a ′ t ) with the actual action taken (at) that transfers the agent from the current position to its next position. This network is then trained on the loss function:

$$L^{\rm f} = \frac{1}{N} \sum\_{t} (a\_t - a\_t')^2. \tag{8}$$

The aim of having the forward model is learning to predict future positions of the agent by taking specific actions. Such a model enables the agent to perform the task even with occluded vision. When the inverse model has been trained well, it can be used to produce a suitable action to transfer the agent from its current state to the target location by replacing Xtarget with Z end t . Hence, the inverse model can be trained with the input [X end t−1 , X elbow t−1 , Z end t ] and predicting the proper actions on [X end t−1 , X elbow t−1 , Xtarget]. Note that [X end t−1 , X elbow t−1 ] are part of states st in the replay memory while Z end t is taken from st+<sup>1</sup> in the replay memory.

Another component of the overall system is "the integrator" module. In general, the integrator could be a Bayes filter such as a Kalman filter which estimates the best estimated position from the available information that combines an internal model prediction with external sensory feedback. Since we use a reliable visual feedback in our case study we simplify this to an integrator that passes the actual location of the plant in case visual information is available. With occluded vision, the prediction of the forward model is used as the estimated actual position of the agent. In our previous work (Fard et al., 2015), we showed how to implement a Kalman Filter with Dynamic Neural Fields (Amari, 1977). The integrator is the explicit critic in this example, which provides the state prediction error for the forward model (see **Figure 1**).

A training session of the system includes an infant phase that uses "motor babbling" (Iverson and Fagan, 2004; von Hofsten, 2004; Demiris and Dearden, 2005; Iverson et al., 2007; Caligiore et al., 2008). During the babbling phase, the plant produces random movements with random actions to produce actual samples to be stored in the experience memory. In the babbling phase, the actual position of the arm after taking an action is considered the target location. Therefore, all samples in this case reach the terminal state and will gain the maximum reward value. The babbling phase is used to provide valid examples to train both control systems.

## 3.3. Arbitration Between Habitual and Planning Controllers

A novel component of APAC is an arbitrator. The arbitrator receives action predictions from the deliberative planning module (the inverse model), and the habitual action selection module (the actor), and makes the final decision of which action to use. This selected action is transferred to the actuators of the plant to bring the agent into its new position resulting in a new state when combined with the target location. The arbitrator's decision is also fed into the forward model and the critic for training purposes. As in DDPG, noise from an Ornstein-Uhlenbeck process is added to both proposal actions provided by the inverse model and the actor.

In our implementation of the APAC, we consider discrete action steps so that both controllers (habitual and planning) create actions at each step. However, it is known that the habitual controller is faster than deliberative planning. Therefore, to imply the time constraint we set the arbitrator to give priority to the habitual controller. Moreover, the arbitrator is set to always take the action that is provided by the habitual system for the first two steps of every episode. However, from the third step on, the actor's prediction is taken if the habitual controller is reliable, meaning that the reward prediction error for the last experience is smaller than a threshold. We use abs(δ) < 1 in the following experiments. Otherwise, the action from the inverse model is selected.

The implementation of the arbitrator here is somewhat a minimal model suitable for our experimental setting and to highlight the consequences of such arbitration. Of course, it is possible to implement a more dynamic realization of such an arbitrator. For example, the threshold could itself be modulated according to the tasks and in this way produces a more rich speed-accuracy trade-off (Satel et al., 2005). Indeed, such modeling will open the possibility to discuss behavioral consequences with different system settings a ultimately compare them to variations in populations or psychiatric disorders such

as eduction and eating disorders (Huys et al., 2016). However, the simple implementation discussed in this paper captures the minimal assumptions as outlined above and is sufficient for the following simulations.

## 3.4. Experimental Conditions and Environment

To test the APAC model on a simulated robot arm with a target reaching task (**Figure 2**), we simulated a two-joint robotic arm whose range of motion at each joint was constrained to 180 degrees. The arm's motion was limited to a 2D plane of width 30 and height 30, upon which the arm's "shoulder" was fixed in the center (15,15). The initial length segment from the shoulder to the elbow was set to l<sup>1</sup> = 5, and the initial length of the lower segment ("hand" to "elbow") is l<sup>2</sup> = 8.

All experiments described herein had an episodic trial structure. At the beginning, the arm's position was set to zerodegree angle at the shoulder and 180-degree angle at the elbow. Time was discretized in the simulations, and the learning agent was given only 30 action-steps per trial to achieve the designated goal. We define a "target zone" as a circle centered at the target location with a radius rtarget = 0.5. The target is defined as "reached" once the robot arm is inside the target zone. If the goal was not reached within 30 time-steps, the trial was aborted and a new trial was started.

Importantly, the reward function is defined as the negative Euclidean distance between the end-effector and the center of the target area. This is for this example tasks the same information as is given to the supervised learner. This was deliberatively done so that the different systems are compared on the same feedback situation. It is possible to learn this task from much simpler feedback such as some reward if the target area is reached vs. no reward otherwise, although this would then also need more time to train the habitual system. The point of our study here is rather the direct comparison of decision components based on a value lookup vs. learning internal models.

Within this environment we define several conditions that defined the variety of the different target-reaching tasks studied here. These conditions include the target position (**static target** vs. **changing target** at each episode), kinematics (arm dimensions as **static kinematics** vs. **changing kinematics**), and vision (**occluded vision** vs. **perfect vision**). We tested all combinations of these factors.

Each experiment consisted of 1,000 episodes of maximal 30 action steps each. In the static target condition, the target is initialized randomly and stays unchanged for all 1,000 episodes. For the changing target condition, the target is located at a random location at the beginning of every episode. As discussed above, each episode was terminated when either (A) the target was reached, or (B) 30 time steps had elapsed. Targets were only placed within a reachable distance for the arm. The arm dimensions were kept fixed in the case of static kinematics; however, the length of both the upper and lower arm segments were increased by 0.001 at each time step for the changing kinematics condition. These changes were only started after the 100th episode of target reaching to provide some time for basic training. As already mentioned, an environmental noise was included in all experiments. In the occluded vision condition, the location of the arm and the target was unavailable for the agent during the movement. This task is also known as memory guided target reaching (Westwood et al., 2003; Heath et al., 2004). We repeated all static/changing kinematics and static/changing target conditions with our proposed models for the reaching-target task in the occluded vision condition. To examined the generalization of the models under each condition, we trained each model when targets are located only in a specific area that represents 2/3 of whole reaching area, and tested with targets located in the other part of the environment, which is the rest 1/3 of the reaching area.

## 4. RESULTS

We considered three versions of APAC that represent (a) exclusive habits, (b) exclusive deliberate planning, and (c) arbitration between habit and planning. Exclusive habit is when the arbitrator is set to always pick the action from the habitual system. In this case, the APAC behaves exactly like DDPG. If the arbitrator always selects the action from the inverse model for each step, then the APAC becomes an exclusive deliberate planning controller which we call supervised predictive actorcritic (SPAC) (Fard et al., 2017). The third model is when the APAC is able to arbitrate between the actions provided by the inverse model and the actor.

For each condition, we trained 50 independent instances of each model for a total of 1,000 episodes. At the end of the 1000th episode, all network parameters were frozen and no more training was applied. Subsequently, each of the independent model instances performed 100 target reaching episodes under the respective training conditions. In the case of the occluded vision, sensory input (i.e., visual target position) was initially presented at time step 0, and subsequently rendered unavailable. Since DDPG has no internal model and requires visual input throughout the task, we only compare APAC with SPAC in the experiments with occluded vision. All experiments were implemented and tested in Python (3.5) using the TensorFlow (1.3) package (Abadi et al., 2016) on an NVIDIA GeForce GTX 960 graphical processing unit.

**Figure 4** illustrate the percentage of trials that reach the target within 30 action steps during training under different conditions for 1,000 episodes each. The pure deliberate planning model SPAC reaches almost near perfect performance very fast after 100 episodes. Furthermore, SPAC's performance is very robust under different conditions and neither changes in reward function nor in kinematics effects the performance of SPAC very much. DDPG learns to reach almost 100% of the targets only under static target/static kinematics condition. Performance of DDPG drops slightly under changing kinematics compared to static kinematics; however, its performance drops dramatically (about 20%) under changing target conditions. This is of course expected as habits become invalid solutions under changing environments. Our point here is that APAC can reach almost all of the targets both under static and changing targets as good as SPAC, although it tends to use more habits than planning after a few trials of

learning(see **Figure 6**). The speed of learning in APAC is also very high and comparable to SPAC. In this sense it combines the benefits of DDPG and SPAC.

The above curves give an example of behavior of the models during one learning trials. To study how these results generalize we tested the performance of all three models after learning over 50 different learning trials with random initial conditions for the networks. For the static target location we tested on the target location, that was randomly chosen for each learning trial. However, with the changing target location we decided to cover the possible target locations more systematically and set target points on a regular grid in angle space. **Figure 5** displays average success rate over the 50 learning trials to reach these targets. All three models under the static target/static kinematics condition reach 100% of the targets. DDPG and APAC have slightly less success under static target/changing kinematics, while SPAC stays flexible under this condition. The major difference between DDPG and APAC become clear under changing target conditions, where DDPG's performance drops dramatically, while APAC obtains very good performance. SPAC is still very flexible to reach targets under changing target conditions. Overall it is remarkable how close APAC stays to the overall performance of SPAC in a situation where deliberative planning is the better choice.

The overall success rate does show the entire range of the solutions. We thus included the individual performances in terms of the average number of steps to reach the target. As can be seen, APAC needs to take sometimes more corrective steps to reach the target while an exclusive planning system can optimize the number of steps. This is interesting as this allows for different strategies in solving the task, that of relying somewhat on habitual control when the cost of the movement initiation might be small vs. more deliberate planning when the number of action steps might matter. This can explain a form of speed-accuracy trade-off.

FIGURE 6 | Arbitration between habitual and planning by the APAC: For each condition, the blue line indicates the average number of actions which are selected by the habitual controller (i.e., the actor), while the red line demonstrates the average number of actions selected by the planning controller (i.e., inverse model). Results illustrate that planning controller is used early in the training, while later agent tends to use the habits more.

**Figure 6** illustrates how APAC gradually shifts from a planning to a habitual control approach with increasing experience. After around 300 episode, more than 80% of APAC actions were taken from the habitual controller.Of course, SPAC uses planning control throughout the entirety of the task, so no commensurate figure was generated for it.

Since a habitual system should be faster than deliberate planning, this figure also illustrates that APAC would be less timeconsuming than the SPAC at the same task and under the same condition. To visualize average time consumption by each model under different conditions, we assumed that each action selected by the deliberative planning takes three times longer than an action selected by the habitual controller. The number here is arbitrary and only chosen visualize the general effect. The top row of plots in **Figure 7** show the average number of action steps that each of the three models need to complete the task at each episode under different conditions. These plots demonstrate that the number of steps are almost the same under static target/static kinematics conditions. Under changing target conditions DDPG needs more steps to complete the task than APAC and SPAC. However, when including a higher cost for deliberative planning in the plots shown in the bottom row of the **Figure 7**, the picture for the average number of time that is needed to complete the task changes. In this case, DDPG needs shorter time under static target conditions. However, under changing target conditions, APAC completes the task of reaching targets faster. DDPG takes longer to finish the task as it needs more corrective actions.

**Figure 8** shows all 50 runs to reach 100 targets of a reaching test under changing target conditions, with (bottom row) and without (top row) changing kinematics, for all three models.

The plots illustrate that DDPG has some difficulty reaching the locations at the edges of the possible target area due to nonlinearity of the mapping between angles and locations. SPAC can learn the mapping function much better, and the quality of APAC is similar to SPAC. Interestingly, although APAC tends to use more habits than deliberate planning, this model can still reach many more targets than DDPG, almost as good as SPAC.

We also tested a form of generalization of each model where a whole area of the target zone was not seen during training. This is a case of extrapolation compared to the interpolation trials in the previous generalization experiments. More specifically, we trained each model to reach the target located at a specific region in the environment and we tested each model to reach targets that are located on the unseen area of the environment (see the left most plot in the **Figure 9**). The same distribution of target locations has been used here and only those that are located in the blue area are set as targets for this experiments. Therefore, there are about 39 targets under static kinematics conditions and 42 targets under changing kinematics conditions (because of changing kinematics more targets will locate in the testing area). The results show that under static target training, neither model can reach even half of the targets. Their performance is worse under static target and changing kinematics. However, under changing target conditions, all models have obtained a good generalization but they need to take more than one step to reach any target. SPAC has again the best performance among other models under all conditions, while DDPG has the worst performance compared to other two models. These results indicate that learning with a static target hinders generalization as the learner overfits this specific target location.

Finally, we want to show results with occluded vision. Since the habitual controller (DDPG) requires sensory input at all times, only the SPAC and APAC models were compared under this condition. In these experiments, the arm moves toward the target when the target location is only visible at the first step. When the forward model indicates that the agent has reached the target it stops and the actual distance from the agent (here the arm) is measured. Results of these occluded vision test are summarized in **Figure 10** under changing target conditions, since the performance of both models under static target conditions are near perfect.

The target zone is marked by the red line in each plot, while the average distance from the agent to the target location is drawn as a black line. The blue area shows the range of the distances that the arm has experienced during the occluded trials when the forward model thinks that it has reached the target. SPAC shows better performance under occluded vision compared to APAC under all conditions. The average actual distance of the end-effector to the target location under changing target/ static kinematic with SPAC is only about a

FIGURE 9 | (Left) The reaching area for arm under static kinematics conditions. The red area is used for training while the blue region is for testing. Around edges are more colored since these points can be reached with many more sets of angles. (Right) Performance of each model is shown under different conditions to reach about 40 targets located in the blue region. All models obtain a good generalization under changing target conditions.

distance of 0.4, which is less than the target zone radius. However, the APAC model under the same conditions stays about a distance of 1 away from the target. Under changing target/changing kinematics, this average actual distance from end-effector to the target is around 0.7 for SPAC and about 1.5 for APAC. It thus seems that any form of habit should be suppressed in such situations, which could be achieved by a more advanced arbitrator.

## 5. CONCLUSION

This paper is about the study of a hybrid system with deliberate planning system and habitual control. Habitual control will, of course, be very good after long training in static environments. It was hence important to study the model in changing environments. We investigated the behavior of our proposed model (APAC) under changing target conditions (to manipulate the environmental reward function), changing kinematics of the agent (to manipulate the learned transitional model), and with and without vision. We also tested the model under various generalization conditions to see how good they can interpolate and extrapolate. The main results are classified as below:

**Adaptive to changes:** Results show significant improvement in performance when planning is available compared to the pure habitual system under the conditions when of changing environments that includes changing reward conditions and changing kinematics of the agent. In comparison, SPAC and APAC are flexible under these changes. These experiment shows that having an internal model is a key to robustness on changing environment.

**Moving from planning to habits:** By considering that planning is costly, having another control system that is able to provide less costly solutions can be useful. In this paper, there is no inherent time constraint or computation time difference between the habitual and planning controller. However, if we take this time constraint into account, the APAC is a better model than pure planning (SPAC). Indeed, the overall number of actions taken based on a planned action takes less time than the arbitrated model than the number of planned actions taken by the non-arbitrated model when considering some cost of planning.

**Reaching under occluded vision:** Since DDPG has no internal model, it can not use any sort of planning to move under occlusion. However, systems like SPAC and APAC build an internal model of the environment that enables them to anticipate and plan a target even when there is no visual information available. Results show that SPAC can perform better in the dark. This is a good example to show that a more sophisticated arbitrator could take different circumstances into account. For occluded vision, such an arbitrator should suppress habitual actions.

Our application example focused on the implementation of the habitual system as an actor-critic for reinforcement learning and a planning system with a forward and inverse model using supervised learning. There have been previous examples of combining both, some form of supervised learning with RL systems and the use of the internal model. In particular, Dyna-Q (Sutton, 1991) and supervised actor-critic (Rosenstein and Barto, 2002; Barto and Rosenstein, 2004) are examples of models that bring the capacity of planning into a modelfree reinforcement learning space. For example, the supervised actor-critic (Rosenstein and Barto, 2002; Barto and Rosenstein, 2004) is able to tune the actor manually and very fast when it is needed. This solution is beneficial when dynamics of the system changes dramatically or a new policy is needed to be learned in a very short time. These authors used a gain scheduler that weights the control signal provided by the actor from the reinforcement learner and the supervised actor. In contrast to supervised actor-critic, our proposed model autonomously learns the internal model and arbitrates between the two controllers automatically and not manually. The intention of our model is to study the interaction of habitual and planning systems in form of an arbitrator and ultimately to understanding human behavior.

Sutton proposed Dyna-Q that is an integrated model for learning and planning (Sutton, 1991). With respect to our model, Dyna-Q is also a blend of model-free and modelbased reinforcement learning algorithms. Dyna-Q can build a transition function and the reward function by hallucinating random samples. Therefore, although it uses a model-free paradigm at the beginning it becomes a model-based solution by learning the model of the world using the hallucination. In our model, the internal model is used to predict the future state of the agent unlike Dyna-Q that uses the model to train the critic and anticipate the future reward. Moreover, Dyna-Q starts from model-free controller and becomes a model-based controller. Hence, while Dyna-Q has focused on the utilization of internal models to learn a reinforcement controller, our model and study here is concerned with the arbitration of two control systems. However, since APAC tend to select actions from the planning controller that it learns very fast, it provides more accurate samples in the experience replay memory. Therefore, similar to the Dyna-Q, the habitual system takes advantage of learning from more valuable samples that lead the habitual controller to a better performance compared to the time that it is trained stand alone (in pure habitual paradigm).

Another interesting approach by Uchibe and Doya (2005) explores a collection of different controllers. Their work takes also different times constrains into account. In contrast to our model, the controllers are combined probabilistically in a more collaborative way while our approach focuses more specifically on understanding the competitive decision making of a deliberative vs. habitual systems. The experiments are also somewhat different. Even though the barriers in their experiment are static, it seems that their habitual controller can not learn this task. In our experiment we made sure that the experimental task is learnable by both controllers in the static case. In addition we study the performance with changing kinematics and changing targets.

Not only have human decision-making studies supported the notion that both habitual and planning controls are used during decision-making (Daw et al., 2011; O'Doherty et al., 2015), there is evidence that arbitration may be a dynamic process involving specific brain regions Lee et al. (2014). Our APAC model results suggest that such an arbitration strategy, wherein the planning paradigm is used until the habitual system's predictions become reliable can result in performance that is non-inferior to exclusive planning control in most cases. Thus, our APAC model supports (A) the importance and value of implementing predominantly planning control early in behavioral learning and (B) the diminishing importance of planning control with greater experience in a relatively static environment.

## DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the **Supplementary Files**.

## AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

## ACKNOWLEDGMENTS

The authors would like to thank Abraham Nunes for valuable input and acknowledge funding from Natural Science and Engineering Research Council of Canada (NSERC, Grant number 249885) and NS graduate scholarship.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot. 2019.00052/full#supplementary-material

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Sheikhnezhad Fard and Trappenberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# From Rough to Precise: Human-Inspired Phased Target Learning Framework for Redundant Musculoskeletal Systems

Junjie Zhou1,2,3, Jiahao Chen2,3,4, Hu Deng1,3 and Hong Qiao1,2,5 \*

*<sup>1</sup> State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, <sup>2</sup> School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, <sup>3</sup> Beijing Key Laboratory of Research and Application for Robotic Intelligence of "Hand–Eye–Brain" Interaction, Beijing, China, <sup>4</sup> Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China, <sup>5</sup> CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China*

Redundant muscles in human-like musculoskeletal robots provide additional dimensions to the solution space. Consequently, the computation of muscle excitations remains an open question. Conventional methods like dynamic optimization and reinforcement learning usually have high computational costs or unstable learning processes when applied to a complex musculoskeletal system. Inspired by human learning, we propose a phased target learning framework that provides different targets to learners at varying levels, to guide their training process and to avoid local optima. By introducing an extra layer of neurons reflecting a preference, we improve the Q-network method to generate continuous excitations. In addition, based on information transmission in the human nervous system, two kinds of biological noise sources are introduced into our framework to enhance exploration over the solution space. Tracking experiments based on a simplified musculoskeletal arm model indicate that under guidance of phased targets, the proposed framework prevents divergence of excitations, thus stabilizing training. Moreover, the enhanced exploration of solutions results in smaller motion errors. The phased target learning framework can be expanded for general-purpose reinforcement learning, and it provides a preliminary interpretation for modeling the mechanisms of human motion learning.

Keywords: musculoskeletal system, human-inspired motion learning, noise in nervous system, reinforcement learning, phased target learning

## 1. INTRODUCTION

Research on human-like musculoskeletal robots has become multidisciplinary in recent years, as it involves fields such as neuroscience and materials science for modeling and implementing musculoskeletal motor systems. In fact, this branch of robotics mainly comprises muscle models (actuators), skeletal systems (supporting structure), and methods for motion control and learning (control systems). Related work can roughly be divided into two types, namely, muscle dynamics modeling along with hardware design (Jäntsch et al., 2013; Kurumaya et al., 2016; Asano et al., 2017) and musculoskeletal robot control (Pennestrì et al., 2007; Jagodnik and van den Bogert, 2010; Tahara and Kino, 2010). Although most studies have been focused on the first type, the development of neuroscience has gradually increased the research on human-inspired control.

#### Edited by:

*Changhong Fu, Tongji University, China*

#### Reviewed by:

*Chenguang Yang, University of the West of England, United Kingdom Zhihao Xu, Guangdong Institute of Intelligent Manufacturing, Guangdong Academy of Sciences, China*

#### \*Correspondence:

*Hong Qiao hong.qiao@ia.ac.cn*

Received: *12 April 2019* Accepted: *15 July 2019* Published: *31 July 2019*

#### Citation:

*Zhou J, Chen J, Deng H and Qiao H (2019) From Rough to Precise: Human-Inspired Phased Target Learning Framework for Redundant Musculoskeletal Systems. Front. Neurorobot. 13:61. doi: 10.3389/fnbot.2019.00061*

As a multibody mechanical system (Stoianovici and Hurmuzlu, 1996; Shi and McPhee, 2000) comprising muscles and joints, the human musculoskeletal system has several advantages. For instance, muscle redundancy maintains the reliable operation of the musculoskeletal system when some muscles are fatigued or even damaged. Under control of the central nervous system, the musculoskeletal system can accomplish accurate and fine manipulation (Rasmussen et al., 2001; Chen et al., 2018). To unveil the mechanisms that provide such advantages, Hill studied the contraction properties of muscles, establishing the Hill model (Hill, 1938). From this fundamental work, a series of muscle dynamic models have been proposed (Huxley and Niedergerke, 1954; Eisenberg et al., 1980; Zahalak and Ma, 1990), but all of them present specific limitations. For instance, the simple second-order model (Cook and Stark, 1968; Agarwal et al., 1970) lacks independent nodal locations for external input signals, which indirectly affect the output. The Huxley contraction model (Huxley, 1957) is highly complex and no general-purpose method has been developed to obtain its parameters (Winters and Stark, 1987). The Hill model presents difficulties in measuring the fiber length during motion (Arnold and Delp, 2011).

Research has also been devoted to design hardware for emulating muscle characteristics. The Anthrob muscle unit (Jäntsch et al., 2013) and the sensor–driver integrated muscle module (Asano et al., 2015) try to resemble muscular structures. However, the weight and size of motors make hardware models notably diverge from biological muscles. Furthermore, resembling tiny human muscles through hardware design is difficult, thus undermining their applicability. In materials science, the synthesis of ideal materials for artificial muscles is being pursued to achieve the characteristics of biological muscles regarding size, weight, stiffness, and dynamic behavior. New materials for artificial muscles usually share some problems, including unsafe voltages and low strain. Accessory equipment can partly adjust the characteristics of materials. For instance, liquid-vapor transition has been used on a soft composite material (Miriyev et al., 2017) for implementation as an actuator in a variety of robotic applications. In addition, a coiled polymer muscle (Haines et al., 2014) controlled by varying water temperature prevents dependence on electricity. Hence, advanced design methods and materials seem promising to develop artificial muscles that closely reflect the dynamics of their biological counterparts.

Based on the abovementioned models, control systems developed for musculoskeletal robotics also face challenges. Redundant muscles and extremely complex tendon forces impose several barriers for direct solutions of muscle excitation. Widely used methods, such as inverse dynamics with static optimization (Crowninshield and Brand, 1981), computed muscle control (Thelen et al., 2003), proportional-derivative control (Jagodnik and van den Bogert, 2010), and PI-type iterative learning control (Tahara and Kino, 2010), are used to regulate musculoskeletal systems. Although some conventional methods, such as computed muscle control, theoretically compute muscle excitation signals, they also demand intensive computations for sophisticated processes (Chen et al., 2018). In addition, these control strategies are hardly supported by biological evidence showing that they resemble the approach of human motion learning.

In recent years, reinforcement learning has become a popular control method in robotics as it provides a natural-like approach to learn from the environment. In fact, as a method that fosters interaction with uncertain environments, reinforcement learning allows a learner to observe the environment and then execute appropriate actions. The environment provides rewards for each action, and the learner aims to maximize its rewards during decision-making. This learning process is similar to that of humans and animals (Sutton and Barto, 2018). Studies in neuroscience (Schultz et al., 1997; Law and Gold, 2009) verify this principle, and hence it is reasonable to consider humanlike learning from the viewpoint of reinforcement learning (Tesauro, 1995; Diuk et al., 2008; Riedmiller et al., 2009). Deep neural networks are adopted to implement reinforcement learning. Specifically, the deep Q-network (Mnih et al., 2015) uses a deep convolutional neural network to estimate the action-value function, making deep reinforcement learning a powerful weapon for a myriad of applications (Van Hasselt et al., 2016; Wang et al., 2016; Hou et al., 2017). However, when applied to the musculoskeletal system, the performances of deep neural networks can be unstable. Given muscle redundancy in the musculoskeletal system, the additional dimensions expand the solution space, hindering optimization through reinforcement learning.

In this study, we focused on the unstable training of musculoskeletal systems and the expanded solution space of excitations to provide three contributions. (1) The learning goal of humans, changes stepwise as learning proceeds over advancing levels. For example, running requires higher physical coordination than walking, and one cannot run before learning to walk. Thereby, the learner target evolves from walking to running during this process. Based on this principle, we propose the phased target learning (PTL) framework that reduces the computational cost for exploration in a highdimensional solution space. In addition, phased targets guide the convergence of excitations to the expected value during training. (2) As sensory information may be encoded by opposite tuning neurons (Romo and de Lafuente, 2013), we improve an MLP-based Q-network by introducing an extra layer of neurons reflecting preference and using various relative action probabilities from value functions for obtaining continuous outputs to control a musculoskeletal arm model. (3) As noise exists in the nervous system (A Aldo et al., 2008) and based on information transmission in the human nervous system (Dhawale et al., 2017), we introduce two noise sources at the sensor and execution levels into the proposed PTL framework. These noise sources increase the exploration capacity in the solution space during training and strengthen the control robustness.

In this paper, in section 2, we introduce the muscle dynamics, the structure of the arm model, and detail the musculoskeletal system considered in this study. Moreover, optimization of the proposed PTL framework is outlined. Then, the PTL framework with the biological noise sources is introduced in section 3.

Experimental results and conclusions are presented in sections 4 and 5, respectively.

#### 2. MUSCULOTENDON MODEL AND MUSCULOSKELETAL ARM MODEL

Modeling muscles is difficult because most parameters cannot be measured precisely in real time (Arnold and Delp, 2011). According to the Hill model (Hill, 1938), which defines that a muscle is made up of separate elements, such as contractile elements (CE), passive elements (PE), and series elastic elements (SEE) (Zajac, 1989; Thelen et al., 2003), we design a control framework for musculoskeletal systems.

#### 2.1. Musculotendon Model

To determine the way a human can control complex muscle systems, a muscle dynamic model is necessary. Let u ∈ [0, 1] denote an idealized muscle excitation signal. According to a nonlinear first-order differential Equation (1), muscle activation signal a can be computed (Thelen, 2003):

$$\frac{da}{dt} = \frac{u - \hat{a}}{\tau(u, a)}\,,\tag{1}$$

where τ varies according to idealized muscle excitation signal u and activation signal a (Winters, 1995), aˆ is the activation signal after normalization, and a is transmitted to the muscle contraction dynamic model as a final control signal.

Before introducing the muscle contraction dynamics, the structure of a Hill-type muscle model is shown in **Figure 1**, where l T and l <sup>M</sup> are the lengths of the tendon and muscle fiber, respectively, and α is the muscle pennation angle (Garner and Pandy, 2003). When the activation signal a is transmitted to the muscle, the corresponding muscle force is generated by contraction. Then, the muscle force pulls the skeletons to generate motion or to maintain the balance of forces.

Suppose that signal u is known. To calculate tendon force F T , some assumptions are required. First, F T , F CE , F PE > 0 because muscles move the skeleton by tension instead of thrusting. Second, the change of muscle width can be ignored during muscle contraction (Matthew et al., 2013). Third, muscle mass can be ignored. Using these assumptions, the dynamics of muscles can be described. Specifically, a pennation angle α can be obtained from

$$l\_s^M \sin(\alpha\_0) = l^M(t) \sin(\alpha(t)) \,, \tag{2}$$

where l M s and α<sup>0</sup> are the slack length of a muscle fiber and initial pennation angle, respectively, which also define the initial muscle width, l <sup>M</sup>(t) and α(t) are the length of the muscle fiber and pennation angle at time t, respectively. From α(t), tendon force F T can be computed by a piecewise nonlinear equation (Proske and Morgan, 1987; Thelen, 2003). In addition, the contraction velocity of a muscle fiber is necessary for the model. To determine this velocity, active force F CE produced by the contractile element should be obtained first. According to the geometric relationship between tendon and muscle fiber (**Figure 1**), F CE can be calculated indirectly as follows:

$$F^{CE} = \frac{F^T}{\cos(\alpha)} - F^{PE},\tag{3}$$

where F PE is the passive force of the muscle fiber. During simulations, the muscle length sometimes causes numerical problems that result in F CE < 0, which clearly violate the first assumption about muscles. Therefore, a constraint should be added to avoid exceptional cases:

$$F^{\rm CE} = \max\{F^{\rm CE}, 0\} \,. \tag{4}$$

Then, contraction velocity v <sup>M</sup> can be computed by another piecewise non-linear equation (Matthew et al., 2013):

$$\nu^M = f\_\nu^{-1}\left(\frac{F^{CE}}{af\_l(l^M)}\right),\tag{5}$$

where f<sup>v</sup> is the force–velocity function, f −1 v is its inverse function, and f<sup>l</sup> is a Gaussian function with variable l <sup>M</sup> (Winters, 1990). As a key variable in the muscle dynamics model, v <sup>M</sup>(t) affects l <sup>M</sup>(t +1) at every timestep. Variable l <sup>M</sup> is the fiber length and l MT is the muscle length, which comprises fiber and tendon. Length l <sup>M</sup> can be calculated directly using v <sup>M</sup> and F T , whereas l MT can be measured. Consequently, if signal u(t) is known, the contraction states of the muscle and tendon force F T (t) can be computed.

#### 2.2. Musculoskeletal Arm Model

In the remainder of this section, we first establish a simplified arm model to connect muscles and bones. Then, we analyze the kinematic relationship between the arm model and muscle model. Finally, a control framework is outlined using this relationship.

According to the Newton–Euler equation (Zixing, 2000; Hahn, 2013), we establish a two degree-of-freedom model (**Figure 2**) that consists of two segments and four muscles. Then, expected torque τ<sup>n</sup> at the joints can be calculated as

$$\pi\_n = \frac{\partial \, W}{\partial \theta} = M(\theta)\ddot{\theta} + \mathcal{C}(\theta, \dot{\theta})\dot{\theta} + \mathcal{G}(\theta) \, , \tag{6}$$

where W is the work from external forces, θ˙ is the vector of rotational velocity, θ¨ is the vector of rotational acceleration, M(θ) ∈ R n×n and C(θ, θ˙)θ˙ ∈ R n is the inertia matrix and the centripetal and Coriolis force, respectively, and G(θ) ∈ R n is the gravitational force vector of our model. During forward calculation, Equation (6) provides a way to compute expected torques for known motion states. During inverse calculation, it can be used to compute actual angular acceleration.

#### 2.3. Musculotendon Model Into Arm Model

In this section, we obtain the relationship between torques and motion states and define the adopted learning approach.

Unlike conventional robots that use a single joint motor to generate torque, each joint in a musculoskeletal system is usually affected by more than one muscle. Let τ<sup>i</sup> be the muscle torque generated by muscle i:

$$\pi\_i = F\_i^T l\_{i2} \sin \chi\_i, \ i = 1, 2, \ldots, n \ , \tag{7}$$

where F T i is the tendon force of muscle i and γ<sup>i</sup> is the angle between the muscle and related bone. **Figure 3** provides geometric details of the muscles and bones. We set m<sup>1</sup> = 2 and d<sup>1</sup> = 0.3 as the mass and length of the upper arm, respectively, whereas m<sup>2</sup> = 1.8 and d<sup>2</sup> = 0.3 are the mass and length of the forearm, respectively. For the given geometry of the musculoskeletal model, the muscle torque can be written as

$$\begin{cases} \mathbf{r}'\_{n1} = \mathbf{r}\_1 - \mathbf{r}\_2 = F\_1^T l\_{12} \sin \varphi\_1 - F\_2^T l\_{22} \sin \varphi\_2\\ \mathbf{r}'\_{n2} = \mathbf{r}\_3 - \mathbf{r}\_4 = F\_3^T l\_{32} \sin \varphi\_3 - F\_4^T l\_{42} \sin \varphi\_4 \end{cases} . \tag{8}$$

In addition, the geometric parameters can be used to compute sin γ<sup>i</sup> :

$$\begin{cases} \sin\varphi\_1 = \begin{cases} \sin\varphi\_1 = \frac{l\_{11}\cos\theta\_1}{\sqrt{l\_{11}^2 + l\_{12}^2 - 2l\_{11}l\_{12}\sin\theta\_1}}\\ \sin\varphi\_2 = \begin{cases} \frac{l\_{21}^2 + l\_{22}^2 - 2l\_{21}l\_{12}\cos\theta\_1}{\sqrt{l\_{21}^2 + l\_{22}^2 - 2l\_{21}l\_{22}\cos\theta\_1}}\\ \frac{d\_{33}\cos\frac{\theta\_2}{2}}{\sqrt{l\_{32}^2 + d\_{33}^2 - 2l\_{32}d\_{33}\sin\frac{\theta\_2}{2}}} \end{cases} \\ \sin\varphi\_4 = \begin{cases} \frac{d\_{43}\cos\frac{\theta\_2}{2}}{\sqrt{l\_{42}^2 + d\_{43}^2 + 2l\_{42}d\_{43}\sin\frac{\theta\_2}{2}}} \end{cases} \end{cases} \tag{9}$$

In muscles 3 and 4 (**Figure 2**), we introduce two turning points at the angular bisector of the elbow to design polyline muscles, where d<sup>33</sup> and d<sup>43</sup> are the distances from the elbow to the turning points of muscles 3 and 4, respectively. From Equation (9), it is clear that sin γ<sup>i</sup> is a nonlinear function of θ<sup>i</sup> . By substituting Equation (9) into Equation (8), we obtain muscle torque functions τ ′ n1 (F T 1 , F T 2 , θ1) and τ ′ n2 (F T 3 , F T 4 , θ2).

For the muscle description in our arm model, it is difficult to determine its inverse function, because F T and Equation (5) are piecewise functions with complicated expressions. Therefore, we usually cannot calculate u<sup>i</sup> by directly using muscle force, but instead we adopt an indirect method.

We assume that expected states θ<sup>i</sup> , θ˙ <sup>i</sup> and θ¨ <sup>i</sup> are given. Expected torque τ<sup>n</sup> can be calculated by Equation (6) as a learning target. On the other hand, actual tendon force F T is known when corresponding excitation signals u are generated, and hence actual torque τ ′ <sup>n</sup> is calculable. To obtain actual angular accelerations θ¨ i , Equation (6) can be computed reversely. In general, θ¨ can be rewritten as θ¨(τn, θ, θ˙). Considering θ˙ = dθ dt and θ¨ = dθ˙ dt , joint angle θ at time (t + 1) can be obtained as

$$\left(\theta\_{t+1}(\mathfrak{r}\_n, \theta\_t(\check{\theta}\_{t-1}), \dot{\theta}\_t(\check{\theta}\_{t-1}), \ddot{\theta}\_t)\right). \tag{10}$$

If tendon force vector F T satisfies

$$\left(\tau\_n(\theta,\dot{\theta},\ddot{\theta}) = \tau\_n'(F^T,\theta)\right),\tag{11}$$

we can rewrite Equation (11) as

$$
\left(\partial\_{t+1}\left(\pi\_n'(F\_t^T,\theta\_t),\theta\_t(\ddot{\theta}\_{t-1}),\dot{\theta}\_t(\ddot{\theta}\_{t-1}),\ddot{\theta}\_t\right)\right).\tag{12}
$$

The purpose of our framework is to find appropriate excitation signals u to generate tendon forces that satisfy Equation (11). As a result, the expected motions will be generated during exploration. Based on Equation (12), we establish a training framework for the musculoskeletal arm model. When excitation signal u is given, corresponding activation signal a and tendon force F T can be calculated by muscle dynamics. Then, new motion states can be solved using the arm model. If excitation signal u is unknown, we should explore candidate solutions to generate F T satisfying (Equation 11).

#### 3. HUMAN-INSPIRED PHASED TARGET LEARNING FRAMEWORK

We design a learning framework to solve signal u<sup>i</sup> . Conventional learning frameworks use expected states as the learning target. However, these targets can cause unforeseen problems during the solving process, and solutions can fall into local optima. In contrast, the proposed PTL framework can avoid local optima by guiding the learning process. Specifically, different learning targets are designed according to the learner's level, additionally providing high efficiency during training. We consider the musculoskeletal system, optimization model, and expected target state as the most essential aspects in our framework (**Figure 4**) and detail the last two parts in the sequel.

#### 3.1. Phased Target Learning 3.1.1. Simplified Target Setup

Consider a beginner who starts to learn dancing or practicing a sport. It is difficult for him to acquire all the professional postures and skills at once. Instead of trying to enhance memory or learning skills, the simplest solution is reducing the quality requirements and perform intensive practice through gradual improvement. This way, the beginner will easily improve by establishing simple learning targets that are gradually set at different levels as learning proceeds. In this study, we calculated precise motion states to be expected targets. Then, we designed different simplified states as easier targets for learning. Formally, let s ∈ S be the expected states of the arm model, and s<sup>T</sup> ∈ S<sup>T</sup> be the simplified states. s<sup>T</sup> can be calculated by simplifying s:

$$s\_T(t) = s(cell(\frac{t}{d}) \cdot d) \cdot \delta(0), t = 1, 2, \dots, T,\tag{13}$$

where δ(t) is an impulse function, and d ∈ N<sup>+</sup> satisfying <sup>d</sup> T ∈ [ 1 T , 1] is a forgetting factor. When d = T, s<sup>T</sup> only reflects the endpoint state of expected state s, and when d = 1, s<sup>T</sup> = s, indicating that s<sup>T</sup> reflects all the states of S.

Obviously, simplification induces errors with respect to expected states. Suppose that θ ∈ S is the expected joint angle of the arm model, and θ<sup>T</sup> ∈ S<sup>T</sup> is the simplified joint angle. Then, we define the average allowed error between s and s<sup>T</sup> as

$$\overline{\boldsymbol{e}}\_{T} = \frac{1}{T} \sum\_{t=1}^{T} \left| \boldsymbol{\theta}(t) - \boldsymbol{\theta}\_{T}(t) \right| \,. \tag{14}$$

According to Equations (13) and (14), average allowed error e<sup>T</sup> depends only on the forgetting factor d. Geometrically, e<sup>T</sup> can be considered as the width of the equivalent error region. **Figure 5** shows the width and effect of d on simplified joint angle curve θT.

PTL provides different simplified targets for learning at varying training phases. When the motion accuracy achieves the average allowed error range, eT, a new and smaller average allowed error range is given to guide training. Then, we define actual average error e<sup>R</sup> of motion as

$$\overline{e}\_{\mathbb{R}} = \frac{1}{T} \sum\_{t=1}^{T} |\theta\_T(t) - \theta\_\mathbb{R}(t)|\,. \tag{15}$$

Unlike Equation (14), Equation (15) uses the actual joint angle, θR. In addition, e<sup>T</sup> is updated after each training iteration. A new average allowed error is computed only when

$$
\overline{e}\_T - \overline{e}\_R > 0 \,. \tag{16}
$$

$$d = \begin{cases} D(d), & \overline{e}\_T - \overline{e}\_R > 0, D(d) \gg 1; \\ 1, & \overline{e}\_T - \overline{e}\_R > 0, D(d) < 1; \\ d, & \overline{e}\_T - \overline{e}\_R \lesssim 0; \end{cases} \tag{17}$$

Equation (17) is the update rule of forgetting factor d, where D(d) is a function that satisfies D(d) < d. It is convenient to maintain the value of |d − D(d)| small, because a large difference between adjacent simplified states vanishes the gradual learning effect.

FIGURE 4 | Schematic of proposed PTL framework for motion control of musculoskeletal robots. A vision sensor collects motion information. Then, visual stimuli are transmitted to the optimization model and performance evaluation module. During optimization, state information is processed by a multi-layer perceptron. Then, perceptual decisions (excitation signals) are transmitted to the arm model as optimization results. During performance evaluation, different phased targets are designed to guide arm motion states. Finally, the evaluation results are transmitted to the optimization model for improved decision-making. In addition, two biological noise sources are considered during learning for improved exploration ability.

#### 3.1.2. Performance Evaluation Function

Conventional temporal-difference learning methods are highly suitable for model-free learning. Considering Equation (11), the inverse function of τ ′ n should be determined and can be set as a model-free problem. In this study, we aimed to improve the Q-network to estimate the continuous excitation signal u for musculoskeletal systems. Then, we combined it with PTL to calculate appropriate control signals.

Let T be the number of finite timesteps and u<sup>i</sup> be the excitation signal for muscle i. Each signal ui(t) at time t has two possible actions; either increase [ai,1(t)] or decrease [ai,2(t)]. The adjustment of u<sup>i</sup> affects the muscle and musculoskeletal model at time (t + 1).

However, the two actions only determine the increment sign, and additional parameters are required to calculate the step sizes. Furthermore, the difference between adjacent states can hinder perceptron learning from input states during training. Moreover, incorrect adjustments can lead to signal oscillation in the redundant musculoskeletal model.

In human cortical circuits, sensory information is encoded by neurons via opposite tuning (Romo and de Lafuente, 2013). Based on this mechanism, we redefine action-value function Qu<sup>i</sup> ,j

as a probability of signal u<sup>i</sup> executing action ai,<sup>j</sup> . Equation (18) defines u<sup>i</sup> as

$$u\_i = \frac{1}{\sum\_{j=1}^{2} Q\_{u\_i j}} (Q\_{u\_i, 1} u\_{\text{max}} + Q\_{u\_i, 2} u\_{\text{min}}), \ i = 1, 2, \dots, n \tag{18}$$

and action-value function Qu<sup>i</sup> ,j is redefined as

$$Q\_{\boldsymbol{u}\_{i}\boldsymbol{j}}(\boldsymbol{s}\_{t},\boldsymbol{a}\_{i,\boldsymbol{j},t}) = \mathbb{E}\left[E\_{\boldsymbol{u}\_{i}}(\boldsymbol{s}\_{t+1},\boldsymbol{a}\_{i,\boldsymbol{j},t+1}) + \gamma Q\_{\boldsymbol{u}\_{i}\boldsymbol{j}}(\boldsymbol{s}\_{t+1},\boldsymbol{a}\_{i,\boldsymbol{j},t+1})\right],$$

$$\boldsymbol{i} = 1,2,...,\boldsymbol{n};\;\boldsymbol{j} = 1,2,\tag{19}$$

where Eu<sup>i</sup> is an evaluation function related to arm motion. According to Equations (18) and (19), a specific action value of a function is not enough to obtain the excitation signal in our method. Instead, relative values of different functions determine an excitation signal, and thus Qu<sup>i</sup> ,1 and Qu<sup>i</sup> ,2 should be maintained balanced. In addition, note that Eu<sup>i</sup> is used in Equation (19) instead of conventional reward function Ru<sup>i</sup> . This is because the Ru<sup>i</sup> is a decreasing function of the action error, and during training, reducing action errors increases Ru<sup>i</sup> and Qu<sup>i</sup> ,j . In this case, the balance of action-value functions is affected by increasing Qu<sup>i</sup> ,j . Therefore, we employ evaluation function Eu<sup>i</sup> , which is an increasing function of the action error. Reducing errors therefore imply smaller Eu<sup>i</sup> and a weaker effect than Ru<sup>i</sup> on the balance of action-value functions. Furthermore, (Eu<sup>i</sup> )min > 0 promotes stability, as detailed in section 3.1.3.

We obtain the performance evaluation function as follows:

$$E\_{\mu\_l}(e\_{\mathbb{R}}) = p \cdot \exp\left[m \cdot g^2(e\_{\mathbb{R}})\right] + k \tag{20}$$

$$g(e\_{\mathbb{R}}) = \min\left[|e\_{\mathbb{R}}|, e\_0\right] \,, \tag{21}$$

where p, m, k > 0 are parameters of Eu<sup>i</sup> and function g(eR) prevents exploding gradients under large errors.

#### 3.1.3. Learning by Gradient Descent

We define the loss function by summing the squared errors between expected action value Q ′ ui ,j and actual action value Eui ,<sup>j</sup> + γ Q ′ ui ,j :

$$L(\theta) = \frac{1}{2} \mathbb{E} \left[ \sum\_{i=1}^{n} \sum\_{j=1}^{2} \left( E\_{u\_{i}j} + \nu Q'\_{u\_{i}j} \langle s', a'; \theta' \rangle - Q\_{u\_{i}j} \langle s, a; \theta \rangle \right)^{2} \right], \tag{22}$$

where γ is a factor to discount the future action value. The gradient of the loss function is given by

$$\nabla L(\boldsymbol{\theta}) = \mathbb{E}\left[\sum\_{i=1}^{n} \sum\_{j=1}^{2} \mathcal{V}\left(E\_{u\_{i}j} + \mathcal{V}Q\_{u\_{i}j}'(s', a'; \boldsymbol{\theta}') - Q\_{u\_{i}j}(s, a; \boldsymbol{\theta})\right)\right]$$

$$\nabla Q\_{u\_{i}j}'(s', a'; \boldsymbol{\theta}')\Big|\,. \tag{23}$$

During backpropagation, the outputs of multi- layer perceptron in our model can be easily obtained. We suppose that Q ′ ui ,j represents the result of the output layer and can be expressed as

$$Q'\_{u\_i j} = f\left(\sum\_{h=1}^{n\_h} \omega\_{hk} y\_h\right),\tag{24}$$

where f(x) is the sigmoid activation function, ωhk is the weight of the edge from the h-th node in the hidden layer to the k-th node in the output layer. Consider ωhk as an example, the weight increment is given by

$$
\Delta \alpha \rho\_{hk} = -\eta \frac{\partial L}{\partial \alpha \rho\_{hk}} \tag{25}
$$

$$=-\eta \sum\_{i=1}^{n} \sum\_{j=1}^{2} \chi \left( E\_{u\_{i\cdot j}} + \chi Q\_{u\_{i\cdot j}}' - Q\_{u\_{i\cdot j}} \right) \frac{\partial Q\_{u\_{i\cdot j}}'}{\partial \alpha \eta\_{lk}} \tag{26}$$

$$\eta = -\eta \sum\_{i=1}^{n} \sum\_{j=1}^{2} \chi \left( E\_{\underline{n},j} + \chi Q\_{\underline{n},j}' - Q\_{\underline{n},j} \right) f' \left( \sum\_{h=1}^{n\_b} \alpha\_{\rm hk} \eta\_h \right) \sum\_{h=1}^{n\_b} \eta\_h,\tag{27}$$

where y<sup>h</sup> is the output of the h-th node in the hidden layer. When the excitations become stable, the expected increment is 1ωhk → 0 such that 1Qu<sup>i</sup> ,<sup>j</sup> → 0, and hence Eu<sup>i</sup> ,<sup>j</sup> +γ Q ′ ui ,<sup>j</sup> = Qu<sup>i</sup> ,j at this time. Factor γ is known as a decimal, and we can infer γ Q ′ ui ,<sup>j</sup> < Qu<sup>i</sup> ,j , which explains why the performance evaluation function should satisfy (Eu<sup>i</sup> ,j)min > 0.

#### 3.2. Noise in Nervous System

Noise is ubiquitous in real-world systems, especially during information transmission. As motion learning consists of information transmission, noise is present. Recent research roughly identified noise sources in the nervous system at the sensor and action levels (A Aldo et al., 2008). We considered these noise sources in the proposed PTL framework.

#### 3.2.1. Noise at Sensor Level

During the collection of visual information, photoreceptors receive photons reflected by objects under the influence of Poisson noise, which reduces the accuracy of optical information (Bialek, 1987). Although sensory noise is inevitable (Bialek and Setayeshgar, 2005), it also mitigates sensitivity of the redundant musculoskeletal system.

When motion tracking is performed on the redundant musculoskeletal arm model, the Q-network method can exhibit unstable training, because joint angles are affected by the action of many muscles, likely falling into local optima. Then, any small fluctuation of excitation signals can be amplified and cause divergent signals. However, when target motion is considered as a region, fluctuations are tolerated. We use Poisson noise to conform tolerance regions and prevent rapid fluctuations:

$$s\_{\mathbb{R}\mathbb{N}} = s\_{\mathbb{R}} + N\_1 \,, \tag{28}$$

$$N\_1 \sim Pois(\lambda)\,,\tag{29}$$

where s<sup>R</sup> is the actual arm state, sRN is the observed arm state observed by the vision sensor, and N<sup>1</sup> is Poisson noise in the visual information. In our algorithm, let s<sup>R</sup> = sRN represents the inputs of the improved Q-network.

#### 3.2.2. Noise at Execution Level

Noise at the sensor level is also called planning noise, as it affects decision-making. In addition, execution noise exists and is superimposed on the original decision signals. In fact, execution noise describes an uncontrollable noise whose standard deviation is linearly related to the mean muscle force (Hamilton et al., 2004; Dhawale et al., 2017) and can be expressed as

$$
\mu\_{\text{Ni}} = \min\left[\max\left[\mu\_{\text{i}} + N\_{\text{2}}, 0\right], 1\right], \tag{30}
$$

where N<sup>2</sup> ∼ N(0, (vF<sup>T</sup> ) 2 ) simulates noise in the motor system periphery, u<sup>i</sup> and uNi are undisturbed and noisy signals from perceptron, respectively, and v is a scale coefficient of tendon force F T . Note that the square of vF<sup>T</sup> defines the variance of execution noise, and like noise in sensor level, let u<sup>i</sup> = uNi represent the final outputs of the proposed network.

#### 4. SIMULATION EXPERIMENTS

We conducted simulation experiments on the musculoskeletal system model to verify the performance of different algorithms. Moreover, the equilibria of action values are analyzed to explain the learning process of the proposed PTL framework.

#### 4.1. Experimental Setup

As mentioned above, we designed a simplified musculoskeletal arm model to verify and evaluate the proposed learning method. After analyzing its dynamics (Equation 12), a basic control framework is devised. To validate the formulation and analyze performance, optimization should be performed.

In this study, the proposed PTL is applied to a point-topoint motion task with constant angular velocity as temporaldifference learning approach. For a final state of target motion, we calculated midpoints and required constraints using inverse kinematics. Then, we used joint angles as motion states to design the simplified target states. Assuming a constant angular velocity, four types of control strategies were evaluated: (1) Q-network, (2) Q-network with noises, (3) PTL, and (4) PTL with noises. The implemented method including PTL is detailed in Algorithm 1.

We set maximum number of iterations K = 500 and number of timesteps T = 10, 000 to simulate 10 s. All the errors and control signals were recorded at each timestep.

#### 4.2. Results and Analysis

We considered average error e = 1 T P<sup>T</sup> t=1 |θ(t) − θR(t)| as a key performance indicator, where θ(t) is the precise expected joint angle at time t. As e reflects the average error, motion performance can be evaluated from this measure.

**Figure 6** shows the average error e according to iteration k. Clearly, the Q-network method, Q-network with noises, and PTL are trapped at local optima and unstable during training. Still, phased targets improve learning by increasing the randomness of exploration, and noises during training enhance fault tolerance and the exploration ability during control.

Assume that the ratio of action-value functions is convergent to local optimum b<sup>i</sup> , which is defined as

$$b\_i = \frac{Q\_{u\_i,1}}{Q\_{u\_i,2}}\,. \tag{31}$$

**Algorithm 1:** PTL with Noises for Motion Learning in Musculoskeletal System.

	- 1: **for** k=1 **to** K **do**

```
6: end if
```

17: **end for**

Then, u<sup>i</sup> can be rewritten as

$$\mu\_i = \frac{1}{\sum\_{j=1}^2 Q\_{\mu\_i j}} (Q\_{\mu\_i, 1} u\_{\text{max}} + Q\_{\mu\_i, 2} u\_{\text{min}}) \tag{32}$$

$$=\frac{1}{b\_i+1}(b\_i\mu\_{\text{max}}+\mu\_{\text{min}})\,,\tag{33}$$

and hence the equilibrium point b<sup>i</sup> is the only parameter that affects excitation signal u<sup>i</sup> . We prescribe that the control method adjusts Qu<sup>i</sup> ,1 and Qu<sup>i</sup> ,2 in an opposite way. In addition, increment 1Qu<sup>i</sup> ,<sup>j</sup> satisfies 1Qu<sup>i</sup> ,<sup>j</sup> > 0 and 1Qu<sup>i</sup> ,<sup>j</sup> ≪ Qu<sup>i</sup> ,<sup>j</sup> at simulation onset. The next equilibrium point at time (t + 1) is b ′ <sup>i</sup> = (Qu<sup>i</sup> ,1 ∓ 1Qu<sup>i</sup> ,1)/(Qu<sup>i</sup> ,2 ± 1Qu<sup>i</sup> ,2), whose increment is given by

$$b\_i - b\_i' = \frac{Q\_{u\_i,1}}{Q\_{u\_i,2}} - \frac{Q\_{u\_i,1} \mp \Delta Q\_{u\_i,1}}{Q\_{u\_i,2} \pm \Delta Q\_{u\_i,2}},\tag{34}$$

$$=\frac{\pm(Q\_{\boldsymbol{\mu}\_{i},1}\Delta Q\_{\boldsymbol{\mu}\_{i},2} + \Delta Q\_{\boldsymbol{\mu}\_{i},1}Q\_{\boldsymbol{\mu}\_{i},2})}{Q\_{\boldsymbol{\mu}\_{i},2}(Q\_{\boldsymbol{\mu}\_{i},2} \pm \Delta Q\_{\boldsymbol{\mu}\_{i},2})}\,. \tag{35}$$

For (−1Qu<sup>i</sup> ,1, +1Qu<sup>i</sup> ,2), we obtain b<sup>i</sup> − b ′ <sup>i</sup> > 0, and excitation signal u<sup>i</sup> becomes smaller. For (+1Qu<sup>i</sup> ,1, −1Qu<sup>i</sup> ,2), as Qu<sup>i</sup> ,2 − 1Qu<sup>i</sup> ,2 > 0, we obtain b<sup>i</sup> − b ′ <sup>i</sup> < 0, and excitation signal u<sup>i</sup> becomes larger.

However, with reducing motion error, the increment of function Qu<sup>i</sup> ,j is smaller for Qu<sup>i</sup> ,<sup>j</sup> ≈ 1Qu<sup>i</sup> ,j . From Equation

(35), when (+1Qu<sup>i</sup> ,1, −1Qu<sup>i</sup> ,2), the sign of (b<sup>i</sup> − b ′ i ) depends on the sign of (Qu<sup>i</sup> ,2 − 1Qu<sup>i</sup> ,2). Nevertheless, it is difficult to guarantee either (Qu<sup>i</sup> ,2 6 1Qu<sup>i</sup> ,2) or (Qu<sup>i</sup> ,2 > 1Qu<sup>i</sup> ,2). The uncertain sign causes chattering on the excitation signal (Equation 33), which can cause signal divergence at the final state.

In addition, random factors like ǫ and noise can give rise to fluctuations of 1Qu<sup>i</sup> ,j , which may increase the adjustment extent. For example, if (+1Qu<sup>i</sup> ,1, +1Qu<sup>i</sup> ,2) or (−1Qu<sup>i</sup> ,1, −1Qu<sup>i</sup> ,2), the increment of b<sup>i</sup> is given by

$$b\_i - b\_i' = \frac{Q\_{\mu\_{i,1}}}{Q\_{\mu\_{i,2}}} - \frac{Q\_{\mu\_{i,1}} \pm \Delta Q\_{\mu\_{i,1}}}{Q\_{\mu\_{i,2}} \pm \Delta Q\_{\mu\_{i,2}}} \tag{36}$$

$$=\frac{\pm(Q\_{\boldsymbol{\mu}\_{i},1}\Delta Q\_{\boldsymbol{\mu}\_{i},2} - \Delta Q\_{\boldsymbol{\mu}\_{i},1}Q\_{\boldsymbol{\mu}\_{i},2})}{Q\_{\boldsymbol{\mu}\_{i},2}(Q\_{\boldsymbol{\mu}\_{i},2} \pm \Delta Q\_{\boldsymbol{\mu}\_{i},2})},\tag{37}$$

where (Qu<sup>i</sup> ,11Qu<sup>i</sup> ,2 − 1Qu<sup>i</sup> ,1Qu<sup>i</sup> ,2) with an uncertain sign can seriously undermine performance, as it is directly related to the sign of (b<sup>i</sup> − b ′ i ). Furthermore, performance may decay even without condition Qu<sup>i</sup> ,<sup>j</sup> ≈ 1Qu<sup>i</sup> ,j , and the method will be unreliable under its influence. Fortunately, with appropriate training, performance degradation by random effects can almost be eliminated.

Another problem is early convergence during learning. **Figure 7** shows the evolution of the average allowed error. The four evaluated methods terminate searching when reaching different local optima. Generally, premature convergence occurs through the insufficient exploration of solutions. Given its exploration ability, the proposed PTL with noises was guided by simplified targets to avoid premature convergence. This method achieved the lowest error (average e < 0.746cm) and the most advanced learning level throughout repeated experiments.

$$b\_i = \frac{Q\_{\mu\_i,1}}{Q\_{\mu\_i,2}} + \Delta b\_i \tag{38}$$

We define 1b<sup>i</sup> in Equation (38) as a small increment of the equilibrium point caused by the allowed error eT. As Qui ,1 Qui ,2 is not at the expected equilibrium point b<sup>i</sup> , Qu<sup>i</sup> ,<sup>j</sup> cannot easily generate large fluctuations. According to the analyses above, <sup>Q</sup>ui ,1 Qui ,2 will converge to the final equilibrium point b<sup>i</sup> when t = T.

**Figure 8** shows signal u<sup>i</sup> learned using PTL with noises and the corresponding tendon force, F T i . **Figure 9** shows the final position of the arm and joint angles. These results show that the most substantial errors occur at motion onset, and only slight fluctuations remain afterwards. At motion onset, it is reasonable to believe that unexpected muscle forces, especially passive forces of muscles 1 and 3, disturb the force balance. As the simulation proceeds, the arm model returns to a balance state by adjusting u<sup>i</sup> . Therefore, PTL extends learning and guides toward the next expected solutions. In addition, the noises foster an extensive exploration of the solution space during training.

To further evaluate PTL framework, we consider pointto-point motion through two scenarios. First, motion begins from a stable position (θ<sup>i</sup> = 0) and finishes at another position (**Figure 10**).

When motion starts from a stable position, the next state st+<sup>1</sup> does not considerably change if F <sup>T</sup> = 0. Therefore, the algorithm should not deal with large and rapid fluctuations, and the PTL performance is high. In contrast, in the second scenario, motion starts from an unstable position, and st+<sup>1</sup> exhibits a large difference when compared with s<sup>t</sup> in the initial period

FIGURE 8 | Execution signals trained using PTL with noises after 500 iterations. All excitation signals are filtered with a Butterworth lowpass filter to separate signals from execution noise.

even if F <sup>T</sup> = 0, as the gravitational torque contributes to a large angular acceleration. Consequently, learning is unstable.

The performance in the second scenario (**Figure 11**) confirms our prediction of large initial fluctuations. In fact, inappropriate initial parameters in musculotendon model will also degrade the performance. As inappropriate parameters lead to inappropriate muscle force, and some timesteps are necessary to adjust those parameters. In addition, the trajectory length is notably shorter than that in the first scenario, leading to a shorter trajectory for adjustment and learning. Consequently, errors increase in this scenario.

## 5. CONCLUSIONS

In this paper, we propose a human-inspired motion learning framework for a musculoskeletal system, called PTL. We analyze the learning process and equilibrium point of Qu<sup>i</sup> ,j ,

trajectory task is required a constant angular velocity. (Bottom) Allowed error during training, which can be considered as phased target of motion learning.

tendon forces caused by signal µ. (Middle) Actual joint angles during the motion. Remember that each trajectory task is required a constant angular velocity. (Bottom) Allowed error during training, which can be considered as phased target of motion learning. To evaluate performance, situation (E) is designed particularly to move from a unstable state to the stable position (θ1 = 0 *and* θ2 = 0).

determining that phased targets guide excitation signals toward expected values during learning. Two types of biological noise sources are considered in the PTL framework to increase the exploration ability in an expanded solution space, making the algorithm suitably follow the guidance of phased targets. Theoretically, as PTL is based on a human learning process, it can be expanded as a generalpurpose learning framework if we find appropriate ways to simplify different kinds of tasks, such as capture and pattern recognition tasks.

In future work, we will apply advanced methods in PTL to improve performance, especially when motion starts from an unstable position. Furthermore, better approaches for simplifying tasks and more biological mechanisms of motion control should be investigated to expand the application scope of the PTL framework.

### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

## AUTHOR CONTRIBUTIONS

JZ provided the main ideas of this research, wrote the manuscript and codes of experiments. JC and HD provided suggestions about PTL framework. HQ and other authors discussed and revised the manuscript.

## REFERENCES


### FUNDING

This work was supported in part by the National Key Research and Development Program of China (2017YFB1300200, 2017YFB1300203), the National Natural Science Foundation of China under Grants 91648205 and 61627808, the Strategic Priority Research Program of Chinese Academy of Science under Grant XDB32000000, and the development of science and technology of Guangdong Province special fund project under Grant 2016B090910001.


eds J. M. Winters and S. LY. Woo (New York, NY: Springer), 69–93.


Zixing, C. (2000). Robotics. Beijing: Tsinghua University Press.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhou, Chen, Deng and Qiao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Robust Event-Based Object Tracking Combining Correlation Filter and CNN Representation

Hongmin Li\* and Luping Shi\*

*Department of Precision Instrument, Center for Brain-Inspired Computing Research, Tsinghua University, Beijing, China*

Object tracking based on the event-based camera or dynamic vision sensor (DVS) remains a challenging task due to the noise events, rapid change of event-stream shape, chaos of complex background textures, and occlusion. To address the challenges, this paper presents a robust event-stream object tracking method based on correlation filter mechanism and convolutional neural network (CNN) representation. In the proposed method, rate coding is used to encode the event-stream object. Feature representations from hierarchical convolutional layers of a pre-trained CNN are used to represent the appearance of the rate encoded event-stream object. Results prove that the proposed method not only achieves good tracking performance in many complicated scenes with noise events, complex background textures, occlusion, and intersected trajectories, but also is robust to variable scale, variable pose, and non-rigid deformations. In addition, the correlation filter-based method has the advantage of high speed. The proposed approach will promote the potential applications of these event-based vision sensors in autonomous driving, robots and many other high-speed scenes.

#### Edited by:

*Guang Chen, Tongji University, China*

#### Reviewed by:

*Liu Xiang Yong, Tongji University, China Rui Yan, Sichuan University, China*

#### \*Correspondence:

*Hongmin Li lihongmin0110@gmail.com Luping Shi lpshi@mail.tsinghua.edu.cn*

Received: *25 March 2019* Accepted: *20 September 2019* Published: *10 October 2019*

#### Citation:

*Li H and Shi L (2019) Robust Event-Based Object Tracking Combining Correlation Filter and CNN Representation. Front. Neurorobot. 13:82. doi: 10.3389/fnbot.2019.00082* Keywords: event-based vision, object tracking, dynamic vision sensor, convolutional neural network, correlation filter

## INTRODUCTION

Different from the traditional frame-based imager, event-based camera, or dynamic vision sensor (DVS) converts the temporal contrast of the light intensity into spatiotemporal, sparse event streams (Lichtsteiner et al., 2008; Serrano-Gotarredona and Linares-Barranco, 2013; Brandli et al., 2014). DVS has the advantages of low information redundancy, high dynamic range, and high speed in visual sensing, and has the potential applications in the high-speed scenes. Recently, DVS has been used for estimating the high-speed optical flow and intensity field (Kim et al., 2008; Bardow et al., 2016). A visual processing system based on event camera demands low energy consumption. The outputted events are represented in the form of Address-Event Representation (AER) (Boahen, 2000). AER is often utilized to model the signal of neural systems, like the retina using discrete time events to convey information, and other spike coded neural systems in living organisms.

Visual tracking has a wide range of applications in the fields of autonomous driving, robot vision, trajectory analysis and so on. When an object is detected at a certain moment, it is often useful to track that object in subsequent recordings. Many works of object tracking based on the retina-inspired DVS sensors have been reported (Litzenberger et al., 2006; Conradt et al., 2009; Schraml and Belbachir, 2010; Drazen et al., 2011; Ni et al., 2012, 2015; Piatkowska et al., 2012; Zhenjiang et al., 2012; Saner et al., 2014; Zhu et al., 2017). However, event-based object tracking is challenging due to the significant appearance variations caused by the noise events, complex background textures, occlusion and randomness of event generating in each pixel circuit. Firstly, events stimulated by the contour and textures of the object are easy to be confused by events from the background objects. If the target event-stream object has a similar spatiotemporal shape with that of a background object, a tracker based on simple feature representation would be easily confused. Besides, the eventstream shape of the non-rigid object changes and deforms all the time, which demands a more discriminative feature representation. **Figure 1** shows some successive reconstructed frames from several event-stream recordings. The ground-truth position of the target object is shown with a bounding box in each segment. From the pictures, we can see that the appearance of the target event-stream object changes obviously even between two adjoining segments, which demands a robust tracker for tracking the rapid changed appearance.

This paper presents a robust event-stream pattern tracking method based on correlation filter (CF) mechanism. Hierarchical convolutional layers of a convolutional neural network (CNN) are used to extract the feature representation from rate coding frame of event streams. The performance of the proposed method is evaluated on the DVS recordings of several complicated visual scenes. Among the recordings, three are captured by a DVS128 sensor (Lichtsteiner et al., 2008) by ourselves and the rest are from an event-stream tracking dataset (Hu et al., 2016) captured by a DAVIS sensor (Brandli et al., 2014). The results prove that the proposed method can successfully track the objects in some visual scenes with noise events, complicated background textures, occlusion, and intersected trajectories. This is because we use features from multiple CNN layers to represent the appearance of the target object which combines semantics that are robust to significant appearance variations and spatial details that are effective for precise localization.

#### Related Works

Object tracking methods based on event cameras can be classified into two categories. The first category is the eventdriven mechanism in which each incoming event is processed and determined whether it belongs to the target object. In Litzenberger et al. (2006) implemented a continuous clustering of AER events and tracking of clusters. Each new event was assigned to a cluster based on a distance criterion and then the

FIGURE 1 | Some example reconstructed frames from several event-stream recordings (From top to bottom are the "Horse toy" from DVS128 sensor, "Vid\_B\_cup," "Vid\_J\_person\_floor," and "Vid\_E\_person\_part\_occluded" from DAVIS sensor). The appearance of the target object (in bounding box with green line) changes obviously for the noise events, background texture, and randomness in event-generating of the pixels.

clusters weight and center position was updated. In addition, point cloud method is also introduced to model the event-stream object. In Ni et al. (2015) proposed an iterative closest point based tracking method by providing a continuous and iterative estimation of the geometric transformation between the model and the events of the tracked object. In Ni et al. (2012), the authors applied the iterative closest point tracking algorithm to track a microgripper position in an event-based microrobotic system. One disadvantage of these kinds of methods is that noise events occur will cause the tracker to make a wrong inference. Adding noise event filtering modules to the tracking system will unavoidably filter many informative events while increase the computational complexity of the system. In addition, although these event-based sensors are based on the event-driven nature, it is still a difficult task to recognize an object from each single event. The second category is based on feature representation of the target object. In Zhu et al. (2017), the authors proposed a soft data association modeled with probabilities relying on grouping events into a model and computing optical flow after assigning events to the model. In Lagorce et al. (2014), proposed an event-based multi-kernel algorithm, and various kernels, such as Gaussian, Gabor, and arbitrary user-defined kernels were used to handle the variations in position, scale and orientation. In Li et al. (2015), the authors prosed a compressive sensing based method for the robust tracking based on the event camera. The representation or appearance model of event-stream object is based on features extracted from the multi-scale space with a data-independent basis and employs non-adaptive random projections that preserve the structure of the feature space of objects.

The core of most modern trackers is a discriminative classifier to distinguish the target from the surrounding environment. In computer vision, CF based methods has enjoyed great popularity due to the high computational efficiency with the use of fast Fourier transforms. In Bolme et al. (2010) learned a correlation filter over luminance channel the first time for real-time visual tracking, named MOSSE tracker. In Henriques et al. (2012, 2015), a kernelized correlation filter (KCF) is introduced to allow non-linear classification boundaries. Nowadays, features from convolutional neural network (CNN) are used to encode the object appearance and achieved good performance (Danelljan et al., 2015; Ma et al., 2015). In Danelljan et al. (2015), the authors proposed a method combining activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks, achieving a superior performance by using convolutional features compared to standard handcrafted feature representations. They also show that activations from the first layer provides superior tracking performance compared to the deeper layers of the network. In Ma et al. (2015), they exploit the hierarchies of convolutional layers as a nonlinear counterpart of an image pyramid representation and these multiple levels of abstraction to improve tracking accuracy and robustness. They demonstrate that representation by multiple layers of CNN is of great importance as semantics are robust to significant appearance variations and spatial details are effective for precise localization. Although feature-based methods show robustness and real-time capability, the most serious defect is that such algorithms should accumulate the events in a time window and then perform feature extraction. Then the length of the time window may be different under different scenes.

## METHODOLOGY

## Temporal Contrast Pixel

The pixel of the DVS sensor is a type of temporal contrast pixel which only responds to the temporal contrast of the light intensity in the scene and generates a temporal event whenever the brightness change exceeds a pre-defined threshold. Each event is a quadruple (x, y, t, p), where (x, y) denotes the positions of the pixel, t denotes the time when the event is generated, the polarity p = 1 denotes the increasing brightness and p = −1 denotes the decreasing brightness. This temporal contrast pixel has the advantage of high dynamic range because it needs not to respond to the absolute light intensity. The time stamp of each event has the temporal resolution of microsecond. Then the DVS sensor is suitable to capture the dynamic scenes with high-speed changes.

In this work, we use event-stream recordings from both DVS128 and DAVIS sensors to evaluate the performance of the proposed method. Both sensors are based on the same eventgenerating mechanism. As the name of the sensor shows, DVS128 has the spatial resolution of 128 × 128. DAVIS is a new retinainspired, event-based vision sensor with the spatial resolution of 240 × 180.

Although these kinds of sensors are based on an event-driven nature, it remains a difficult task to recognize an object from a single event. Many works have accumulated the event stream into multiple segments on which to extract feature for information processing, such as the event-stream object display in jAER opensource tool. There exist two accumulating methods, i.e., hard events segmentation (HES) and soft events segmentation (SES). HES divides the event flow into segments using fixed time slices or fixed number of events. Different from HES, SES adaptively obtains the segments according to the statistical characteristics of the events based on an event responding neuron, such as the leaky-integration-firing neuron.

For comparison, we segment the event streams into the same number of segments with the items of the groundtruth. Rate coding is utilized to encode the visual information of the eventstream object. Intuitively, each pixel value is represented as the number of events generated by this pixel within the segment. Event rate of the temporal contrast pixel can be represented as follows,

$$\text{Rate(t)} \approx \frac{\text{TCON(t)}}{\theta} = \frac{1}{\theta} \frac{d \ln(I)}{dt} \tag{1}$$

where TCON represents the temporal contrast, and I(t) is the photocurrent. Within a time window, the physical meaning of the number of events of a pixel represents the frequency of which the temporal intensity change exceeds the threshold. In rate coding, the serious temporal noise of the events is suppressed by integrating the events of each pixel in the segment.

#### Correlation Filter Framework

Generally, a CF tracker learns a discriminative classifier and finds the maximum value of the correlation response map as the estimation of the position of the target object. The resulting classifier is a 2-dimensional correlation filter which is applied to the feature representation. Circular correlation is utilized in CF framework for efficiently train. Multi-channel feature maps from multiple layers of a deep CNN are used as representation of rate-coding event-stream object. Feature maps of the l-th layer is denoted as x <sup>l</sup> with the size of H × W × C, where H, W and C denote the width, height, and channel number, respectively. The correlation filter f<sup>t</sup> has the same size with the feature maps in the current event frame t. In CF framework, the correlation filter is trained by solving a linear least-squares problem as follows,

$$\mathcal{W} = \underset{f\_l}{\text{arg min}} \sum\_{h, \mathbf{w}} \left\| f\_l \cdot \boldsymbol{\varkappa}\_{h, \mathbf{w}}^l - \boldsymbol{\jmath}\_{h, \mathbf{w}} \right\|^2 + \lambda \left\| f\_l \right\|\_2^2 \tag{2}$$

where x l h,w demotes the shifted sample. hε{0,1,2,. . . ,H-1}, wε{0, 1, 2, . . . , W-1}. yh,<sup>w</sup> is the Gaussian function label, and where λ is a regularization parameter (λ > 0). yh,<sup>w</sup> = exp − (h−H/2) <sup>2</sup>+(w−W/2) 2 2σ 2 , where σ is the kernel width.

The minimization problem in (2) can be solved in each individual feature channel using fast Fourier transformation (FFT). We use the capital letters as the corresponding Fourier transform of the signal. The learned correlation filter in frequency domain on the c-th, cε{1, 2, . . . , C} channel is as follows,

$$F^\epsilon = \frac{Y \odot \overline{X}^\epsilon}{\sum\_{i=1}^C X^i \odot \overline{X}^i + \lambda} \tag{3}$$

where the operation ⊙ denotes the element-wise product, Y is the Fourier transformation form of y={yh,w | hε{0, 1, 2, . . . , H-1}, wε{0, 1, 2, . . . , W-1}}, and the bar means complex conjugation. z<sup>l</sup> with size of H × W × C represents the feature maps of the l-th layer of the neural network. The l-th correlation response map of size H × W can be calculated as follows,

$$r^l = \xi^{-1}(\sum\_{c=1}^{C} F^c \odot \overline{Z}^c) \tag{4}$$

where the operation ξ <sup>−</sup><sup>1</sup> denotes the inverse FFT transform. The position of maximum value of the correlation response map r l is used as the estimation of the target location on the l-th convolutional layer.

#### Representation Based on Convolutional Neural Network

In this paper, hierarchical convolutional feature representation is used for encoding the appearance of the event-stream object. An imagenet-pretrained 16-layer classification network (VGG-Net-16)<sup>1</sup> implemented based on the MatConvNet library is

<sup>1</sup>The model can be download from: https://pan.baidu.com/s/ 1QWl6zw4DpMSCDg2X-Mb7mAA

used in our method for feature extraction. **Figure 2** shows the network architecture of the VGG-Net-16. Hierarchical feature representation are obtained with the CNN forward propagation. Event streams in a short duration are integrated into a rate coding map which is taken an the input of CNN. As the input of the original VGGNet are three-channel, we set each channel of the input layer equal to the rate coding map. As the pooling operation would reduce the spatial resolution with increasing depth of convolutional layers, we first remove the layers higher than the conv3\_3 layer and the output of conv1\_1, conv2\_2, and conv3\_3 are taken as the feature. Multiple convolutional layers are combined to encode the changed appearance of the eventstream object. **Table 2** shows the spatial size and channels of the feature maps of the input layer, and three different convolutional layers. Instead of resizing the size of the input rate coding maps to equal the size of the input layer of CNN (i.e., 224 × 224) in Ma et al. (2015), we use the resulted model parameters of each layer to perform convolutional operation on the original input. In this work, we test some hierarchical composition of different convolutional layers for feature representation and found that the representation of combination of multiple layers of conv1\_1, conv2\_2, and conv3\_3 could achieved a satisfactory tracking performance.

## EXPERIMENTAL RESULTS AND DISCUSSION

In this section, a series of experiments on several event-stream recordings are presented. The metric used in this paper is the center location error. We labeled the ground-truth position of tracked object manually. Section 4.1 introduces the eventstream tracking dataset on which we perform the tracking experiments. In section 4.2, we evaluate the influence of two hyper-parameters. Section 4.3 presents the tracking speed over different convolutional layers of CNN. In section 4.4, we compare the performance of the proposed method with some other eventstream tracking methods.


#### TABLE 1 | Challenges of each recording.

*If the recording (in the row) has this challenge (in the column), then the corresponding value is set to 1, otherwise 0.*

#### Event-Stream Recordings

Eight event-stream recordings with labeled ground truth data are used to evaluate the proposed method. Among them, three recordings were captured by a DVS128 sensor by ourselves, and the rest are from an event-stream tracking dataset captured by a DAVIS sensor.

A. Three DVS128 recordings<sup>2</sup> Three event-stream recordings of three different scenes were captured with a DVS128 device. These recordings have the same spatial resolution of 128 × 128. We divided each recording into multiple segments and in each segment, we label the position and the size of the target object with bounding boxes. The first recording ("Digit3") is a scene containing several digits. The task is to track the digit3 in the event streams. This recording has a time range of about 38.8 s which is divided into 767 segments. The second recording ("Horse toy") is a moving human with a horse toy in his hand. The task is to track the horse in the event streams. This recording has a time duration of about 17.3 s and is divided into 347 segments. The appearance of the toy changed quickly with much rotation and deformation in the recording. The third recording ("Human face") has a duration of about 17.1 s and is divided in 343 segments. This task is to track the face of a human. The human face moved quickly with rotation and deformation and the appearance of the event stream changed all the time, which make the task difficult.

B. DAVIS recordings<sup>3</sup> Yuhuang Hu et al. (2016) proposed an event-stream tracking dataset with a DAViS240C camera which has a spatial resolution of 240 × 180. In some tracking sequences of this tracking dataset, the target objects are still, or cannot be distinguished from the background. We chose five recordings from this dataset and re-labeled the bounding boxes of the target objects because the provided bounding boxes in the dataset seem to have a little shift compared to the accurate positions of the target objects. These chosen five recordings are "Sylvestr," "Vid\_B\_cup," "Vid\_C\_juice," "Vid\_E\_person\_part\_occluded," and "Vid\_J\_person\_floor," and are divided into 1344, 629, 404, 305, and 388 segments, respectively.

C. Challenges in each recording. We list the challenges in each recording as show in **Table 1**. Seven challenges are taken into consider, including the noise events, complicated background, occlusion, intersected trajectories, deformation, scale variation and pose variation. One or several challenges are contained in each recording.

#### Robustness to Hyperparameters

The proposed event-stream pattern tracking method requires the specification of two hyparameters, i.e., the event number in each segment and the convolutional layers for feature representation. To investigate the influence of these two hyperparameters, we performed a series of experiments on several eventstream recordings.

A. Event number in each segment. We investigate the influence of the number of events in each segment on several recordings, including the "Horse toy" from DVS128 sensor and the "Sylvestr," "Vid\_J\_person\_floor" from DAVIS sensor. **Figure 3** shows the tracking trajectory of the proposed tracker under different event number in each segment. The change of the x position and y position of the center of the tracker along time are shown. As the number of segments would be different from that of the groundtruth with a different event number in each segment, it would be impossible to compare the location of the trackers when both have different number of segments. Then we use the index of the events to represent the time coordinate. We plot the curve of the x position and y position of the tracker over the index of events. Results show that the proposed method is robust to the number of events in each segment by comparing the degree of proximity of the trajectory of the tracker and the groundtruth. The effectiveness of the proposed method is owing to the discriminative feature representation transferred from the multiple layers of a pre-trained CNN on the computer vision task. The proposed method has many potential applications in many the high-speed scenes with less events in each time window.

B. Feature map from different layers in the CNN. We investigate the influence of different convolutional layers of VGG-Net-16 model on the tracking performance. Feature representations with four kinds of combination of convolutional layers, i.e., the Conv1\_1 convolutional layer, the Conv2\_2

<sup>2</sup>The DVS128 recordings and the groundtruth, named "DVS recordings and groundtruth.zip" can be download from: https://figshare.com/s/ 70565903453eef7c3965

<sup>3</sup>The DAVIS recordings and the groundtruth, named "DAVIS recordings and groundtruth.zip" can be download from: https://figshare.com/s/ 70565903453eef7c3965

layer, the Conv3\_3 layer, and the composition of the three convolutional layers were evaluated. **Table 2** shows the spatial size and dimensionality of the feature maps of the input layer, and three different convolutional layers. All the DAVIS recordings ("Sylvestr," "Vid\_B\_cup," "Vid\_C\_juice," "Vid\_E\_person\_part\_occluded," and "Vid\_J\_person\_floor") are used in this test. For the "Vid\_B\_cup" scene, the tracking would fail when a single convolutional layer or a composition of two layers (Conv1\_1 or Conv2\_2 or Conv3\_3) is used. In "Vid\_B\_cup," the target cup is moved over a complicated background, then the events from the target object are very easy to be mixed up with the events from background. **Figure 4** shows the metric results on the rest four event-stream recordings with different convolutional layers. Feature representation from the higher convolutional layers results in better tracking performance. For the more complicated scenes with complex background, such as the "Vid\_B\_cup," combination of hierarchical feature representation from multiple convolutional layers is required for effective object tracking. This is because the feature representation from multiple convolutional layers combines the low-level texture features and high-level semantic features and can handle the rapid change of the appearance of the target object. While for the relatively simple scenes with less noise events and simple background textures, less and lower convolutional layers result in effective tracking with a higher speed.

#### Tracking Speed Under Different Layers

In this section, we investigate the tracking speed of the proposed method on several recordings. The speed is measured over different convolutional layers of the employed CNN. **Table 3** shows the results of tracking speed on five DAVIS recordings. We measure the tracking speed using the unit of segments per second which is similar to the frame per second in the computer vision tracking. In the experiments, a PC machine with Intel(R) Core(TM)i5-7300HQ CPU @ 2.5 GHz is used. We did not present the measurement result on some convolutional layers in some event-stream recordings



*Input layer denotes the input rate coding map, after the necessary preprocessing steps.*

(such as the Conv1\_1, and Conv2\_2 in the "Vid\_C\_ juice" scene) which have failed in the tracking of the corresponding target object.

Intuitively, high-level representation requires more computational operations, which leads to the decrease of the tracking speed. In the proposed method, the precision and the speed are a tradeoff. In the simple scenes, such as monitoring an object with a fixed sensor, low-level convolutional layers are enough for effectively tracking with high speed. In other complicated scenes such as complicated background textures or moving DVS sensor, high-level convolutional layers are demanded for accurate tracking, which limits the tracking speed. In fact, due to the high computational efficiency in the frequency

TABLE 3 | Tracking speed under different convolutional layers of the employed network.


*The tracking speed is measured by the segments per second. We measure the tracking speed on the DAVIS recordings.*

domain of the CF mechanism, the proposed method achieved relatively high tracking speed even using high-level feature map.

## Comparison With Other Methods

In this section, we compare the performance of the proposed method to several other event-stream pattern tracking methods. We perform the experiments on all the eight event-stream recordings. In the first place, we introduced the four other tracking methods as follows:

#### A. Three Tracking Methods in jAER Software

These three methods are based on the three event filters in jAER source available within the jAER sourceforge repository, including "Rectangular Cluster Tracker," "Einstein Tracker," and "Median Tracker," respectively. Since the three methods have failed in the more complicated scenes, we only provide the tracking results of the three methods on the three simple scenes captured by DVS128 sensor.

#### B. Compressive Tracking-Based Rate Coding Feature

This is a feature-based event-stream tracking method based on compressive sensing and has achieved good performance on some simple scenes captured by DVS128. The compressive tracking learns a classifier on the compressive coding of multi-scale haar-like feature extracted on the rate coding map. We provide the results of this method on all the event-stream recordings for comparison with the proposed method.

**Figure 5** shows the comparison results of the proposed method with all the four methods on the DVS128 recordings. Center location error is used to measure the tracking performance. Results show that the proposed tracking method achieves the best performance. Three tracking methods in jAER software show poor tracking performance. The eventdriven jAER methods often failed to assign each event to the accurate position or cluster of the target object. Compared to the compressive tracking method, the proposed method achieved better performance owning to the more discriminative CNN feature presentation than the haar-like feature in the compressive tracking.

On the five DAVIS recordings, we only compared the tracking performance of the proposed method with the compressive tracking method because the three tracking methods in jAER software fail to track the target object in the more complicated scenes. **Figure 6** shows the tracking location precision of the two methods. The proposed method achieved better performance on the five DAVIS recordings. Especially in the "Vid\_B\_cup" scene where many objects in the background have the same shape with the target object "cup," the proposed tracker can track the target object correctly. Simple hand-crafted feature representation cannot handle many complicated scenes with noise events, complicated background textures, rapid changed appearance, and occlusion. By combining the low-level texture features and high-level semantic features, the feature representation from multiple convolutional layers can handle the complicated scenes very well. Results demonstrate that the proposed method is

robust to many challenging visual scenes with better tracking performance than other methods.

**Figure 7** shows some tracking examples by integrating the events into reconstructed frames. In the tracking process, we did not change the scale of the tracker. The proposed tracker tracked the target object with high location precision while the compressive sensing-based tracker drifts very easily. Even with complicated background and occlusion, our tracker achieved very high tracking precision.

## CONCLUSION

In this work, we proposed a robust event-stream object tracking method based on the CF tracking mechanism. Our method overcomes some challenges in the event-stream tracking, such as the noise events, chaos of the complex background texture, occlusion, and randomness of event generating in the pixel circuit. Rate coding is used to encode the visual information of the target event-stream object. Correlation response map is computed on the feature representation from the hierarchical convolutional layers of a pre-trained deep CNN.

The proposed method shows good performance in many complicated visual scenes. Compared with other feature-based methods, our method is more robust in many visual scenes with noise and complicated background textures. Compared with other event-driven method, the proposed method has real-time advantage on event streams with large number of events. To utilize each single event for tracking and updating the appearance of the target object without segment reconstruction is a more interesting and challenging task which has lied in the heart of current event-based vision research and will be explored in the future. In event-driven tracking tasks, how to suppress the noise events is an import step for correct tracking as the noise events will lead the tracker to make wrong estimation.

## DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found here: https://figshare. com/s/70565903453eef7c3965; https://figshare.com/s/ 70565903453eef7c3965.

red, and blue color represent the groundtruth, our proposed tracker and the compressive sensing based tracker, respectively.

## AUTHOR CONTRIBUTIONS

HL proposes the algorithm and design the experiment setup. Besides, data analysis and classification metric are also done by HL. LS supports the research of the neuromorphic vision and contributes the principle of event-based object tracking.

## FUNDING

The work was partially supported by the Project of NSFC (No. 61327902), Beijing program on study of functional chip and related core technologies of ten million class of brain inspired computing system (Z151100000915071), and Study of Brain-Inspired Computing System of Tsinghua University program (20141080934, 20151080467).

## REFERENCES


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Investigation of Vehicle Behavior Prediction Using a Vector Power Representation to Encode Spatial Positions of Multiple Objects and Neural Networks

#### Florian Mirus 1,2 \*, Peter Blouw<sup>3</sup> , Terrence C. Stewart <sup>3</sup> and Jörg Conradt <sup>4</sup>

*<sup>1</sup> BMW Group, Research, New Technologies, Garching, Germany, <sup>2</sup> Department of Electrical and Computer Engineering, Technical University of Munich, Munich, Germany, <sup>3</sup> Applied Brain Research Inc., Waterloo, ON, Canada, <sup>4</sup> Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden*

#### Edited by:

*Pascual Campoy, Polytechnic University of Madrid, Spain*

#### Reviewed by:

*Subramanian Ramamoorthy, University of Edinburgh, United Kingdom Michael Beyeler, University of Washington, United States*

\*Correspondence: *Florian Mirus florian.mirus@bmwgroup.com*

Received: *15 January 2019* Accepted: *26 September 2019* Published: *16 October 2019*

#### Citation:

*Mirus F, Blouw P, Stewart TC and Conradt J (2019) An Investigation of Vehicle Behavior Prediction Using a Vector Power Representation to Encode Spatial Positions of Multiple Objects and Neural Networks. Front. Neurorobot. 13:84. doi: 10.3389/fnbot.2019.00084*

Predicting future behavior and positions of other traffic participants from observations is a key problem that needs to be solved by human drivers and automated vehicles alike to safely navigate their environment and to reach their desired goal. In this paper, we expand on previous work on an automotive environment model based on vector symbolic architectures (VSAs). We investigate a vector-representation to encapsulate spatial information of multiple objects based on a convolutive power encoding. Assuming that future positions of vehicles are influenced not only by their own past positions and dynamics (e.g., velocity and acceleration) but also by the behavior of the other traffic participants in the vehicle's surroundings, our motivation is 3-fold: we hypothesize that our structured vector-representation will be able to capture these relations and mutual influence between multiple traffic participants. Furthermore, the dimension of the encoding vectors remains fixed while being independent of the number of other vehicles encoded in addition to the target vehicle. Finally, a VSA-based encoding allows us to combine symbol-like processing with the advantages of neural network learning. In this work, we use our vector representation as input for a long short-term memory (LSTM) network for sequence to sequence prediction of vehicle positions. In an extensive evaluation, we compare this approach to other LSTM-based benchmark systems using alternative data encoding schemes, simple feed-forward neural networks as well as a simple linear prediction model for reference. We analyze advantages and drawbacks of the presented methods and identify specific driving situations where our approach performs best. We use characteristics specifying such situations as a foundation for an online-learning mixture-of-experts prototype, which chooses at run time between several available predictors depending on the current driving situation to achieve the best possible forecast.

Keywords: vehicle prediction, long short-term memories, artificial neural networks, vector symbolic architectures, online learning, spiking neural networks

## 1. INTRODUCTION

The race to autonomous driving is currently one of the main forces for pushing research forward in the automotive domain. With highly automated vehicle prototypes gradually making their way to our public roads and fully-automated driving on the horizon, it seems to be a matter of time until we see robot taxis or cars navigating us through urban traffic or heavy stop-and-go on highways. One major reason for this development in recent years is the rapid progress of artificial intelligence (AI), especially the success of deep learning, which has shown remarkable results in tasks essential for automated driving like object detection, classification (Ciresan et al., 2012) and control (Bojarski et al., 2016).

Predicting future behavior and positions of other traffic participants from observations is essential for collision avoidance and thus safe motion planning, and needs to be solved by human drivers and automated vehicles alike to reach their desired goal. However, future positions of vehicles not only depend on each vehicle's own past positions and dynamics like velocity and acceleration, but also on the behavior of the other traffic participants in the vehicle's surroundings. Motion prediction for intelligent vehicles in general has seen extensive research in recent years (Polychronopoulos et al., 2007; Lawitzky et al., 2013; Lefèvre et al., 2014; Schmüdderich et al., 2015) as it is a cornerstone for collision-free automated driving. Lefèvre et al. (2014) classify such prediction approaches into three categories, namely physics-based, maneuver-based, and interaction-aware, depending on their level of abstraction. Physics-based and maneuver-based motion models consider the law of physics and the intended driving maneuver, respectively as the only influencing factors for future vehicle motion and ignore interdependencies between the motion of different vehicles. There exist a growing number of different interaction-aware approaches to account for those dependencies and mutual influences between traffic participants or, more generally, agents in the scene. Probabilistic models like costmaps (Bahram et al., 2016) account for physical constraints on the movements of the other vehicles. Classification approaches categorize and represent scenes in a hierarchy (Bonnin et al., 2012) based on the most generic ones to predict behavior for a variety of different situations.

Data-driven approaches to behavior prediction mainly rely on long short-term memory (LSTM) neural network architectures (Hochreiter and Schmidhuber, 1997), which have proven to be a powerful tool for sequential data analysis. Alahi et al. (2016) model interactions in the learning network architecture by introducing so-called social-pooling layers to connect several LSTM each predicting the distribution of the trajectory position of one agent at a time. Deo and Trivedi (2018a) adapted the combination of LSTM networks for encoding vehicle trajectories and (convolutional) social-pooling layers to account for interactions to vehicle prediction in highway situations. Altche and de La Fortelle (2017) use a LSTM network as well, but they account for interaction by including distances between the target vehicle and other agents directly in the training data rather than adapting the network architecture. A similar approach is proposed by Deo and Trivedi (2018b), but they combine LSTM networks with an additional maneuver classification network to predict future vehicle motion. One issue in data-driven approaches to behavior prediction accounting for relations between agents is that the number of other vehicles is variable. If information about vehicles other than the target are encapsulated in the input of the neural network, typically a specific number of other vehicles of interest are manually chosen a priori to avoid this issue (Altche and de La Fortelle, 2017; Deo and Trivedi, 2018b). If the information about other vehicles is encapsulated in the network architecture, it might be necessary to train several networks depending on the situation at hand.

In this paper, we expand our previous work (Mirus et al., 2018) on an automotive environment model based on VSAs (Gayler, 2003). In contrast to the representation shown in Mirus et al. (2018), this paper introduces a more sophisticated way of encapsulating spatial information of multiple objects in semantic vectors of fixed length. Therefore, we employ generalized exponentiation of high-dimensional vectors, referred to as the convolutive power, based on the VSA's binding operation, which in our case is circular convolution. We hypothesize that structured vector representations will be able to capture relations and mutual influence between traffic participants. For instance, in a situation as shown in **Figures 1A–C**, the behavior of the target vehicle, as depicted by a solid blue and dotted orange line for past and future positions, respectively, is influenced by the surrounding vehicles, as illustrated by solid and dotted gray lines for past and future positions, respectively. The target vehicle is approached from behind by a faster vehicle causing it to eventually change its lane to the right, which, for now, is still occupied by the ego-vehicle and another vehicle. All of these external influences have an impact on the target vehicle's motion (and vice versa). As we aim for a model-free data representation, we avoid introducing any physical constraints or restrictions regarding our data-representation or the predicting models.

In this work, we consider our main contributions to be the following: we present a representation of spatial information for multiple objects in semantic vectors of fixed length using the convolutive power. We use this representation as input for simple feed-forward neural networks and more sophisticated LSTMbased models and compare them against each other and a linear predictor as the simplest baseline. We conduct a thorough and detailed analysis for all of these models and show that by using our vector representation with a simple network architecture we can achieve results comparable to more complex models. This is particularly interesting for mobile applications, such as automated driving: combining our vector representation, which allows to encode spatial positions of several objects as well as efficient implementation in spiking neural networks (SNNs) (Eliasmith, 2013), with a simple feed-forward SNN would allow future deployment on dedicated, energy-efficient neuromorphic hardware. In case the performance of the simpler feed-forward networks is close enough to the more sophisticated ones, the possibility of efficient deployment could be an advantage over LSTM networks, which are by design harder to apply to dedicated computing hardware (Chang and Culurciello, 2017). Finally, we present a prototype of a mixture-of-experts online learning system, that chooses at run-time between several models, which

have been pre-trained offline, to achieve the best possible forecast. We show, that this online learning approach is able to improve its performance compared to the individual prediction models.

## 2. MATERIALS AND METHODS

## 2.1. Vector Symbolic Architectures

The term vector symbolic architectures (VSAs)—first coined by Gayler (2003)—refers to a family of approaches for cognitive modeling making use of distributed representations. The basic idea behind all of those approaches is to represent structure (e.g., cognitive concepts, symbols, or language) in a highdimensional vector space by mapping each entity to be represented to a (possibly random) vector. One strength of VSAs is that they offer the possibility to manipulate their entities through algebraic operations, typically one addition-like and multiplication-like operation each. Vectors, which represent basic concepts not intended to be further decomposable and thus are not constructed from other vectors using the VSA's algebraic operations, are called atomic vectors.

#### 2.1.1. Prerequisites

In this paper, we adopt the semantic pointer architecture (SPA) (Eliasmith, 2013), a variant of holographic reduced representations (HRRs) originally introduced by Plate (1994), to encode automotive scenes in high-dimensional vectors (note: the source-code for all models presented in this paper can be found at https://github.com/fmirus/spatrajectoryprediction). Thus, atomic vectors are picked from the real-valued unit sphere and the dot product serves as a measure of similarity. We call two vectors similar, if their dot-product is higher than a certain similarity threshold. The distribution of the dot-product of two randomly chosen unit vectors has a mean of 0 and a standard deviation of √ 1 D (Widdows and Cohen, 2014). Thus, the similarity threshold is typically chosen as <sup>√</sup><sup>c</sup> D for some constant c, which is a similarity value that we would expect from two randomly chosen vectors and only depends on the dimension D of the vector space. Furthermore, the algebraic operations are component-wise vector addition ⊕ and circular convolution ⊛, which is defined as

$$\mathbf{z} = \mathbf{v} \circledast \mathbf{w} \qquad \text{with } z\_j := \sum\_{k=0}^{D-1} \nu\_k \boldsymbol{w}\_{(j-k)\text{ mod }D} \tag{1}$$

for any two D-dimensional vectors **v**,**w**. One important property of this operation is the fact (Bracewell, 2000, Chapter 6), that circular convolution can efficiently be computed using the discrete Fourier transform:

$$\mathbf{v} \circledast \mathbf{w} = IDFT\left(DFT(\mathbf{v}) \odot DFT(\mathbf{w})\right),\tag{2}$$

where ⊙ denotes element-wise multiplication, DFT and IDFT denote the discrete Fourier transform and inverse discrete Fourier transform, respectively. The neutral element regarding circular convolution is 1 = (1, 0, · · · , 0). Furthermore, for any vector **v**, the vector **v**¯ = (v0, vD−1, . . . , v1) is a pseudo-inverse element with respect to circular convolution, meaning that the vector derived from convolving them is similar to the neural element, i.e., **v** ⊛ **v**¯ ≈ **1**. Although we can also find an exact inverse element **v** −1 for most vectors with **v** ⊛ **v** <sup>−</sup><sup>1</sup> = **1**, it is often more useful to work with pseudo-inverses instead of exact inverse elements, as they can become unstable in certain situations. However, we call the special class of vectors for which the pseudoand exact inverse element coincide unitary vectors, i.e., **v** <sup>−</sup><sup>1</sup> = ¯**v**.

Using Equation (2), we define the convolutive power of a vector v by an exponent p ∈ R as

$$\mathbf{v}^{\mathfrak{p}} := \mathfrak{R}\left(IDFT\left(\left(DFT\_0\left(\mathbf{v}\right)^{\mathfrak{p}}\right), \dots, \left(DFT\_{D-1}\left(\mathbf{v}\right)^{\mathfrak{p}}\right)\right)\right), \tag{3}$$

where ℜ denotes the real part of a complex number and DFT<sup>i</sup> (**v**) denotes the ith component of the vector DFT (**v**). Denoting the set of unitary vectors by U, we state three essential properties


#### 2.1.2. Convolutive-Power Representation

In this paper, we adopt and improve the vector representation for automotive scenes introduced in earlier work (Mirus et al., 2018). Here, we introduce the convolutive vector-power shown in Equation (3) for encoding spatial positions of multiple vehicles and focus on investigating its expressive power. To create a vocabulary V of atomic vectors, we assign a random real-valued vector from the unit sphere to each category of dynamic objects (e.g., car, motorcycle, truck) as well as random unitary vectors **X** and **Y** to encode spatial positions. We use unitary vectors for **X** and **Y** as they have unit length and are closed under convolutive exponentiation. Therefore, by encoding spatial positions with powers of unitary vectors, we avoid exploding lengths of our final scene vectors, which would lead to additional noise and unwanted behavior when using them as input for neural networks. Furthermore, we use additional random ID-vectors **TARGET** and **EGO** representing the target object to be predicted and, if applicable, the ego-vehicle, respectively.

Given a situation as shown in **Figure 1A** with a sequence of prior positions (x<sup>t</sup> , yt) for the target vehicle at time step t ∈ {t0, . . . , tN} and equivalent sequences (xobj,<sup>t</sup> , yobj,<sup>t</sup> ) for other traffic participants, we encapsulate this positional information in a scene vector

$$\mathbf{S}\_{l} = \underbrace{\mathbf{TARGET} \otimes \mathbf{TYPE}\_{\text{target}} \otimes \mathbf{X}^{\text{x}\_{l}} \otimes \mathbf{Y}^{\text{y}\_{l}}}\_{\text{target-vehicle}}$$

$$\oplus \underbrace{\sum\_{obj} \mathbf{TYPE}\_{obj} \otimes \mathbf{X}^{\text{x}\_{obj,t}} \otimes \mathbf{Y}^{\text{y}\_{obj,t}}}\_{\text{other objects}}\tag{4}$$

for each time step t. This yields a sequence of semantic scene vectors **S**<sup>t</sup> for t ∈ {t0, . . . , tN} encoding the past spatial development of objects in the current driving situation. **Figure 2** depicts the aforementioned scene vector representation: the left plots show similarities (depicted as heat map) between the vector **S**<sup>t</sup> encoding the scene from **Figure 1A** and the vectors v<sup>i</sup> = **TARGET** ⊛ **TYPE**target ⊛ **X** <sup>x</sup>¯<sup>i</sup> ⊛ **Y** y¯i for a sequence of discrete position samples x¯<sup>i</sup> , y¯i . Similarly, the right plots show similarities between **S**<sup>t</sup> and **CAR** ⊛ **X** <sup>x</sup>¯<sup>i</sup> ⊛ **Y** <sup>y</sup>¯<sup>i</sup> visualizing all other objects in the scene of type car. We observe clear peaks (bright yellow areas) of higher similarities at the true positions of the encoded objects depending on their type (e.g., car or truck) or if the object is the target object of interest. Hence, we can encode spatial information of several different objects in a sequence of semantic vectors and reliably decode it back out. This allows us to encode automotive scenes with varying number of dynamic objects in a vector representation of fixed dimensionality.

#### 2.2. Models 2.2.1. LSTM Networks

In this work, we use a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) network-architecture for the prediction of vehicle positions. Our network consists of one LSTM encoder and decoder cell for sequence to sequence prediction, which means that the input and the final result of our model is sequential data. The encoder LSTM takes positional data for 20 past, equidistant time frames as input. That is, the input data is a sequence of 20 items of either positions of the target vehicle or a sequence of high-dimensional vectors encoding this positional data (see sections 1, 2.3.4 for further details). Thus, the resulting embedding vector encodes the history of the input data over those 20 time frames. This embedding vector is concatenated with additional auxiliary information to aid the model when predicting the future trajectory of the target vehicle. This auxiliary data is information, that is available to the system when the prediction is to happen, i.e., sensory data available at prediction time or future data about the egovehicle, such as its own planned trajectory (see section 3.1.1 for further details on this auxiliary data). Finally, the embedding vector is used as input for the decoder LSTM to predict future vehicle positions. The output of each model is a sequence of 20 positions of the target vehicle predicted over a certain temporal horizon into the future. We use the same network architecture for all encoding schemes of the input data and for both data sets. However, the dimensionality of the input varies over the different encoding schemes while the auxiliary information used to enrich the embedding vector is different depending on the data set (since only one data set is recorded from a driving ego-vehicle). We describe these implementation choices in more detail in section 3.1.1. **Figure 3** visualizes the architecture of our LSTM models indicating modules that change when varying the encoding scheme by a dashed red border whereas parts that change with the data set are highlighted through a dashed blue border.

#### 2.2.2. NEF Networks

As an alternative to the LSTM-models, we also considered a much simpler single-hidden-layer network defined using the

neural engineering framework (NEF) (Eliasmith and Anderson, 2003). While this is usually used for constructing large-scale biologically realistic neuron models (Eliasmith et al., 2012), the NEF software toolkit Nengo (Bekolay et al., 2014) also allows for traditional feed-forward artificial neural networks using either spiking or non-spiking neurons. Spiking neurons are of considerable interest for vehicle prediction algorithms due to the potential for reduced power consumption when run on hardware that is optimized for spiking neurons (i.e., neuromorphic hardware).

For these NEF networks, we use a single hidden layer containing N neurons, with randomly generated (and fixed) input weights, and use least-squares optimization to compute the output weights. That is, given the hidden layer spiking activity a<sup>i</sup> for the ith neuron (i.e., a sequence of spikes)

$$a\_i\left(\mathbf{x}(t)\right) = \sum\_{j=1}^{m\_i} h(t) \* \delta(t - t\_j) = \sum\_{j=1}^{m\_i} h(t - t\_j),\tag{5}$$

where δ denotes the delta function, h(t) is the post-synaptic current produced by a single spike and t<sup>j</sup> are the m<sup>i</sup> spike times of the ith neuron, we compute the network output **y** with output weights **d**<sup>i</sup> as

$$\mathbf{y(t)} = \sum\_{i=1}^{N} a\_i \left(\mathbf{x}(t)\right) \mathbf{d}\_i. \tag{6}$$

If we have a desired **y**(t) for every given input to the network, then we can provide that input, measure the resulting hidden layer activity for each input, and then find the optimal **d**<sup>i</sup> values to make the network output match the desired output. This is a much faster alternative to using gradient descent rules (such as backpropagation). In particular, we find the **d**<sup>i</sup> that minimize

$$E = \int \left( \mathbf{y}(t) - \sum\_{i=1}^{N} a\_i \left( \mathbf{x}(t) \right) \mathbf{d}\_i \right)^2 d\mathbf{x}(t). \tag{7}$$

As with any traditional network, we can have any number of input, output, and hidden neurons, all following this same process. The goal here is to provide a simple baseline for comparison to the LSTM networks, to see what (if any) performance gain is produced by the more complex network approach.

#### 2.2.3. Mixture-of-Experts Online Learning

Given that we have multiple models p<sup>i</sup> for i = 1, . . . , M for predicting vehicle positions, we also define mixture-of-experts models. These are models where the output is a weighted sum of the outputs from other models

$$\mathbf{v}\_{\text{mix},t} = \sum\_{p} \mathbf{W}\_{p,t} \mathbf{v}\_{p,t},\tag{8}$$

where **W**p,<sup>t</sup> is the weight and **v**p,<sup>t</sup> is the output value of the prediction model p for prediction time t. If each model produces a prediction of the x and y positions at N different time steps into the future and we have M models, w will be an M ×N ×2 tensor. In other words, the particular weighting of models for predicting 0.5 s into the future may be very different from the weighting when predicting 5.0 s into the future.

The simplest way to generate these weights is to use standard delta-rule learning

$$
\Delta \mathbf{W}\_{p,t} = \kappa \mathbf{v}\_{p,t} \underbrace{(\mathbf{v}\_{observed,t} - \mathbf{v}\_{mix,t})}\_{=\epsilon\_I} = \kappa \mathbf{v}\_{p,t} \epsilon\_I. \tag{9}
$$

where κ is a learning rate and ǫ<sup>t</sup> = **v**observed,t−**v**mix,<sup>t</sup> is the current prediction error, that is, the error between the mixture model's output **v**mix,<sup>t</sup> and the target vehicle's actual position **v**observed,<sup>t</sup> . For this paper, we initialize the weights **W**p,<sup>t</sup> to be 1/M (i.e., an equal weighting across all M models).

The above model attempts to find the best weighting of the offline models based only on the prediction error. However, it is also possible to learn a weighting that is based on the current context. That is, instead of learning **W**, we can learn the function f**W**(**c**) = **W**, where **c** is some currently available sensor information.

Since neural networks are good function approximators, we implement this context-sensitive mixture-of-experts model as a single-hidden-layer artificial neural network (ANN) whose inputs are **c** and whose outputs are **W**. As with the contextfree mixture-of-experts model, we initialize the output to always produce 1/M, and then train the network based on the prediction error.

Importantly, this context-sensitive mixture-of-experts model is meant to be trained online. That is, we do not pre-train it on a large corpus of data and then fix the final weights. Instead, we run the neural network training while the system is running, just like the context-free version. Indeed, the context-free version is equivalent to the context-sensitive model if the context is kept constant.

While any neural network learning algorithm could be used here, for simplicity we use delta rule again, and note that the delta rule is the first step of the classic backpropagation neural network learning algorithm. In other words, we only adjust the weights between the hidden layer and the output layer, and leave the other set of weights at their initial randomly generated values. This greatly reduces the computation cost of performing the online learning.

**Figure 4** shows a schematic visualization of the mixture-ofexpert model's architecture. Yellow boxes indicate the individual components of the model, while the solid red line depicts the connection to decode out the weights **W**p,<sup>t</sup> for the individual expert predictors from the neural population encoding the context **c** as indicated by the green circles in the context component. Finally, the dotted green arrow indicates that the error signal is used to update the weights of this connection using delta-rule learning.

## 2.3. Data and Pre-processing

In this work, we use two different data sets for training and evaluation of our system, which we describe in more detail in the subsequent sections. We refer to those data sets as Onboard or D<sup>1</sup> (see section 2.3.1), which is our main data set, and NGSIM or D<sup>2</sup> (see section 2.3.2), which is a publicly available data set used for reference and comparability. In this section, we describe both data sets regarding available features, available sources of information as well as their key differences and the preprocessing procedure.

#### 2.3.1. On-Board-Sensors Data Set

This is our main data set used in this work. It contains realworld data gathered using the (ego-) vehicle's on-board sensors during test drives mainly on highways in southern Germany. The data contains object-lists with a variety of features obtained from different sensor sources. Apart from features about motion and behavior of the dynamic objects in the scene like position, velocity and acceleration, which are estimated from light detection and ranging (LIDAR) sensors, there is also visual information like object type probabilities or lane information, which is acquired from additional camera sensors. More detailed information on the test vehicle's sensor setup can be found in Aeberhard et al. (2015). The fused information about objects is available at a frequency of roughly 5 Hz. The main feature of this data set is that all information about other vehicles, such as position or velocity are measured with respect to the ego-vehicle and its coordinate system. The On-board data set contains 3,891 vehicles, which yield a total length of roughly 28.3 h when adding up the time each individual vehicle is visible.

#### 2.3.2. NGSIM US-101 Data Set

The next generation simulation (NGSIM) US-101 data set (Colyar and Halkias, 2017) is a publicly available data set recorded on a segment of ∼640 m length with 6 lanes on the US-101 freeway in Los Angeles, California. Although the data set was originally intended for driver behavior and traffic flow models (He, 2017), it has also been used to train trajectory predictions models (Altche and de La Fortelle, 2017; Deo and Trivedi, 2018b). The data set was recorded using cameras observing freeway traffic from rooftops with trajectory-data being extracted later from the obtained video footage. It holds a total of 45 min of driving data split into three 15 min segments of mild, moderate and congesting traffic conditions. Apart from positional information in lateral and longitudinal direction (in a global and local coordinate system), additional features like instantaneous velocity, acceleration, vehicle size as well as the current lane are available for each vehicle. The trajectory data is sampled with a frequency of 10 Hz. The main difference to the On-board data set is the fact, that the NGSIM data set is recorded with an external stationary camera instead of on-board sensors of a driving vehicle. Thus, there is no ego-vehicle present in the data and all information are available in absolute coordinates instead of being measured relative to one particular ego-vehicle. The NGSIM data set contains 5,930 vehicles and therefore a total time of roughly 91.3 h when adding up the time each individual vehicle is visible.

#### 2.3.3. Pre-processing

In this section, we describe the preprocessing steps performed a priori to prepare the information from our two data sets as neural network input. Although we aim to keep these preprocessing steps as consistent as possible across the data sets, there are some mild differences, which we will also point out. We aim to predict future positions of dynamic objects 5 s into the future based on their positions 5 s prior to their current location. As the two data sets are sampled at different frequencies, we interpolate the available data over 20 equidistant steps to achieve intervals of 0.25 s to improve consistency and comparability. Furthermore, we translate the current position of the target vehicle (the vehicle to be predicted) into the origin, i.e., position (0, 0) (see **Figure 1**), to prevent our models from treating similar trajectories differently due to positional variations. Finally, to improve suitability of the data as input for neural networks, we divide all x-positions by a factor of 10 such that x-/y-values are scaled to a similar order of magnitude. Thus, one data sample consists of a sequence of length 20 of positional information over the past 5 s, which is used as input for our models with different encoding, and a sequence of 20 positions 5 s into the future used as labels or ground truth for the models to be trained with. For the NGSIM data set D2, we use only every 10th data point, to avoid the creation of too many overlapping, and therefore too similar, data samples. Furthermore, we converted all values to the metric system and swapped the dimensions of the positions in D<sup>2</sup> such that for both data sets x- and y-direction correspond to longitudinal and lateral positions, respectively. For training and evaluating our models, we split both data sets into training T<sup>i</sup> ⊂ D<sup>i</sup> and validation data V<sup>i</sup> ⊂ D<sup>i</sup> containing 90% and 10% of the objects, respectively to avoid testing our models on vehicles they have been trained with.

#### 2.3.4. Encoding Schemes

We use different encoding schemes of the positional input data in this work. The main encoding scheme is the convolutive vector-power representation as depicted in section 2.1.2. To avoid accumulation of noise while focusing on the vehicles most relevant for prediction, we only use objects closer than 40 m to the target vehicle in the On-board data set. For the NGSIM data set D2, we additionally include only objects on the same lane as the target vehicle and on adjacent lanes. Thereby, we aim for consistency across both data sets and we keep the input data as comparable as possible to what a driving vehicle could be able to detect using its on-board sensors.

For the On-board data set D1, we use two different variants of this representation, which differ in that the ego-vehicle's position is used or excluded in the other objects part of Equation (4), yielding two sequences (**S** ego t ) tN t0 and (**S**t) tN t0 . We used Nengo's SPA package for implementation and therefore refer to these encoding two schemes (**S**t) tN t0 and (**S** ego t ) tN t0 as "SPA-power" and "SPApower-with-ego," respectively. As the NGSIM data set D<sup>2</sup> does not contain an ego-vehicle, we only investigate the "SPA-power" encoding scheme there.

For a simple reference vector-representation, we add the positional vectors **X** and **Y** scaled with the target vehicle's prior positions (x<sup>t</sup> , yt) at each time step t, yielding the sequence **S**˜ <sup>t</sup> = xt · **X** + y<sup>t</sup> · **Y**. Finally, we also use plain numerical position values p<sup>t</sup> = (x<sup>t</sup> , yt) as input data. Note, that only the SPA-power representation variants (**S**t) tN t0 and (**S** ego t ) tN t0 contain positional information about vehicles other than the target.

## 3. EXPERIMENTS AND RESULTS

In this section, we describe the training process and parameters of all our models and give a detailed analysis and evaluation of the results achieved. The LSTM models are implemented in Tensorflow (Abadi et al., 2016) whereas the NEF models and the mixture-of-experts online learning model are implemented using the Nengo software suite (Bekolay et al., 2014). We use the root-mean-square error (RMSE) as our main metric for evaluation purposes. In contrast to earlier work, we inspect the RMSE for lateral and longitudinal directions separately to give more detailed insights into the models' behavior. Calculating the RMSE of the Euclidean distance would absorb the influence of the lateral RMSE since it is an order of magnitude smaller than the longitudinal RMSE, while we consider both directions to be at least equally important. The lateral RMSE is even more informative regarding the models' performance on, for instance, lane change maneuvers. Note however, that this means that the yaxes in **Figures 5**, **6**, **8**–**11** show a different order of magnitude for lateral (RMSE X) and longitudinal (RMSE Y) direction. Finally, we investigate where the models shows their best performance looking for correlations between prediction accuracy and specific driving situations.

**Table 1** summarizes the models evaluated in this section. The models LSTM SPA 1–3 as well as LSTM numerical employ the same network architecture as described in section 2.2.1 with sequential information as input data (using the different encoding schemes presented in section 2.3.4) and are analyzed in section 3.2.1. The models NEF SPA 1 and 2 employ the simpler, single-layer, feed-forward architecture as described in section 2.2.2 with a vector obtained as partial sum of vectors from the whole sequence used as input for the LSTM models (see section 3.1.2 for further details). Finally, mix online denotes the mixture-of-experts online learning model as described in section 2.2.3 using the predictions from some of the aforementioned offline models as input (see section 3.1.3 for further details). The models will be denoted in figure legends by their short name given in **Table 1**.

In section 2.3.4, we have described the different encoding schemes we will use to evaluate our models. We mentioned that the models employing the convolutive power to encode the input data are (i.e., LSTM SPA 1, 3 and NEF SPA 1 and 2) are the only ones having access to information about objects other than the target vehicle. Although these model therefore have access to more data than the other reference models, such as LSTM numerical, we are interested in evaluating the benefits of encoding the interconnections between vehicles implicitly in the input data using our semantic vector encoding instead of introducing a more complex network architecture. Therefore, we focus on the same network architecture for all encoding schemes in this paper and leave a comparison with more sophisticated network architectures, for instance, ones combining LSTM with social pooling layers as in Deo and Trivedi (2018a) or Alahi et al. (2016) for future work.

#### 3.1. Model Training 3.1.1. LSTM Networks

We trained several instantiations of our LSTM-network architecture as described in section 2.2.1 on the On-board data set D<sup>1</sup> in advance to find an optimal set of parameters. We varied the number of layers, the number of hidden dimensions and the number of epochs for the models to be trained. We found, that increasing the number of layers does not improve the models' performance on the validation data, even when training longer using more epochs. On the contrary, models with more layers needed more training time to achieve a performance on the validation data comparable to the networks with less layers. Thus, a LSTM model with one encoder and decoder cell each is not only the simplest network architecture but also the best in terms of accuracy as well as time needed for training.

For this architecture, we found that the network performs best with 150 dimensions in the encoder and decoder cell each. Furthermore, we employed early stopping, that is, we trained our models for 10 epochs as we found that the models' performance stagnate on both, training and validation data sets, when training for up until a total 20 epochs. **Figure 5** visualizes this result by showing the development of the RMSE of the LSTM SPA 1 model during the training process for the training set T<sup>1</sup> (**Figures 5A,C**) and validation set V<sup>1</sup> (**Figures 5B,D**) of the On-board data set D1. On the y-axis of each sub-figure, we have the RMSE while the xaxis from left to right depicts the result after each epoch during the training process. Each colored line illustrates the RMSE of the model for one particular prediction time step while all points with the same value on the x-axis depict the model's performance after the respective epoch during the training process.

Using the aforementioned network architecture and hyperparameter set, we train one model instantiation for each encoding scheme mentioned in section 2.3.4, whereas only the input dimensionality of the encoder cell changes when varying the representation of the input data. Importantly, we focus on positional information as the only input for our LSTM models in this work for reasons of consistency to make all models as comparable as possible. Hence, we neglect for example the history of the target (or ego-) vehicle's velocity or acceleration as input here. Between the two data sets, the only difference between models is the auxiliary data, that is used as additional input to the LSTM decoder cell at each time step. For both data sets, we use the instantaneous velocity of the target vehicle to aid the model predicting the future trajectory at every time step. As there is no ego-vehicle present, we use no further auxiliary data for the NGSIM data set D2. For the On-board data set D1, we use the ego-vehicle's predicted acceleration and the estimated curvature of the ego-vehicle's current lane. Although this is future information, we argue that it is solely about the ego-vehicle, which we expect to be available at the time the prediction is to happen. We assume, that an automated vehicle, in order to safely navigate, will have an estimation of the future lane curvature as well as the acceleration values of its own planned trajectory.



#### 3.1.2. NEF Networks

For our NEF networks, the main parameters influencing the models' performance are the number of neurons in the learning population (i.e., the hidden layer in terms of traditional neural networks), and the neuron model. For simplicity, we use Nengo's rate-variant of the leaky-integrate-and-fire (LIF) neuron model. From the NEF-theory (Eliasmith and Anderson, 2003) we know that increasing the number of neurons in a population yields a more accurate representation of the data encoded in the population's activity. Thus, we expect more accurate predictions when increasing the number of neurons. In our experiments, we found that a number of 3,000 spiking neurons is sufficient and further increasing the number of neurons does not improve the model's prediction accuracy. The neural weights are calculated using Nengo's default least-squares-optimization method with the exception, that we calculate the weights over smaller subsets of the input data for computational reasons.

We train two different variants of our simpler NEF-models using numerical input (NEF SPA numerical) as well as the SPApower-with-ego (NEF SPA 1) and SPA-power encoding (NEF SPA 2) for the On-board and the NGSIM data set, respectively. Here, we adapt the input data such that for the model NEF numerical, we use the vector (xt<sup>0</sup> , . . . , xt<sup>N</sup> , yt<sup>0</sup> , . . . , yt<sup>N</sup> , v) as input with v denoting the instantaneous velocity of the target vehicle. For the NEF SPA 1 and 2 models, instead of flattening the whole sequence of vectors into a giant single vector, we created a single semantic vector by summing the first, middle, and last element of the original vector sequences

$$
\hat{\mathbf{S}} = \mathbf{S}\_{t\_0} \oplus \mathbf{S}\_{t\_{N/2}} \oplus \mathbf{S}\_{t\_N} = (\hat{\mathbf{s}}\_0, \dots, \hat{\mathbf{s}}\_{D-1}).\tag{10}
$$

We only sum up these vectors instead of the whole sequence (**S**t) tN t0 to avoid the accumulation of noise in the vector representation. Note that thereby the NEF model using the SPApower representation does not use the full trajectory history but only selected time steps, namely those visualized in **Figure 2**. To make these simpler models as comparable as possible to the LSTM models in terms of information available to the network, we add the instantaneously velocity v of the target vehicle as an additional element to the input, which yields (sˆ0, . . . ,sˆD−1, v) as input of our model, since there is no intermediate embedding vector here where it could be included.

#### 3.1.3. Mixture-of-Experts Online Learning

There are two different possible variants to our mixture-ofexperts online learning model. One issue of such a learning system is that the actual position information of the target vehicle **v**observed,<sup>t</sup> and thus the error ǫ<sup>t</sup> in Equation (9) is not available at the time the model makes its predictions, since it is future data. In this paper, we show a first prototype that, for simplicity, ignores this delay issue and assumes that position information of the target vehicle **v**observed,<sup>t</sup> actually is available at prediction time. In the future, we aim to investigate an online learning system that updates its weights **W**p,<sup>t</sup> once the error signal ǫ<sup>t</sup> gradually becomes available. However, the architecture of the model itself remains the same. The only difference to the prototype shown here is the time when Equation (9) is applied to update the neural weights. For the context-sensitive mixture-of-experts model, we use information about the current driving situation as identified in section 3.2.1 and **Figure 7** as context **c** for the learning system.

For the NGSIM data set, we use the distance between the targetvehicle and the closest other vehicle as well as the number of surrounding relevant vehicles as context information. Relevant means that those vehicles that are included in the SPA-power representation are counted (see section 2.3.4). For the On-board data set, the distance between the target and the ego-vehicle is additionally included in the context.

In this work, we employ the pre-trained LSTM models using numerical inputs (i.e., LSTM numerical), the best-performing SPA-power encoding scheme for each data set (i.e., LSTM SPA 3 for the On-board data set and LSTM SPA 1 for the NGSIM data set), and a simple linear prediction as input experts for our online learning prototype. For training the model, we simulate online deployment by presenting the offline models' predictions on the validation subsets to the system. Thereby, the individual experts perform their prediction on previously unseen data samples. We conduct individual simulation runs for both data sets.

#### 3.2. Evaluation

In this section, we evaluate the performance of our models and conduct a thorough analysis of the results achieved. For all evaluations in this section, we refer to the longitudinal and lateral direction as x- and y-direction, respectively.

#### 3.2.1. LSTM Models

**Figure 6** visualizes the RMSE of all LSTM-based models on the validation-set V<sup>1</sup> ⊂ D<sup>1</sup> of the On-board data set using 512 dimensional vectors. **Figures 6A,C** show the performance on the complete validation-set in x- and y-direction, respectively, whereas **Figures 6B,D** depict only situations with at least 3 other vehicles present, the distance between the target and the egovehicle being lower than 20 m and the distance between the target and the closest other vehicle being <10 m, again for xand y-direction, respectively. We observe that all approaches yield comparable results with notable differences in certain situations. Although the SPA-power encoding schemes (LSTM SPA 1 and 3) tend to perform worst in x-direction, we observe that they perform better in y-direction in crowded situations with closely driving vehicles with LSTM SPA 3 ranking best along LSTM numerical.

To further investigate this result, we evaluated certain metrics, chosen to characterize crowded and potentially dangerous situations, for items in the validation set, where the LSTM

SPA 3 model outperforms all other approaches with respect to the RMSE in y-direction (see **Figure 7**). We observe that the number of samples, where the distance between the target and the ego vehicle and/or the closest other object being small is significantly higher when the LSTM SPA 3 model outperforms all other approaches. For samples where the LSTM SPA 3 model performs best, the number of samples with a distance <20 m between the target- and ego-vehicle is 50.5 % higher compared to samples where any of the other models performs best. For distances <20 m between the target vehicle and the closest other vehicle, the number of samples is still 11.4 % higher when the LSTM SPA 3 model performs best. Finally, the number of situations with at least 3 other vehicles present is also 7.8 % higher compared to samples where any other model performs best. Thus, we consider these characteristics suitable candidates to serve as context variables on which our onlinelearning mixture-of-experts system could base its weighting decision on. However, we aim to investigate more sophisticated options, such as clustering methods in future work to uncover other, potentially more meaningful features compared to the ones shown in this paper, distinguishing between situations where LSTM SPA 3 performs best compared to another model showing the best performance.

**Figure 8** visualizes the RMSE of all LSTM-based models on the validation-set V<sup>2</sup> ⊂ D<sup>2</sup> of the NGSIM data set for 512 dimensional vectors (**Figures 8A,C**) and for 1,024-dimensional vectors (**Figures 8B,D**). We observe, that all LSTM models achieve a very similar performance (almost identical in ydirection) with LSTM SPA 1 achieving the best performance in x-direction being on par with the numerical encoding for 512-dimensional vectors. For 1,024-dimensional vectors, LSTM SPA 1 even slightly outperforms all other approaches in x-direction, whereas we do not observe significant improvements in y-direction.

#### 3.2.2. NEF Networks

**Figure 9** visualizes the RMSE of our NEF-network models on both data sets. The NEF-network using the SPA-power encoding

schemes processes 512-dimensional for the On-board (NEF SPA 1) and 1,024-dimensional vectors for the NGSIM data set (NEF SPA 2). For reference, we included the performance of the most relevant LSTM models, namely LSTM SPA 1 and 3 for the NGSIM and On-board data set, respectively as well as LSTM numerical, in **Figure 9** as well. We observe that, despite a simpler network architecture and learning algorithm, the NEF-networks achieve a performance comparable to the more sophisticated LSTM models on both data sets. For the NGSIM data set, the NEF SPA 1 model performs on par with its LSTM model counterpart LSTM SPA 3. In this case, the NEF-model is not only simpler, but also has access to less information as its input data is a sum of a subset of the input sequence used for the corresponding LSTM-model.

#### 3.2.3. Mixture-of-Experts Online Learning

**Figure 10** shows the RMSE on selected slices of the validationsets achieved by our context-sensitive mixture-of-experts online learning prototype, which assumes the error signal is available at the time the prediction needs to happen in comparison to the offline models. The four left plots (**Figures 10A,B,E,F**) show two data slices of the validation set D<sup>1</sup> of the On-board data set: **Figures 10A,E** show the RMSE at the start of training process while **Figures 10B,F** show the RMSE performance on the first 70 vehicles. Similarly, the four right plots (**Figures 10C,D,G,H**) show two data slices of the validation set D<sup>2</sup> of the NGSIM data set: **Figures 10C,G** show the RMSE at the start of the training process while **Figures 10D,H** show the RMSE on the first 92 vehicles. From **Figures 10A,E,C,G** we observe, that the model needs some time for adapting its weights yielding a RMSE performance worse than the individual experts for both data sets. However, the model's performance improves quickly and clearly outperforms all individual experts in xdirection while achieving RMSEs as low as the best individual experts in y-direction after a comparably low number of vehicles presented to the system. **Figures 10B,F** illustrate this result for the On-board data set, while **Figures 10D,H** show comparable results achieved by the mixture model on the NGSIM data set.

To get a better idea of how our model weights the individual predictors, we inspect one example driving situation. **Figure 11** visualizes the performance of our mixture-of-experts online learning prototype on one particular example of the Onboard data set. We use a situation not directly after the start of the training process, i.e., the mixture model was already exposed to some vehicles and thus was able to consolidate its weights. **Figures 11A–C** show the driving situation with the vehicles' true trajectories as well as the trajectory predictions

given by the offline models and the mixture-of-experts online learning prototype. **Figures 11D,E** show the absolute error of all approaches while **Figures 11F,G** visualize, how the model weights the individual experts for every prediction time step in this particular driving situation. We observe that the overall trend of our model shows in this example as well. The mixtureof-experts prototype achieves significant improvements in the x-direction while achieving RMSEs comparably low as the best individual expert in y-direction (**Figures 11D,E**). Furthermore, **Figures 11F,G** show that the model weights the expert predictors independently at individual time steps and hence is able to pick the best possible predictor at each time step. However, we also observe, that the error of the mixture-of-experts model in the y-direction is higher than the best individual predictor and that the weighting, especially for later prediction steps, could be improved.

## 4. DISCUSSION

For both data sets used in this paper, we observe that already the simple linear prediction models achieve solid accuracy, especially in longitudinal direction. This makes sense as both data sets almost exclusively contain highway driving situations, which in turn consist mainly of straight driving and rather rare lanechange maneuvers. For straight driving, linear prediction based on a constant velocity assumption is already a solid prediction approach, especially if all dynamic information (position, velocity etc.) are given relative to an already moving ego-vehicle like with the On-board data set D1. **Table 2** summarizes the composition of both data sets.

For the On-board data set, in 86.1 % of all data samples the target vehicle does not perform a lane, i.e., only 13.8 % of all data samples contain a lane change performed by the target vehicle. We further distinguish between lane changes performed during the trajectory history, i.e., the past 5 s before the current time step (labeled as past in **Table 2**) and lane changes that are performed in the future, i.e., the future 5 s from the current time step (labeled as future in **Table 2**). For the NGSIM, the percentage of samples without a target vehicle lane change is 95.1 % while only 4.9 % of the samples contain a lane change performed by the target vehicle at all. The amount of samples containing a future lane change performed by the target vehicle is only 2.6 % of all samples in the NGSIM data set.

For the offline models, simple feed-forward NEF models and more sophisticated LSTM models alike, we observe that most improvements over the linear model are achieved in y-direction. That makes sense as linear prediction is unable to account for

FIGURE 10 | Visualization of the RMSE of the context-sensitive mixture-of-experts online learning system on selected data-slices from the validation sets. The upper row shows the RMSE in *x*-direction (A–D), while the lower row shows the RMSE in *y*-direction (E–H). Panels (A,E) show the RMSE on the *On-board* data set at the start of training process while (B,F) show the RMSE performance on the first 70 vehicles. Similarly, panel (C,G) show the RMSE on the *NGSIM* data set at the start of the training process while (D,H) show the RMSE on the first 92 vehicles.

lane-changes or driving curves, which are mainly characterized by non-linear changes in lateral direction. We found that the LSTM models based on our SPA-power representation (LSTM SPA 1 and 3) achieve promising results on both data sets. However, for the On-board data set, this encoding scheme achieves its best result in crowded and potentially dangerous

FIGURE 11 | Visualization of the context-sensitive mixture-of-experts online learning system on one particular driving situation from the *On-board* data set. Panels (A–C) depict the driving situation with the vehicles' true trajectories as well as the trajectory predictions given by the offline models and the mixture-of-experts online learning prototype. Panels (D,E) show the absolute error of all prediction models on that data-sample. Panels (F,G) visualize, how the mixture model weights the individual experts for every prediction time step in this particular driving situation.



driving situations, without clearly outperforming the other approaches on the whole data set (see section 3.2.1 and **Figure 6**). Given these finding, we investigated situations, where the LSTM SPA 3 model does outperform all other approaches in y-direction and thereby came up with metrics characterizing such crowded situations (see **Figure 7**). This result did not hold that clearly on the NGSIM data set D2, since the LSTM models achieve an almost identical performance in y-direction on this data set.

Nevertheless, we used the identified characteristics as context information for our first prototype of a mixture-of-experts online learning system based on simple delta-rule learning. For simplicity, the prototypical model shown here ignores the fact that measurements of the actual trajectory and thus the error signal for the learning system is future data, i.e., only available with a timing delay, and applies Equation (8) instantaneously. We tested and evaluated this prototype on both data sets achieving comparable results. We found that already shortly after initialization, the online learning system is able to adapt its weights to significantly improve its performance over the individual expert systems. Interestingly, the mixture-of-experts model achieves the most improvements over the individual experts in the x-direction although the characteristics used as context were derived from analyzing the LSTM models' performance in the y-direction. We assume that this is due to the fact, that the individual LSTM experts already show a closer-to-optimal performance in the y-direction with less room for improvements. Furthermore, the sample situation shown in **Figure 11** exemplifies another potential problem of the current model in the y-direction: with a distance of 12.8 m between the target and the closest other vehicle, a distance of 55.8 m between the target and the ego-vehicle and only one other vehicle present, this is not a typical situation for the LSTM SPA 3 model to perform best in the y-direction (cf. **Figure 7**) and thus this expert might not be weighted strongly enough by the model. However, these effects demand for further and more detailed investigation.

Another interesting result of our experiments is the fact, that the simple, feed-forward NEF networks show results comparable to the more sophisticated LSTM models. For those simple models, the SPA-power representation (NEF SPA 1 and 2) shows promising results comparable to the NEF numerical model on the On-board data set and clearly outperforming it on the NGSIM data set (**Figure 9**). Although the NEF models do not clearly outperform the LSTM models (which would be surprising), it is quite remarkable that they achieve results comparable to the more sophisticated models with a simpler network architecture, training procedure and, partly, less information. These results make those simple models using our proposed vector-representation as well as a numerical encoding scheme (possibly in combination with an online learning system like the one proposed in this paper) potential candidates to be deployed on dedicated neuromorphic hardware in mobile applications, as they can be efficiently implemented in a spiking neuron substrate. This could be an interesting, power-efficient approach in future automated vehicles.

## 4.1. Conclusion

In this paper, we showed a novel approach to encapsulate spatial information of multiple objects in a sequence of semantic pointers of fixed vector length. We used a LSTM sequence to sequence model as well as a simple feed-forward spiking neural network to predict future vehicle positions from this representation. For each of those models, we implemented at least one reference model using other encoding schemes to compare their performance to. Furthermore, we compared all our models to a simple linear prediction based on a constant velocity assumption. We evaluated our models on two different data sets, one recorded with on-board sensors from a driving vehicle and one publicly available trajectory data set recorded with an external camera observing a highway segment and conducted a thorough analysis. Finally, we used our pre-trained LSTM networks as basis for a mixture-of-experts online learning prototype and compared its performance to the individual expert systems. We consider our main contributions the proposed representation of spatial information for multiple objects in semantic vectors of fixed length using the convolutive power, the rigorous and detailed analysis of several simple and more advanced models, and the prototype of our online learning system.

## 4.2. Future Work

Although the results presented in this paper show promise, there are several directions for future work. Regarding our LSTM models, we aim to investigate if increasing the vector dimension further leads to improved model performance on the On-board data set, as the results on the NGSIM suggest that there is potential for improvements (see **Figures 8A,B**). Furthermore, our preliminary hyperparameter experiments suggest, that there is potential for improvements by incorporating the history of the target and/or ego-vehicle's velocity and/or acceleration. Therefore, we could investigate possibilities of how to encode such information in a semantic vector substrate. Another interesting option for the offline models is to investigate if a reduced, more balanced data set could improve the models accuracy or at least speed up the training process. As mentioned in section 4 and **Table 2**, both data sets are slightly unbalanced as they are dominated by straight driving, which is most common in highway situations. One possibility could be to use the current data sets and focus the training procedure on "interesting scenes," i.e., situations where for example a lane change is happening by for instance looking for data samples with significant differences in the lateral positions. Another option is to improve our current models to predict a probability distribution of the future positions instead of point predictions of raw position values to take uncertainties into account. Finally, we could also compare our current models to other state-of-the-art models, which combine LSTM and social pooling layers, which we did not include in the work at hand.

Regarding our mixture-of-experts online learning prototype, we have shown a simplified version ignoring the fact that the error signal is future data and thus can not be used instantaneously, but rather becomes gradually available over time. Although the network architecture and learning approach would remain unchanged, the timing when the weights' update happens needs to be implemented and investigated if and how this affects the models performance. However, the results achieved in this paper serve as an upper bound for the performance to be expected from models that have to deal with delayed error signals, that is, that the target vehicle's true motion is future data and thus not available at prediction time, The issue of delayed error signals was mentioned but, for simplicity, not addressed in this work. However, assuming that a model overpredicting the near future most likely will also overpredict for later time steps, we could also experiment with model variants that update the weights for later prediction steps based on the error signal for earlier prediction steps before the error signal actually becomes available. Another direction could be to investigate if and how different context information affect the model's performance.

Since advanced driver assistance systems and, more generally, automated driving are mobile applications with tight energy restrictions, we finally aim to investigate if and how our current implementation could be deployed on dedicated, energy-efficient neuromorphic hardware for mobile, in-vehicle applications.

#### REFERENCES


## AUTHOR CONTRIBUTIONS

FM has designed and implemented all the model variants in Tensorflow and Nengo, designed and performed the experiments, pre-processed data, evaluated results, and wrote the manuscript. PB has designed the numerical LSTM models in Tensorflow and assisted in data pre-processing, experiments and evaluation, and revised the manuscript. TS has designed the models in Nengo, assisted in data pre-processing, experiments and evaluation, and contributed in writing the manuscript. JC coordinated and supervised the research work, and revised the manuscript.


**Conflict of Interest:** FM was employed by BMW AG. PB and TS were employed by Applied Brain Research Inc.

The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Mirus, Blouw, Stewart and Conradt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.