The DIAMOND Model: Deep Recurrent Neural Networks for Self-Organizing Robot Control

The proposed architecture applies the principle of predictive coding and deep learning in a brain-inspired approach to robotic sensorimotor control. It is composed of many layers each of which is a recurrent network. The component networks can be spontaneously active due to the homeokinetic learning rule, a principle that has been studied previously for the purpose of self-organized generation of behavior. We present robotic simulations that illustrate the function of the network and show evidence that deeper networks enable more complex exploratory behavior.


INTRODUCTION
Deep neural architectures (Fukushima and Miyake, 1980; Hinton et al., 2006) have reached a level comparable to human performance in certain pattern recognition tasks (Krizhevsky et al., 2012). In robotic applications, too, deep networks are gaining importance, from state abstraction to seamless end-to-end control in complex repetitive tasks (Levine et al., 2016). Moreover, it has been speculated whether deep feed-forward networks can account for some aspects of information processing in the mammalian visual system (Serre et al., 2007), which is not to say that the brain is nothing but a collection of deep neural networks. Quite to the contrary, the brain is known to have dynamical properties that are much richer than those of standard deep architectures:
• Biological neural systems consist of patches of interconnected neurons which also receive re-entrant connectivity via other patches.
• Spontaneous behavior can occur at any level of depth and may spread in either direction.
• Sensory inputs not only provide information for decisions about actions, but are also analyzed for effects of previous actions.
• A hierarchical organization enables lateral transferability and flexible compositionality.
• There is little use for supervised learning.
Based on these considerations, we propose here an architecture that combines the undeniable strengths of deep neural networks with homeokinesis (Der, 2001), an approach to meeting the requirements of autonomous robots (see section 2). Our work connects to Carvalho and Nolfi (2016), where the introduction of flexibility and plasticity in a neural controller had a beneficial effect in a cleaning task. That work was, however, mainly based on an evolutionary approach, whereas we aim at a more principled architecture that achieves increased flexibility through a hierarchy of identical controllers. The autonomously generated activity of the higher-level controllers provides an intrinsic motivation (Oudeyer et al., 2007) for the lower ones. In this way, we are able to propose a more brain-like architecture which implicitly realizes a predictive coding principle, at least in some parameter range, as discussed below; compare Adams et al. (2013) for a related approach. An early interesting comparison is provided by Rusu et al. (2003), who present a neuro-fuzzy controller for determining the behavior of a robot in a navigation task. Their architecture had a similarly layered structure, although the behaviors had to be predefined at a time when homeokinesis (Der, 2001) was just being developed. More recently, differential Hebbian learning was used to explore possible behaviors of a robot (Pinneri and Martius, 2018), presenting a more brain-like approach at the low level, whereas we aim at a model that captures characteristics of the area-level organization of the brain.
In the following, we first consider the homeokinetically controlled sensorimotor loop (Der, 2001) as the basic element of the proposed system (section 2). In this way, we incorporate a source of spontaneous activity. The composition of these elements in the DIAMOND (Deep Integrated Architecture for sensoriMotor self-Organization aNd Deliberation) architecture (section 3) is thus able to generate activity at all levels and to work in a fully self-supervised way, although it is also possible to steer the system toward a desired behavior by very small guiding inputs (Martius and Herrmann, 2011). The main layout of the architecture includes a basic layer that receives information from the outside world, sends actions to it, and is expected to represent low-level features. There is a variable number of deeper layers that interact only with their neighboring layers and represent more abstract features that are extracted from the data through the lower layers. The architecture learns by the homeokinetic learning rule (see below), which implies that consistency between neighboring layers is required. We present a few experimental results in section 4, and discuss the realism and performance of the architecture as well as further work in section 5.

HOMEOKINETIC CONTROL
The basic element of our architecture is formed by a homeokinetic controller, which we describe here only briefly; details can be found in (Der and Martius, 2012). This unsupervised active learning control algorithm shapes the interaction between a robot and its environment by updating the parameters of a controller and of an internal model. The learning goal can be characterized as a balance of predictability and sensitivity with respect to future inputs. The resulting behavior is random yet coherent both temporally and across multiple degrees of freedom. The controller is a parametric function of the vector $x_t$ of current sensory states of the robot. It generates a vector of motor commands $y_t$ in dependence on the current values of the parameter matrix $C$. The update of the parameters is based on the sensitivity of the distance between inputs and their predictions by means of an internal model. This model produces a prediction of future states $\hat{x}_{t+1}$ based on the current input $x_t$ or action $y_t$ or both, and a parameter matrix $M$.
The difference between the actual and the estimated state defines the prediction error $\xi$, which gives rise to the two complementary objective functions that are relevant here. The first is the prediction error (4), which is used to adapt the parameters $M$ of the internal model (2). The second is the time-loop error $E_t$, which is based on a post-diction $\check{x}_t$ of the previous input $x_t$ obtained via the inverse of Equation (2) given the new input $x_{t+1}$; that is, $E_t$ is calculated only at time step $t+1$. It is related to the prediction error (4) through $J$, the linearization of the map from the current input to the next input, which depends on the current controller. As only $\eta = J^{-1}\xi$, i.e., the projection of $J^{-1}$ on $\xi$, is relevant, the time-loop error can be estimated efficiently. The homeokinetic learning rule updates the parameter matrix $C$ of the controller (1) by gradient descent, $\Delta C_{ij} = -\varepsilon_C \, \partial E_t / \partial C_{ij}$ (7), where $C_{ij}$ is an element of $C$ and $\varepsilon_C$ is a learning rate.

FIGURE 1 | Schematic representation of multi-layer homeokinetic learning. Left: In the elementary sensorimotor loop, a control action $y_0$ is calculated by the controller $C_1$ and executed in the environment $W$, which then produces the new input. The prediction error is obtained as the difference between the new sensory input and its prediction $x_1$, which was obtained from the previous input $x_0$. It is used in the update of the model $M$, see Equations (13), (14). Right: In homeokinetic learning, the time-loop error, i.e., the difference between the previous input $x_0$ and the re-estimated previous input (which is obtained via the downward arrows and corresponds to $\check{x}_t$ in Equation 5), is used to update the controller parameters, see Equations (11), (12). The curved downward arrows indicate the time step: The "new" input that was previously predicted or obtained from the environment is now used by the controller to produce the next action (rather than the re-estimated input). The inner layers follow exactly the same dynamics, based on predictions from the respective outer layers rather than on the environment. Top-down effects are included by additional connections. This includes virtual actions (arrows from $y_i$ to $M_i$), analogous to the initiation of actions in the environment, and virtual states taken into account by the controller (arrows from $x_i$ to $C_i$). The activities are propagated alternatingly through the upward (orange, violet, and brown) arrows and through the respective transposed matrices via downward arrows (cyan), both of which correspond to a set of parallel fibers, whereas the adaptive interconnections are maintained in the controller (C nodes) or the model (M nodes).

If the representational power is of less importance than the flexibility (Smith and Herrmann, 2019), a simple quasi-linear system can be considered sufficient. Below, when we consider a multi-layered system, the representational power is meant to be achieved by the interaction between the layers, each of which consists of one instance of the current controller-predictor unit. A pseudo-linear controller, i.e., a quasi-linear function of the inputs with coefficients that are adaptive on the behavioral time scale, together with a linear model does thus not limit the complexity of achievable control. The parameters of the controller and the model are now the matrices $C$ and $M$, respectively, which are complemented by the matching bias vectors $c$ and $m$. In order to incorporate limitations of the actions of the robot, the controller is quasi-linear due to the elementwise sigmoidal function $g$, i.e., $y_t = g(C x_t + c)$. Because of the simple structure of Equation (8), we can omit here the state dependency (2) and define the model $M$ only in motor space, $\hat{x}_{t+1} = M y_t + m$. For this choice, the parameter update (7) takes an explicit form for $C$, and analogously for the bias term $c$. With $\mu = G' M^\top (J^\top)^{-1} \eta$ and $\zeta = C\eta$, the learning rules for a linear controller with a linear model are given by Equations (11) and (12). Simultaneously, but possibly with a different learning rate, the parameters $M$ and $m$ of the linear model (9) are updated via gradient descent on the standard prediction error (Equation 4, rather than Equation 6), $\Delta M = \varepsilon_M \xi y_t^\top$ (13) and $\Delta m = \varepsilon_M \xi$ (14), where $\varepsilon_M$ is the learning rate for the adaptation of the internal model. The ratio of the two learning rates $\varepsilon_C$ and $\varepsilon_M$ is known to be critical for the behavior of the controlled robot (Smith and Herrmann, 2019). For the architecture presented next, an optimized ratio is to be used, see also
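The adaptation of the internal model described in this section can be illustrated with a minimal Python sketch. It assumes g = tanh, a one-dimensional toy world that returns a scaled, noisy copy of the motor action, and a controller that is kept fixed; the function names, `world_gain`, and the noise level are hypothetical choices for illustration and are not taken from the published code. The controller update via the time-loop error is deliberately left out.

```python
import math
import random


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def controller(C, c, x):
    """Quasi-linear controller y = g(C x + c); g = tanh is an assumption."""
    return [math.tanh(z + b) for z, b in zip(matvec(C, x), c)]


def predict(M, m, y):
    """Linear model in motor space: x_hat = M y + m."""
    return [z + b for z, b in zip(matvec(M, y), m)]


def update_model(M, m, y, x_next, eps_M):
    """One gradient step on the squared prediction error.

    The constant factor 2 of the gradient is absorbed into eps_M.
    M and m are modified in place; the prediction error xi is returned.
    """
    x_hat = predict(M, m, y)
    xi = [xn - xh for xn, xh in zip(x_next, x_hat)]
    for i, err in enumerate(xi):
        for j, yj in enumerate(y):
            M[i][j] += eps_M * err * yj
        m[i] += eps_M * err
    return xi


def run_demo(steps=3000, eps_M=0.05, world_gain=0.8, noise=0.05, seed=0):
    """Hypothetical toy world: next input = world_gain * action + noise."""
    rng = random.Random(seed)
    C, c = [[1.0]], [0.0]   # controller kept fixed in this sketch
    M, m = [[0.0]], [0.0]   # the model starts without any knowledge
    for t in range(steps):
        x = [math.sin(0.1 * t)]   # sinusoidal drive of the sensor
        y = controller(C, c, x)
        x_next = [world_gain * y[0] + rng.gauss(0.0, noise)]
        update_model(M, m, y, x_next, eps_M)
    return M[0][0], m[0]
```

Calling `run_demo()` returns the learned model gain and bias; under the stated assumptions the gain should approach `world_gain` and the bias should stay close to zero.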

Deep Homeokinesis
The DIAMOND model is a generalization of the homeokinetic controller described in section 2. As shown in Figure 1, the comparison of a state variable $x(t)$ and its estimate $\hat{x}(t)$ is now repeated for estimates of estimates etc., $x_0(t) = x(t)$, $x_1(t) = \hat{x}(t)$, $x_2(t)$, ..., where each pair of neighboring layers corresponds to a homeokinetic controller that acts onto the lower layer as its environment and receives biases from the higher layer.
In the inner layers (larger ℓ) the external information becomes less and less dominant. In order to use homeokinetic learning in a multilayer architecture, several instances of the homeokinetic sensorimotor loop are stacked. The internal model of any lower layer serves as the "world" for the next higher layer. Likewise, the estimates of the input obtained by a lower layer are the inputs for the higher layers, so each layer reproduces the elementary loop shown in Figure 1.
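The stacking described above can be sketched as a simple data flow in Python. The sketch assumes g = tanh and a linear model in motor space for every layer, stores each layer as a hypothetical tuple `(C, c, M, m)`, and covers only the upward propagation of estimates; learning and the top-down connections of the later variants are not included.

```python
import math


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def layer_step(x, C, c, M, m):
    """One controller-model pair: action y = g(C x + c), estimate x_up = M y + m."""
    y = [math.tanh(z + b) for z, b in zip(matvec(C, x), c)]
    x_up = [z + b for z, b in zip(matvec(M, y), m)]
    return y, x_up


def propagate_up(x0, layers):
    """Pass the outer input x0 inward; the estimate of one layer feeds the next."""
    xs, ys = [x0], []
    for C, c, M, m in layers:
        y, x_up = layer_step(xs[-1], C, c, M, m)
        ys.append(y)
        xs.append(x_up)
    return xs, ys


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Usage: three identical, untrained layers with two sensors and two motors.
layers = [(identity(2), [0.0, 0.0], identity(2), [0.0, 0.0]) for _ in range(3)]
xs, ys = propagate_up([0.5, -0.2], layers)
```

Here `xs[0]` is the external input and `xs[1]`, `xs[2]`, ... are the estimates held by successively deeper layers, so the external information is passed inward only through the chain of models.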

Simple Variant
The architecture consists of controllers for each layer $\ell < L$ (there is no controller for $\ell = L$) and of linear models given by Equation (16), which simplifies for the top layer $\ell = L$, where $\tilde{y}_L(t) \equiv 0$, i.e., no higher-level effects are present. Equation (16) also includes the effect of the virtual actions $\tilde{y}_\ell(t)$, $\ell \geq 1$, as follows: First, the previous prediction of a layer, $x_\ell(t-1)$, is copied into the input unit $x_\ell(t)$ at the beginning of the new time step, see Figure 1. The back-propagated input $\check{x}_\ell(t-1)$ that was used in Equations (5) and (6) is no longer needed. From $x_\ell(t)$ a virtual action $y_\ell(t)$ is computed that then contributes additively to the prediction (16). The controller update is here the same as for the one-layer model, and the $\tilde{M}$ matrix (not shown in the figures) is updated in the same way as the $M$ matrix.

Main Variant
The variant with extra connections (Figure 1) has for the controller $y_\ell(t) = g\big(C_{\ell+1} x_\ell(t) + \tilde{C}_{\ell+1} \hat{x}_{\ell+1}(t-1) + c_{\ell+1}\big)$ (17), i.e., in the same way as the new input $x_0(t+1)$ that is used to calculate the prediction error is also used in the next time step as input $x_0(t)$, we also use previous predictions as new virtual input for $\ell > 0$. For the deepest layer $\ell = L$, Equation (17) is not applied, and for the penultimate layer we have simply $y_\ell(t) = C_{\ell+1}(x_\ell(t)) = g(C_{\ell+1} x_\ell(t) + c_{\ell+1})$ (18).
For the model, Equation (16) is used as above.
While the first matrix $C$ in Equation (17) is learned in the standard way (see Equations 11 and 12), the matrix $\tilde{C}$ is updated by gradient descent with respect to the prediction error for the action, i.e., the input $\hat{x}_{\ell+1}(t-1)$ from the more inner level is used to predict the motor output $y_\ell(t)$. The update equations for $\tilde{C}$ are similar to Equations (13) and (14), but also contain a derivative of $g$. Note that no loops are present in the network of Figure 1, which may not be a problem as the loops have no function (yet) and may be included later. However, it is not clear what "deliberation" could mean without these loops. We assume that the inner (deeper) layers are updated first. The deepest layer $\ell = L$ has no variables, just the controller and the model. According to Equation (18), no higher-level input variables are needed in order to update the variables at $\ell = L-1$. In this way, virtual actions and virtual inputs are available to be used in Equations (17) and (16) to update the next layer toward the outer side, i.e., with lower $\ell$. For the update of the matrices $M$, $\tilde{M}$, $C$, and $\tilde{C}$ the time order is not essential, provided the variables are calculated as described above.

Main Variant With Deep Associations
As a further variant, which is, however, not implemented in the present simulations, a standard deep neural network can be employed to connect the inputs $x_\ell$ directly between neighboring levels. In this case a separate set of connections $P_\ell$ would be learned for the map from $x_{\ell-1}$ to $x_\ell$. The weights $P$ are learned from the activations $x_\ell$ that arise due to the activity of the network. In addition, it is possible to add a further set of connections $R$ that play the same role as $P$, but for the predicted sensor values.
The network can sustain persistent activity that represents an action-perception cycle. Activity in the subnetworks that are completed by recurrent connections arises by self-amplification of noise or spurious activity following the homeokinetic learning of the respective controller. It may be possible to use the cycles more explicitly for learning as well, but we want to restrict ourselves here to a one-step learning rule, i.e., gradients are calculated only over one time step. The full model also includes perceptual pathways consisting of bridges between input-related units. In this way the network activity becomes shaped by standard deep feed-forward networks.

Active Response by the Recurrent Network
As a first test, we have considered the simple variant of the architecture (see section 3.2) when it is driven with a sinusoidal input and the "world" simply reproduces a noisy version of the motor action as the next input to the robot. Typical results are shown in Figure 2 for two combinations of the learning rates $\varepsilon_C$ (11, 12) and $\varepsilon_M$ (13, 14), which lead either to an abstracted reproduction of the input in the deeper layers or to a self-organization of activity that, however, remains without effect in this simple variant. At lower learning rates (left column), even deeper layers respond to the original input. In this case, the activities of the internal layers are square-wave-like versions of the original input. For larger learning rates (right column), the internal layers have a different response. The fifth row shows a combination of homeokinetic adaptation (the red line between 310 and 320 s) and noisy output while still following the input from the first layer. Deeper layers (lower rows) tend to show a decay in the generation of motor action, which is attributed to the squashing function.
FIGURE 2 | Activity evolution in a perceptually connected network structure according to the model in section 3.2. The sensory trajectory is shown by the solid line (red) and the intermediate motor action by the dashed line (green). The top row gives the input activity, the second row the activity of the first layer, and the following rows show every 10th layer of the architecture to a total depth of 50. The left panel is for learning rates $\varepsilon_M = 0.01$, $\varepsilon_C = 0.05$, and the right one for $\varepsilon_M = 0.1$, $\varepsilon_C = 0.2$. While at low learning rates the input is similar across all layers, for larger ratios $\varepsilon_M / \varepsilon_C$ the model is more flexible and the deeper activity becomes largely independent of the input, which allows for self-organized activity in the deeper layers that is not immediately affecting the outside world.
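The two effects attributed to the squashing function, square-wave-like activity and decay in deeper layers, can be reproduced in isolation with a small Python sketch. It is not the simulation behind Figure 2: it assumes g = tanh and replaces the learned controller and model of every layer by one fixed scalar `gain` (a hypothetical simplification), so that only the repeated squashing remains.

```python
import math


def squash_through_layers(signal, gain, depth):
    """Apply x -> tanh(gain * x) once per layer; return the signal at every layer."""
    layers = [list(signal)]
    for _ in range(depth):
        layers.append([math.tanh(gain * x) for x in layers[-1]])
    return layers


def crest_factor(signal):
    """Peak divided by RMS: about 1.414 for a sine, 1.0 for an ideal square wave."""
    peak = max(abs(x) for x in signal)
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return peak / rms


# One period of a sinusoid, sampled between the zero crossings.
sine = [math.sin(2 * math.pi * (k + 0.5) / 200) for k in range(200)]

squared = squash_through_layers(sine, gain=2.0, depth=10)  # effective gain > 1
decayed = squash_through_layers(sine, gain=0.5, depth=10)  # effective gain < 1

print("crest factor of the input:", round(crest_factor(sine), 3))
print("crest factor at layer 10 (gain 2.0):", round(crest_factor(squared[-1]), 3))
print("peak amplitude at layer 10 (gain 0.5):", max(abs(x) for x in decayed[-1]))
```

With an effective gain above one, the crest factor approaches that of a square wave with increasing depth, whereas with a gain below one the amplitude shrinks from layer to layer. In the full model the effective gain is not fixed but results from learning.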

A Wheeled Robot in the Hills
The main variant (section 3.3) is used in an exploration task, where a four-wheeled robot is expected to cover a large portion of an unknown territory (Smith and Herrmann, 2019). The hilly landscape shown on the left in Figure 3 can be scaled in the vertical direction such that different levels of difficulty can be achieved, ranging from flat ground (level 0) to slopes that require maximal motor power (level 1) and that can cause instabilities and thus large prediction errors (4). In a five-layer DIAMOND model the activity decays for a flat arena, as the inner layers are not needed, whereas for a hilly landscape (difficulty level > 0) the inner layers did not show much attenuation. The behavior of the robot is evaluated based on a 10 × 10 grid overlaid on the square-shaped arena. The number of visited grid cells is averaged over five runs for each difficulty and each controller depth and represented as a coverage rate. The total coverage was in all cases below 50%, such that the increase of the coverage with time was nearly linear.
Whereas a single layer can achieve a similar performance across all terrain difficulties, the higher layers become more and more engaged as the difficulty of the task increases. They take advantage of the increased errors caused by the terrain, which thus provide a potential for a more comprehensive coverage of the arena per time unit.
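The coverage measure used in this experiment can be sketched in Python as follows. The sketch assumes that a run is available as a list of (x, y) positions inside a square arena with its origin at one corner; the function names and the trajectory format are hypothetical and not taken from the published code.

```python
def coverage_rate(trajectory, arena_size, grid=10):
    """Fraction of grid cells visited by a trajectory of (x, y) positions.

    Positions are assumed to lie in [0, arena_size] in both coordinates;
    points on or beyond the border are assigned to the nearest border cell.
    """
    visited = set()
    for x, y in trajectory:
        col = min(max(int(x / arena_size * grid), 0), grid - 1)
        row = min(max(int(y / arena_size * grid), 0), grid - 1)
        visited.add((row, col))
    return len(visited) / (grid * grid)


def mean_coverage(runs, arena_size, grid=10):
    """Average coverage rate over several runs (five per condition in the text)."""
    rates = [coverage_rate(run, arena_size, grid) for run in runs]
    return sum(rates) / len(rates)


# Usage: a run along the diagonal of a 10 x 10 arena visits 10 of 100 cells.
diagonal = [(i + 0.5, i + 0.5) for i in range(10)]
rate = coverage_rate(diagonal, arena_size=10.0)
```

Counting each cell only once makes the measure insensitive to how long the robot stays in a cell, which fits the use of coverage per time unit as an indicator of exploration.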

A Spherical Robot in a Polygonal Arena
Finally, we studied a simulated spherical robot which is controlled by three masses that are movable along internal axes, see Figure 4, left. The robot explores freely in a polygonal environment which was chosen to discourage circular movement along the wall. The controller quickly picks up a suitable rhythm of the internal weights that is effective in moving the robot in any direction. Collisions with the wall usually stop the robot until the emergence of a different mode of the movements of the internal weights moves the robot in a different direction. Although a more systematic study is yet to be performed, it is already apparent that adding a small number of additional layers increases the behavioral repertoire of the robot and reduces the duration of wall collisions and the time until behavior re-emerges in the robot. The example is also meant to demonstrate that the applications of the learning rule and architecture go beyond the exploration of a planar arena and can be used to generate and to organize elementary robotic behaviors.

DISCUSSION
The numerical results seem to imply that a few layers are sufficient, i.e., a larger number of layers does not lead to further improvements or may require a much longer learning time than attempted here. It should, however, be considered that the tasks and environments are all very simple, such that it is not possible to generalize this observation to more complex situations. It can nevertheless be expected that the spontaneous internal activations that were observed for suitable learning rate ratios lead to a learning time that increases approximately linearly with the number of layers, and not much worse. This is suggested by earlier results with the homeokinetic learning rule (Martius et al., 2007).

FIGURE 3 | The panel on the right shows results for five levels of difficulty (linear scaling of the slopes, with the simplest level being flat ground) and five depths of the network (ℓ = 1, 2, 3, 4, 5), showing an increased exploration capability. The code for the simulator (Der and Martius, 2012) and the DIAMOND controller architecture described here is available at https://github.com/artificialsimon/diamond.
The present model is a representation of the idea (see e.g., Anderson et al., 2012) that it is difficult to define a clear boundary between brain and body or even between body and world. At all layers the system follows the same principles in its adaptation of the actions onto lower layers and in the learning of a model that affects higher layers. The reduction of complexity of the internal dynamics toward higher layers is counterbalanced by the autonomous activity such that the main eigenvalue at each layer will hover near unity (Saxe et al., 2014).
Although the activity is updated here in parallel in all layers, the stacked structure is clearly similar to the subsumption architecture (Brooks, 1986) as it allows for shorter or longer processing loops. It remains to be studied whether more general architectures are beneficial, especially when more complex tasks are considered.
In Figure 1, it is understood that the dynamical variables ($x$, $y$, and $\hat{x}$) each exist in two instances, one updated by the controlling and predictive pathways, the other by the feedback within the re-estimation system. The need to disambiguate these units points to an interesting parallel to the roles of the layers of the mammalian cortex.
Finally, it should be remarked that the principle of predictive coding is inherent in the architecture through the homeokinetic principle. Activity can only travel to the deeper layers if it is not already predicted by the internal model of the current layer. In some cases this can lead to a complete decay of the activity in the deeper layers (see Figure 3), although more complex robots and more challenging environments need to be studied in order to precisely identify parallels to the predictive coding principle in natural neural systems.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/supplementary material.