Linear combination of one-step predictive information with an external reward in an episodic policy gradient setting: a critical analysis

One of the main challenges in the field of embodied artificial intelligence is the open-ended autonomous learning of complex behaviors. Our approach is to use task-independent, information-driven intrinsic motivation to support task-dependent learning. The work presented here is a preliminary step in which we investigate the predictive information (the mutual information of the past and future of the sensor stream) as an intrinsic drive, ideally supporting any kind of task acquisition. Previous experiments have shown that the predictive information (PI) is a good candidate to support autonomous, open-ended learning of complex behaviors, because maximizing the PI corresponds to exploring morphology- and environment-dependent behavioral regularities. The idea is that these regularities can then be exploited in order to solve any given task. Three different experiments are presented, and their results lead to the conclusion that the linear combination of the one-step PI with an external reward function is not generally recommended in an episodic policy gradient setting. Only for hard tasks can a large speed-up be achieved, and then at the cost of a loss in asymptotic performance.


Introduction
One of the main challenges in the field of embodied artificial intelligence (EAI) is the open-ended autonomous learning of complex behaviours. Our approach is to use task-independent, information-driven intrinsic motivation to support task-dependent learning in the context of reinforcement learning (RL) and EAI. The work presented here is a first step in this direction. RL is of growing importance in the field of EAI, mainly for two reasons. First, it allows the behaviours of high-dimensional and complex systems to be learned with simple objective functions. Second, it has a well-established theoretical [Sutton and Barto, 1998, Bellman, 2003] and biological foundation [Dayan and Balleine, 2002]. In the context of EAI, where the agent has a morphology and is situated in an environment, the concepts of the agent's intrinsic and extrinsic perspective arise naturally. As a direct consequence, several questions about intrinsic and extrinsic reward functions, denoted by IRF and ERF, follow from the EAI point of view. The questions that are of interest to us are: what distinguishes an IRF from an ERF, what is a good candidate for a first-principled IRF, and finally, how should IRFs and ERFs be combined?
The first question, how to distinguish between an IRF and an ERF, is addressed in the second section of this work, which starts with the conceptual framework of the sensorimotor loop and its representation as a causal graph. This leads to a natural distinction of variables that are intrinsic and extrinsic to the agent. We define an IRF that models an internal drive or motivation as a task-independent function which operates on the agent's intrinsic variables only. In general, an ERF is a task-dependent function that may operate on intrinsic and extrinsic variables.
The main focus of this work is the second question, which deals with finding a first-principled IRF. We propose the predictive information (PI) [Bialek et al., 2001] for the following reasons. Information-driven self-organisation, by means of maximising the one-step approximation of the PI, has been shown to produce coordinated behaviour among physically coupled but otherwise independent agents [Zahedi et al., 2010, Ay et al., 2008]. The reason is that the PI inherently addresses two important issues of self-organised adaptation, as the following equation shows: $I(S_t; S_{t+1}) = H(S_{t+1}) - H(S_{t+1} \mid S_t)$, where $S_t$ is the vector of intrinsically accessible sensor values at time t. The first term leads to a diversity of the behaviour, as every possible sensor state must be visited with equal probability. The second term ensures that the behaviour is compliant with the constraints given by the environment and the morphology.
There are three main reasons why we prefer to experiment with embodied agents (EA). First, scalability: EA are high-dimensional systems which live in a continuous world. Hence, the algorithms face the curse of dimensionality if they are evaluated on different EAs. Second, validation: we are interested in understanding natural cognitive systems by means of building artificial agents [Brooks, 1991]. Using EA ensures that the models are validated against the same (or similar) physical constraints that natural systems are exposed to. Third, guidance: there is good evidence that the constraints posed by the morphology and environment can be used to reduce the required controller complexity, and hence, reduce the size of the search space for a learning algorithm [Zahedi et al., 2010, Pfeifer and Bongard, 2006]. Consequently, understanding the interplay between the body, brain and environment, also called the sensorimotor loop (SML, see Fig. 1), is a general focus of our work. The next paragraph will introduce the general concept of the SML and discuss its representation as a causal graph.
A cognitive system consists of a brain or controller that sends signals to the system's actuators, which then affect the system's environment. We prefer the notion of the system's Umwelt [von Uexkuell, 1934, Clark, 1996, Zahedi et al., 2010, Zahedi and Ay, 2013], which is the part of the system's environment that can be affected by the system, and which itself affects the system. The state of the actuators and the Umwelt are not directly accessible to the cognitive system, but the loop is closed as information about both the Umwelt and the actuators is provided to the controller by the system's sensors. In addition to this general concept, which is widely used in the EAI community [see e.g. Pfeifer et al., 2007], we introduce the notion of world to the sensorimotor loop, and by that we mean the system's morphology and the system's Umwelt. We can now distinguish between the agent's intrinsic and extrinsic perspective in this context. The world is everything that is extrinsic from the perspective of the cognitive system, whereas the controller, sensor and actuator signals are intrinsic to the system.
The distinction between intrinsic and extrinsic is also captured in the representation of the sensorimotor loop as a causal or Bayesian graph (see Fig. 1, right-hand side). The random variables C, A, W, and S refer to the controller state, actuator signals, world and sensor signals, and the directed edges reflect causal dependencies between the random variables (see [Klyubin et al., 2004, Ay and Polani, 2008, Zahedi et al., 2010]). Everything that is extrinsic to the system is captured in the variable W, whereas S, C, and A are intrinsic to the system.
In this context, we distinguish between internal and external reward functions (IRF, ERF) in the following way. An ERF may access any variable, especially those that are not available to an agent through its sensors, i.e. anything that we summarised as the world state W. An IRF may access intrinsically available information only ($S_t$, $A_t$, $C_t$, see Fig. 1). We are interested in a first-principled model of an intrinsic motivation, i.e. a model that requires as few assumptions as possible. The idea is that an IRF should not depend on a specific task but rather be a task-independent internal driving force, which supports any task-dependent learning. This is why we refer to it as task-independent internal motivation or drive. This closes the discussion of embodied agents and their formalisation in terms of the sensorimotor loop. The next section describes the information-theoretic measures that are used in the remainder of this work.

Predictive Information
The predictive information (PI) [Bialek et al., 2001], which is also known as excess entropy [Crutchfield and Young, 1989] and effective measure complexity [Grassberger, 1986], is defined as the mutual information of the entire past and future of the sensor data stream:
$$I(S_p; S_f), \tag{1}$$
where $S_p = \{S_1, S_2, \ldots, S_t\}$ is the entire past of the system's sensor data at some time $t \in \mathbb{N}$ and $S_f = \{S_{t+1}, S_{t+2}, \ldots\}$ its entire future. The PI captures how much information the past carries about the future. Unfortunately, it cannot be calculated for most applications for technical reasons. Hence, we use the one-step PI, which is given by
$$I(S_t; S_{t+1}) = H(S_{t+1}) - H(S_{t+1} \mid S_t), \tag{2}$$
which was previously investigated in the context of EAI [Ay et al., 2008] and as a first-principle learning rule [Zahedi et al., 2010, Martius et al., 2013]. A different motivation for the PI is based on maximising the mutual information of an intention state St, which is internally generated by the agent, and the next sensor state $S_{t+1}$ [Ay and Zahedi, 2013].
Equation (2) shows how maximising the PI affects the behaviour of a system. The first term in Equation (2) leads to a maximisation of the entropy over the sensor states. This means that the agent has to explore its world in order to sense every state with equal probability. The second term in Equation (2) states that the uncertainty of the next sensor state must be minimal if the current sensor state is known. This means that an agent has to choose actions which lead to predictable next sensor states. This can be rephrased in the following way. Maximising the entropy $H(S_{t+1})$ increases the diversity of the behaviour, whereas minimising the conditional entropy $H(S_{t+1} \mid S_t)$ increases the compliance of the behaviour. The result is a system that explores its sensor space to find as many regularities in its behaviour as possible.
For completeness, we will also maximise the entropy $H(S_t)$ only and compare the results to the maximisation of the PI. This concludes the presentation of the PI (and entropy) as a model for a task-independent internal motivation in the context of RL. The next section presents the utilised RL algorithm.
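As an illustration of Equation (2), the following is a minimal sketch of a plug-in estimate of the one-step PI and the entropy for a single, scalar sensor stream. The function names, the number of bins, and the value range are hypothetical choices made for this example only; the text does not specify the estimator that was actually used.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a histogram given as an array of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def one_step_pi(series, n_bins=10, value_range=(-1.0, 1.0)):
    """Plug-in estimates of I(S_t; S_{t+1}) and H(S_{t+1}) for a 1-D series.

    n_bins and value_range are assumptions of this sketch.
    """
    edges = np.linspace(value_range[0], value_range[1], n_bins + 1)
    # Map every sample to a bin index in 0..n_bins-1 (interior edges only).
    s = np.clip(np.digitize(series, edges[1:-1]), 0, n_bins - 1)
    past, future = s[:-1], s[1:]
    # Joint histogram of consecutive sensor states.
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (past, future), 1)
    h_future = entropy(joint.sum(axis=0))   # H(S_{t+1})
    h_past = entropy(joint.sum(axis=1))     # H(S_t)
    h_cond = entropy(joint.ravel()) - h_past  # H(S_{t+1} | S_t)
    return h_future - h_cond, h_future
```

For a strictly alternating two-state series the conditional entropy vanishes and the PI equals the entropy, whereas a constant series yields zero for both, which matches the reading of the two terms given above.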

Policy Gradients with Parameter-Based Exploration (PGPE)
We chose an episodic RL method named PGPE [Sehnke et al., 2010] to investigate the effect of the PI as an IRF, because it is not restricted to a specific class of policies. Any policy that can be represented by a vector $\mu \in \mathbb{R}^n$ with fixed length $n \in \mathbb{N}^+$ can be optimised by this method. In the work presented here, we use it only to learn the synaptic strengths and bias values of neural networks with fixed structures. Nevertheless, the framework can be applied to other parametrisations, in particular to stochastic policies, which is why PGPE attracted our attention for the ongoing project in which this work is embedded.
The algorithm can be summarised in the following way (for details, see [Sehnke et al., 2010]). In each roll-out or episode, two policy instances are drawn from µ by adding and subtracting a random vector $\epsilon \sim \mathcal{N}(0, \sigma)$. The resulting two policy parametrisations $\Theta^+ = \mu + \epsilon$ and $\Theta^- = \mu - \epsilon$ are then evaluated, and their final rewards $r^+$, $r^-$ are used to determine the modifications of µ and σ according to the update equations (Eqs. (3) to (6)). Roll-outs can be repeated several times before a learning step is performed. Every learning step concludes a batch. PGPE requires an initial $\mu_{\mathrm{init}}$, an initial $\sigma_{\mathrm{init}}$, a learning rate α, a baseline b, a baseline adaptation parameter δ, and an initialised maximal reward $m = m_{\mathrm{init}}$. We have set δ to the recommended value of 0.1 and $\mu_{\mathrm{init}} = 0$, and we have achieved the best results in all experiments by setting $m_{\mathrm{init}}$ small enough that m is definitely overwritten in the first roll-out (see Eq. (3)). The other parameters were tuned in each experiment for the best results without an IRF and then fixed for the remaining experiments.
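The procedure summarised above can be sketched as follows. This is not a transcription of Eqs. (3) to (6); it follows the commonly published form of symmetric-sampling PGPE with reward normalisation, and three details are assumptions of this sketch: the baseline is a moving average with rate δ, σ is only updated when the sampled pair beats the baseline, and separate learning rates `alpha_mu` and `alpha_sigma` are used. All function and parameter names are hypothetical.

```python
import numpy as np

def pgpe_step(mu, sigma, b, m, reward_fn, rng,
              alpha_mu=0.2, alpha_sigma=0.1, delta=0.1):
    """One symmetric-sampling PGPE update with a single roll-out pair."""
    eps = rng.normal(0.0, sigma)          # perturbation, eps_i ~ N(0, sigma_i)
    r_plus = reward_fn(mu + eps)          # reward of Theta+ = mu + eps
    r_minus = reward_fn(mu - eps)         # reward of Theta- = mu - eps
    m = max(m, r_plus, r_minus)           # running maximal reward
    r_mean = 0.5 * (r_plus + r_minus)

    # Mean update, normalised by the distance of both rewards to m.
    denom = 2.0 * m - r_plus - r_minus
    if denom > 0.0:
        mu = mu + alpha_mu * eps * (r_plus - r_minus) / denom

    # Sigma update (assumption: applied only if the pair beats the baseline).
    if r_mean > b:
        scale = (r_mean - b) / (m - b)
        sigma = sigma + alpha_sigma * scale * (eps**2 - sigma**2) / sigma

    # Assumed moving-average baseline with adaptation parameter delta.
    b = (1.0 - delta) * b + delta * r_mean
    return mu, sigma, b, m
```

Starting with `m = -np.inf` reproduces the setting described above, in which the maximal reward is overwritten in the first roll-out.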

Results
This section presents three different experiments and their results. The first experiment is the cart-pole swing-up, a standard control theory problem that is also widely used in machine learning [Barto et al., 1983, Geva and Sitte, 1993, Doya, 2000, Pasemann et al., 1999]. The cart-pole experiment is also chosen because balancing a pole minimises the entropy, and hence, it contradicts the maximisation of the PI. The second experiment is the learning of a locomotion behaviour for a hexapod, and it was chosen to demonstrate the effect of the PI maximisation on a more common, well-structured experimental setting. By well-structured we mean that the controller, morphology, environment, and ERF are chosen such that they result in a good hexapod locomotion in only a few policy updates, without any additional support by an IRF. The third experiment is designed to be challenging, as it combines a high-dimensional system, an unconventional control structure, and an unsteady ERF with an unsteady environment. We believe that these three experiments span a broad range of possible applications for information-theoretic IRFs in the context of episodic RL.

Cart-Pole Swing-Up
The cart-pole swing-up experiment is ideal to investigate the effect of the PI on an episodic RL task, mainly for two reasons. First, the experiment is well-defined by a set of equations and parameters that are widely used in the literature [Barto et al., 1983, Geva and Sitte, 1993, Doya, 2000, Pasemann et al., 1999]. This ensures that the results are comparable and reproducible by others with little effort. Second, the successful execution of the task contradicts the maximisation of the PI. The task is to balance the pole in the centre of the environment, and hence, to minimise the entropy of the sensor states. The maximisation of the PI demands a maximisation of the entropy (see Eq. (2)).
The experiment was conducted by implementing the equations that can be found in [Barto et al., 1983, Geva and Sitte, 1993, Doya, 2000]. The state of the cart-pole is given by $x, \dot{x}, \vartheta, \dot{\vartheta}$, which are the position of the cart, the speed of the cart, the pole angle and the pole's angular velocity. The cart is controlled by a force $F \in [-10\,\mathrm{N}, 10\,\mathrm{N}]$ that is applied to its centre of mass. The four state variables and the force define the input and output configuration of our controllers for this task. The initial controller (see Fig. 2A) was chosen from [Pasemann et al., 1999], where network structures were evolved for the same task. To ensure that the evolved structure is not especially unsuitable for RL, different variations were chosen for evaluation too (see Fig. 2B-D). In this approach, the input neurons are simple buffer neurons with the identity as transfer function, whereas all other neurons use the hyperbolic tangent transfer function.
The evaluation time was set to T = 2000 iterations, which corresponds to 20 seconds (cf. [Doya, 2000]). Different values, starting from the values proposed in [Sehnke et al., 2010], for the learning rate $\alpha \in \{0.1, 0.2, 0.5\}$, the initial variation $\sigma_{\mathrm{init}} \in \{2, 5\}$, and the initial maximal reward $m_{\mathrm{init}} \in \{-\infty, 10, 100, 1000\}$ were evaluated in experiments without applying an IRF to the learning of the task. The underlined values showed the best results, and hence, are chosen for presentation here. Each experiment consisted of B = 10000 batches, i.e. updates of µ and σ (see Eqs. (5) and (6)) with two roll-outs each (i.e. four evaluated policies $\Theta^{+,-}_{1,2}$). The results are obtained by conducting every experiment 100 times. To ensure comparability among the experiments with different parameters and controllers, the random number generator was initialised from a fixed set of 100 integer values for each experiment.
The presentation of the reward function is split into two parts. The first part handles the ERF, whereas the second part handles the IRF. We use the terms intrinsic/internal and extrinsic/external with respect to the agent's perspective, as discussed in the previous section (see Sec. 2.1). The controller has access to the full state of the system, and hence, the separation into internal and external is artificial in this case. Nevertheless, we keep this terminology for consistency, as the next experiments will reflect this distinction in a natural way. We denote the IRF by $R_{\mathrm{in}}$ and the ERF by $R_{\mathrm{ex}}$, where a superscript is added to distinguish between the different reward functions (PI and entropy).
The ERF for the cart-pole swing-up task is defined such that it is not a smooth gradient in the reward space, and therefore, does not directly guide the learning process. The controller is only rewarded if the pole is pointing upwards, and the reward is scaled with the distance of the pole to the centre of the environment.
The IRF is calculated at the end of each episode based on the recordings of the pole angles $\{S_t = \vartheta(t) \mid t = 1, 2, \ldots, T\}$. We use a discrete-valued computation of the PI, and hence, the data is binned prior to the calculation. All IRFs are normalised with respect to their theoretical upper bound of $I(S_{t+1}; S_t) \le H(S_t) \le \log |S|$ (see [Cover and Thomas, 2006]). This leads to two IRFs, $R^{\mathrm{PI}}_{\mathrm{in}}$ and $R^{H}_{\mathrm{in}}$, for the normalised PI and the normalised entropy. The overall reward functions linearly combine the ERF with the respective IRF, where β(γ) is a factor to scale the IRF with respect to the maximal possible value of the ERF. This allows us to compare the effects of $R^{\mathrm{PI}}_{\mathrm{in}}$ and $R^{H}_{\mathrm{in}}$ across different experiments.
The results are discussed in detail only for the fully connected feed-forward network (see Fig. 3A-D), as this controller shows the most distinguishable results with respect to the variation of the IRF scaling parameter $\gamma \in \{0\%, 1.25\%, 2.5\%, 3.75\%, 5\%\}$. It is important to note that the plots only show the averages of the 100 experiments and not the standard deviation, for the following reason. A few controllers succeed early, others later during the process. Due to the unsteady ERF the resulting standard deviation is very large, as those controllers that succeed receive a significantly higher reward compared to those not succeeding (which remain close to zero, as a rotational behaviour is not permitted). We intentionally chose an unsteady ERF that returns zero for almost all states, and hence, we know beforehand that the standard deviation is large and that plotting it provides no further information.
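The normalisation of the IRF and its linear combination with the ERF, as described above, can be sketched as follows. The exact form of β(γ) is not recoverable from the text; this sketch assumes β(γ) = γ · max ERF, so that γ is the maximal IRF contribution as a fraction of the maximal ERF value. The function names and the parameter `r_ex_max` are hypothetical.

```python
import math

def normalised_irf(value, n_bins):
    """Scale an entropy or PI estimate (in nats) to [0, 1] by its bound log|S|."""
    return value / math.log(n_bins)

def combined_reward(r_ex, r_in, gamma, r_ex_max):
    """Linear combination R = R_ex + beta(gamma) * R_in.

    Assumption of this sketch: beta(gamma) = gamma * r_ex_max.
    """
    beta = gamma * r_ex_max
    return r_ex + beta * r_in
```

With γ = 5% and a maximal ERF of 100, a fully saturated IRF would add 5 to the overall reward under this assumption.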
Figures 3A and 3B show the progress of the ERF $R^{\mathrm{PI}}_{\mathrm{ex}}$ and IRF $R^{\mathrm{PI}}_{\mathrm{in}}$ for the PI maximisation. There is a considerable speed-up in learning during the first 4000 batches for all γ > 0% (see Fig. 3A). At this point in time the average ERF of γ = 0% exceeds that of γ = 5%. After approximately 5000 batches the ERFs for γ = 2.5% and γ = 3.75% are very close to or slightly exceeded by the ERF for γ = 0%, whereas the ERF for γ = 1.25% remains higher. The conclusion from this experiment is that small values of γ < 5% are beneficial in this learning task, as fewer batches are required to solve the task and the asymptotic learning performances are almost identical to γ = 0%. The results, however, are not significant and the choice of γ is critical. This leads to the conclusion that the one-step PI is not significantly beneficial in the learning of this task.
Figures 3C and 3D show the progress of the ERF $R^{H}_{\mathrm{ex}}$ and IRF $R^{H}_{\mathrm{in}}$ for the entropy maximisation. The results show a different picture. Any parameter γ > 0% speeds up the learning and improves the overall performance. The comparison of entropy and PI is addressed again in the discussion.

Hexapod Locomotion
If a specific task should be learned by an embodied agent, it is more common to choose an environment, morphology, control structure and a smooth ERF which are well-suited for the desired task. In order to investigate which effect the PI has on such a well-defined learning task, the set-up of the experiment presented in this section is chosen such that all components are known to work well if there is no IRF present. The goal is to learn a locomotion behaviour of a hexapod, where the maximal deviation angles ensure that it cannot flip over. The controller is known to perform well in a similar task [Markelić and Zahedi, 2007] and its modularity significantly reduces the number of parameters that must be learned. The ERF defines a smooth gradient in the reward space, ensuring that small changes in the controller parameters show an immediate effect in the ERF. The environment is an even plane without any obstacles.
The experimental platform (see Fig. 4) is a hexapod with 12 degrees of freedom (two actuators in each leg) and 18 sensors (angular positions of the actuators and binary foot contact sensors). The two actuators of each leg are positioned in the shoulder (Thorax-Coxa or ThC joint) and in the knee (Femur-Tibia or FTi joint) of the walking machine, similar to the morphology presented in [von Twickel et al., 2011]. We omit the second shoulder joint (CTr) because it is not required for locomotion. Each joint accepts the desired angular position as its input and returns the actual current angular position as its output. The simulator YARS [Zahedi et al., 2008] was used for all experiments conducted in this section.
Different values for the PGPE parameters were evaluated. The best results for γ = 0 (see Eq. (9)) were achieved with $\sigma_{\mathrm{init}} = 2$ and α = 0.1. To ensure comparability with the previous experiment, two roll-outs were chosen here, although this is not required to obtain the following results. The evaluation time was set to T = 1000, and B = 250 batches were sufficient to observe a convergence of the policy parameters µ. The values for γ were chosen from the previous experiment.
The ERF is calculated once at the end of each episode, and it is defined as the Euclidean distance between the hexapod at time T and its initial position (0, 0), projected onto the xy-plane:
$$R_{\mathrm{ex}} := \sqrt{x_T^2 + y_T^2},$$
where $(x_T, y_T)$ are the coordinates of the centre of the robot in world coordinates at time t = T.
The IRF is calculated differently compared to the previous experiment. In a high-dimensional system such as the hexapod, it is not possible to compute the PI of the entire system with reasonable effort, as the computational cost of $I(S_t; S_{t+1})$ grows exponentially with the number of sensors. It would be natural to reduce the computational cost by calculating the PI based on a model of the morphology, but this would violate our claim that the PI incorporates the morphology without the need to model it explicitly. Hence, we decided to use the following method to approximate the PI and the entropy H (see Fig. 4D). Let $S_i(t)$, $i = 1, 2, \ldots, 12$, be the angular position sensors for the 12 actuators. We then chose two sensors k, l with $1 \le k, l \le 12$, $k \ne l$, randomly from the 12 possible sensors, and calculated the PI and the entropy of this pair, denoted by $PI_u$ and $H_u$. The overall PI and entropy are then calculated as the sum of n randomly chosen $PI_u$ and $H_u$ pairings, with the additional constraint that each sensor pair k, l appears only once in the approximations. The resulting IRFs $R^{\mathrm{PI}}_{\mathrm{in}}$ and $R^{H}_{\mathrm{in}}$ are computed from these sums, where n is the number of pairings. For n > 20 no difference was found for the approximated PI, which is why n = 20 was chosen for the remainder of this work.
The overall reward functions again combine the ERF linearly with the respective IRF, where β(γ) is defined as in the cart-pole swing-up experiment (see Eq. (9)).
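The pairwise approximation described above can be sketched as follows. The per-pair formula is not recoverable from the text, so this sketch makes two assumptions: each pair (k, l) is treated as one joint discrete variable whose one-step PI is estimated with a plug-in estimator, and the sum over pairs is divided by n. The sensor data is assumed to be binned already, and all function names are hypothetical.

```python
import itertools
import numpy as np

def entropy(labels):
    """Plug-in Shannon entropy (nats) of a 1-D array of integer labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def pair_pi(sk, sl, n_bins):
    """One-step PI of the joint variable (S_k, S_l), given binned series."""
    joint = sk * n_bins + sl              # encode the sensor pair as one symbol
    past, future = joint[:-1], joint[1:]
    both = past * n_bins**2 + future      # encode (past, future) as one symbol
    # I(past; future) = H(past) + H(future) - H(past, future)
    return entropy(past) + entropy(future) - entropy(both)

def approx_pi(binned, n_pairs, n_bins, rng):
    """Mean pairwise PI over n_pairs distinct, randomly chosen sensor pairs.

    binned: integer array of shape (T, n_sensors), values in 0..n_bins-1.
    """
    pairs = list(itertools.combinations(range(binned.shape[1]), 2))
    # Sampling without replacement ensures every pair appears only once.
    chosen = rng.choice(len(pairs), size=n_pairs, replace=False)
    total = sum(pair_pi(binned[:, pairs[i][0]], binned[:, pairs[i][1]], n_bins)
                for i in chosen)
    return total / n_pairs
```

With 12 sensors there are 66 distinct pairs, so n = 20 pairings can always be drawn without repetition.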
A common recurrent neural network central pattern generator layout is chosen, which can also be found in the literature [e.g. Campos et al., 2010, von Twickel et al., 2011, Markelić and Zahedi, 2007], using the same neuron model as in the cart-pole experiment (see above). As all legs in the hexapod are morphologically equivalent, only the synaptic weights of one leg controller are open to parameter adaptation in the PGPE algorithm. The values are then copied to the other leg controllers. This reduces the number of parameters for the entire controller to 32 (see Figs. 4B and 4C).
The results (see Fig. 5) show that neither the PI nor the entropy has a noticeable effect on the learning performance. The mean values of the 100 experiments for each parameter as well as the standard deviations are almost identical. This point will be addressed in the discussion of this work (see Sec. 4).

Hexapod Self-Rescue
The third experiment is designed to combine and extend the two previous experiments. It combines them in that a high-dimensional morphology, similar to that used in the locomotion experiment, is trained with an unsteady ERF, which is similar to that used in the cart-pole experiment. It extends the previous experiments in that the number of parameters in the controller is an order of magnitude larger and an unconventional control structure is used for the desired task. The most distinctive difference to the previous experiments is the non-trivial environment. The next paragraphs will explain the experimental set-up in detail before the section closes with a discussion of the results.
We used the simulated hexapod robot of the LpzRobots simulator [Martius et al., 2012]. The hexapod has 12 active and 16 passive degrees of freedom (see Fig. 6). The active joints take the desired next angular position as their input and deliver the current actual angular position as their output. The controller is a fully connected one-layer feed-forward neural network without lateral connections and with the hyperbolic tangent transfer function $a_{t+1} = \tanh(W s_t + v)$, where $a_{t+1}$ and $s_t$ are the next action and the current sensor values, W is the connection matrix, and v is the vector of biases. The resulting controller is parameterised by 156 parameters, 144 for the synaptic weights and 12 for the bias values. Note that the controller is generic and has no a priori structuring or other robot-specific details.
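The controller described above can be sketched as a mapping from a flat parameter vector, as optimised by PGPE, to an action. The ordering of weights and biases inside the vector is an assumption of this sketch, and the input dimension of 12 is inferred from the 144 weights and 12 outputs; the function names are hypothetical.

```python
import numpy as np

N_SENSORS = 12    # inferred from 144 weights / 12 actuators
N_ACTUATORS = 12  # one output per active joint

def unpack(theta):
    """Split a flat vector of 156 parameters into W (12x12) and v (12,).

    Assumed layout: all weights first (row-major), then the biases.
    """
    n_weights = N_SENSORS * N_ACTUATORS
    W = theta[:n_weights].reshape(N_ACTUATORS, N_SENSORS)
    v = theta[n_weights:]
    return W, v

def controller(theta, s_t):
    """Next action a_{t+1} = tanh(W s_t + v)."""
    W, v = unpack(theta)
    return np.tanh(W @ s_t + v)
```

Because the output passes through tanh, every desired joint position stays within (-1, 1) regardless of the parameter values.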
The task for the hexapod is to rescue itself from a trap. For this purpose, it is placed in a closed rectangular arena (see Fig. 7). The difficulty of the task is determined by the height of the arena's walls, denoted by $h \in \{0.0\,\mathrm{m}, 0.1\,\mathrm{m}, 0.2\,\mathrm{m}\}$ (see Fig. 6). For comparison, the length of the lower leg (up to the knees) is 0.45 m. The size proportion of the robot and the trap can be seen in Fig. 6B.
The ERF is defined in terms of the radius r of the trap (Fig. 6) and the position $(x_T, y_T)$ of the centre of the robot in world coordinates at the end of a roll-out (t = T). The IRFs and overall reward functions are identical to those used in the previous experiment (see Eqs. (11) and (12)).
As before, the performance of PGPE with γ = 0 was evaluated for different values of $\sigma_{\mathrm{init}}$ and α, and the best are chosen for presentation here, which are $\sigma_{\mathrm{init}} = 2$ and α = 0.5. A different learning rate $\alpha_\sigma = 0.05$ was chosen for the update of σ (see Eq. (6)). Each episode consisted of T = 1250 iterations (25 s) with one roll-out per episode. A total of B = 5000, 7000, and 35000 batches were conducted for the different heights h. We compare the performance for different values of the IRF factor $\gamma \in \{0\%, 0.05\%, 1\%, 5\%, 25\%\}$ and performed 30 experiments for each setting. Figure 7 displays the results. As for the cart-pole experiment, the plots for the PI and entropy in Fig. 7 show a clear picture of an exploration phase (high value) followed by an exploitation phase (lower value).
To compare the results, we set two threshold values at $R_{\mathrm{ex}} = 5$ and $R_{\mathrm{ex}} = 20$, which refer to a 5 m and 20 m distance between the hexapod and the walls of the arena. The first threshold reflects a successful learning of the task, because it means that the hexapod reliably escapes the arena. The second threshold represents the case in which, in addition, a high locomotion speed is achieved after a successful escape. For simplicity of argumentation, we compare two cases, i.e. γ = 0% and γ = 1%. If there is no wall (h = 0 m), the system with IRF γ = 1% requires only half the number of batches compared to no IRF (250 batches vs. 500 batches, see Figs. 7A and 7C). For the arena with a medium height (h = 0.1 m), the speed-up ratio increases to approximately three (350 batches vs. 1100 batches, see Figs. 7E and 7F). The results are decisive for the arena with high walls (h = 0.2 m), as the system with IRF requires about 1000 batches on average compared to the 5000 batches on average that are required by the systems without IRF (see Figs. 7I and 7K).
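The threshold comparison above can be sketched as follows. How the batch counts were read off the learning curves is not stated in the text; this sketch assumes that a threshold counts as reached at the first batch whose averaged ERF is at or above it. The function names are hypothetical.

```python
import numpy as np

def batches_to_threshold(mean_erf, threshold):
    """Index of the first batch whose mean ERF reaches the threshold, or None."""
    hits = np.flatnonzero(np.asarray(mean_erf) >= threshold)
    return int(hits[0]) if hits.size else None

def speed_up(batches_without_irf, batches_with_irf):
    """Ratio of batches required without and with an IRF."""
    return batches_without_irf / batches_with_irf
```

Applied to the batch counts quoted above, the ratios are 500/250 = 2 for h = 0 m, 1100/350 ≈ 3.1 for h = 0.1 m, and 5000/1000 = 5 for h = 0.2 m.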
This leads to the conclusion that both PI and entropy are beneficial if the short-term learning success is of primary interest. However, the asymptotic learning success of those hexapods with IRF is either equal or lower compared to those without an IRF in all experiments. This is valid for the one-step PI and for the entropy. Thus, neither is necessarily beneficial if the long-term, asymptotic learning performance in an episodic policy gradient setting is important.

Discussion
This paper discussed the one-step PI [Bialek et al., 2001] as an information-driven intrinsic reward in the context of an episodic policy gradient method. The reward is considered to be intrinsic because it is task-independent and relies only on the information of the sensors of an agent, which, by definition, represent the agent's intrinsic view of the world. We chose the maximisation of the one-step PI as an IRF because it has been shown to encourage behaviours which show properties of morphological computation without the need to model the morphology [Zahedi et al., 2010].
The IRF was linearly combined with a task-dependent ERF in an episodic RL setting. Specifically, PGPE [Sehnke et al., 2010] was chosen as the RL method, because it allows arbitrary policy parametrisations to be learned. Within this set-up, three different types of experiments were performed. The following paragraphs summarise the results before they are discussed.
The first experiment was the learning of the cart-pole swing-up task. Four controllers were evaluated, of which three were less successful and one showed good results. The ERF was designed to be difficult to maximise without the IRF, and the task contradicted the maximisation of the entropy and PI. The best controller did not show a significant improvement of the learning performance with respect to its asymptotic behaviour. An improvement could only be observed during the first learning steps. Moreover, the choice of the linear combination factor γ is critical. For all controllers a minor and not significant improvement is observable. In the case of the entropy maximisation, any factor γ > 0% showed an improvement in learning speed and learning performance.
A locomotion behaviour was learned for a hexapod in the second experiment. The entire set-up used well-known components for the environment, modular controller, ERF, and morphology, so that the task was solved without an IRF in only a few learning steps. No effect of the PI and entropy was observed.
The third experiment combined the previous two and extended them by a non-trivial environment. A hexapod had to escape from a trap and was only rewarded outside of it. The results showed no significant difference between the PI and the entropy as IRFs. The learning speed was significantly improved by both IRFs with increasing difficulty of the task. The asymptotic performance was either equal or worse when an IRF was introduced.
[Figure 7, caption (partial): For each value of γ the mean and standard deviation of 30 experiments are displayed. In all cases a speed-up in learning is achieved with IRF; however, the asymptotic performance is worse.]
The hexapod locomotion experiment teaches us that the information-theoretic reward functions (PI and entropy) have no effect in well-defined experimental set-ups.
The cart-pole and the hexapod self-rescue experiments teach us that the maximal values of the IRF should be around one percent of the maximal ERF value to improve the learning speed and learning performance in the short term. The asymptotic behaviour is either not affected or negatively affected by the one-step PI. The cart-pole experiment indicates that maximising the entropy is superior to maximising the PI, whereas the hexapod self-rescue does not show such a clear picture. The success of the entropy in both experiments is explained by the ERFs. Due to their nature, random changes in the policy parameters are unlikely to result in changes in the ERF during the first batches. Hence, maximising the entropy results in an exploration until the ERF is triggered.
The PI, defined as the entropy over the sensor states minus the conditional entropy of consecutive sensor states, does not lead to superior results for the cart-pole compared to just using the entropy, for the following reason. In this set-up, the morphology and environment are very simple and deterministic, and therefore, do not produce any noise or other uncertainties in the sensor data stream. The uncertainty about the next possible angular position of the pole is small if the current pole position is known. In other words, the cart-pole system is regular by definition and no further regularities can be found by maximising the PI. We speculate that the conditional entropy, which cannot be reduced by the learning in this setting, dampens the exploration effect of the entropy term in the PI maximisation. For the hexapod rescue experiment, the situation is different. There is an uncertainty about the next sensor state, given the current sensor state, which results from the morphology and the construction of the arena. The PI maximisation is able to find regularities which can then be exploited to maximise the ERF in the RL setting.
The results contradict our intuition, as the one-step predictive information has shown good results when applied as an information-driven self-organisation principle in the context of embodied artificial intelligence [Zahedi et al., 2010, Martius et al., 2013]. The intuitively plausible next step was to guide the information-driven self-organisation towards solving a goal by combining it with an external reward signal in a reinforcement learning context. The approach evaluated in this paper was to linearly combine the PI with an external reward signal in an episodic policy gradient setting. If anything, the PI showed positive short-term results only when the world was considerably probabilistic and the external reward was sparse. Compared to no intrinsic reward, the PI showed negative results for its asymptotic behaviour. The performance of the PI was equal to or worse than that of the entropy in all cases. This leads to the conclusion that research in the context of information-driven intrinsic rewards and reinforcement learning should be carried out in other directions, which are briefly described in the final paragraph.
We have used a constant combination factor γ for all experiments presented in this work. It is known from general learning theory that a decaying learning rate is required for the convergence of a system. We chose not to use a decaying combination factor, because this would mean that the internal drive is slowly dampened until its effect is negligible (at least in a technical application). This would contradict the idea of motivation-driven and open-ended learning of embodied agents. However, the results of the present paper reveal a disadvantage of this approach in the asymptotic limit and therefore suggest, contrary to our original thoughts, pursuing a strategy with a decaying combination factor. The second possible modification of this approach is to replace the linear combination of the internal and external reward by a non-linear function, of which multiplicative and exponential functions are two examples. Third, using a gradient of the PI instead of a random exploration in the context of RL is a promising approach that is currently being investigated. In this approach, we will use a gradient on an estimate of the PI and not the error of a predictor as in, e.g., [Schmidhuber, 1991]. Fourth, we will continue to evaluate other information-theoretic measures in the context of task-dependent learning with the support of information-driven intrinsic motivation. In addition to using correlation measures, such as the mutual information, we believe that causal measures in the sensorimotor loop [Ay and Zahedi, 2013], such as the measure considered in [Zahedi and Ay, 2013], are good candidates for future research in this field.
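The first two modifications can be sketched as follows. Everything in this block is an assumption made for illustration: the exponential half-life schedule, the linear form ERF + γ · IRF, the multiplicative form ERF · (1 + γ · IRF), and all function names are hypothetical and have not been evaluated in this work.

```python
def decayed_gamma(gamma0, batch, half_life):
    """Combination factor that halves every `half_life` batches (assumed schedule)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return gamma0 * 0.5 ** (batch / half_life)


def linear_reward(erf, irf, gamma):
    """Linear combination, assumed to be ERF + gamma * IRF."""
    return erf + gamma * irf


def multiplicative_reward(erf, irf, gamma):
    """One possible non-linear combination: ERF * (1 + gamma * IRF)."""
    return erf * (1.0 + gamma * irf)


# Illustrative values: gamma starts at 0.01 and is halved after 100 batches.
gamma_start = decayed_gamma(0.01, batch=0, half_life=100)
gamma_later = decayed_gamma(0.01, batch=100, half_life=100)
r_linear = linear_reward(erf=0.0, irf=2.0, gamma=gamma_start)
r_mult = multiplicative_reward(erf=0.0, irf=2.0, gamma=gamma_start)
```

One design consideration becomes visible in the last two lines: with this particular multiplicative form the intrinsic term has no effect while the ERF is zero, so it would not support exploration before a sparse external reward is first triggered, whereas the linear form does.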

Figure 1: The sensorimotor loop. Left: schematic diagram of a cognitive system with its interaction with the world. Right: corresponding causal graph.

Figure 2: Controller architectures for the cart-pole swing-up task. The input neurons are bare buffer neurons, whereas the hidden and output neurons have a tanh transfer function. (A) from [Pasemann et al., 1999]; (B) with 4 hidden neurons and fully connected; (C,D) recurrent variations without and with lateral connections.

Figure 3: Results for cart-pole experiments. Each row shows the results for one controller architecture, see Fig. 2. The corresponding connection matrix is provided in the first column (gray: connection, black: no connection). For simplicity, only the row for the second controller is discussed in detail. (A,B) ERF and IRF for PI maximisation: small values of γ > 0 are advantageous. (C,D) ERF and IRF for entropy maximisation: all values of γ > 0 have a positive effect.

Figure 4: Hexapod for locomotion task and controller set-up. (A) Hexapod robot with marked actuated joints and sensors; (B) leg module of controller; (C) entire controller; and (D) schematic pairings for PI and entropy calculation.

Figure 5: Results for hexapod locomotion task. ERF and IRF with PI maximisation (A,B) and entropy maximisation (C,D). No significant effect is observed.

Figure 6: Hexapod robot for self-rescue and the experimental set-up. (A) The robot has 6 legs, where the hind legs are 10% larger than the other legs. Each leg has two active DoF in the hip joint and one passive DoF in both the knee and the ankle joint, each equipped with a spring. Additionally, each whisker has two spring joints. (B) The robot starts in the centre of the trap with a certain barrier height and has to escape from it. The reward is the distance from the outside of the trap or zero otherwise.

Figure 7: Performance in the self-rescue task depending on the internal reward type and factor γ. Plotted are the ERF and the IRF in case of PI (A,B,E,F,I,J) and entropy (C,D,G,H,K,L) over the number of batches for different values of γ and barrier heights h: (A-D) no barrier (h = 0), (E-H) low barrier (h = 0.1) and (I-L) high barrier (h = 0.2). For each value of γ the mean and standard deviation of 30 experiments are displayed. In all cases a speed-up in learning is achieved with the IRF; however, the asymptotic performance is worse.