Neural model for learning-to-learn of novel task sets in the motor domain

During development, infants learn to differentiate their motor behaviors relative to various contexts by exploring and identifying the correct structures of causes and effects that they can perform; these structures of actions are called task sets or internal models. The ability to detect the structure of new actions, to learn them and to select on the fly the proper one given the current task set is one great leap in infant cognition. This behavior is an important component of the child's ability of learning-to-learn, a mechanism akin to intrinsic motivation, which is argued to drive cognitive development. Accordingly, we propose to model a dual system based on (1) the learning of new task sets and on (2) their evaluation relative to their uncertainty and prediction error. The architecture is designed as a two-level neural system for context-dependent behavior (the first system) and task exploration and exploitation (the second system). In our model, the task sets are learned separately by reinforcement learning in the first network after their evaluation and selection in the second one. We use two different experimental setups to show the sensorimotor mapping and switching between tasks: a first one in a neural simulation for modeling cognitive tasks and a second one with an arm-robot for motor task learning and switching. We show that the interplay of several intrinsic mechanisms drives the rapid formation of the neural populations with respect to novel task sets.


INTRODUCTION
The design of a multi-task robot that can cope with novelty and evolve in an open-ended manner is still an open challenge for robotics. It is, however, an important goal (1) for conceiving personal assistive robots that are adaptive (e.g., to infants, to the elderly and to handicapped people) and (2) for studying from an interdisciplinary viewpoint the intrinsic mechanisms underlying decision making, goal-setting and the ability to respond on the fly and adaptively to novel problems.
For instance, robots cannot yet reach the level of infants in exploring alternative ways to surmount an obstacle, searching for a hidden toy in a new environment, finding by themselves the proper way to use a tool, or solving a jigsaw puzzle. All these tasks have to be solved within the boundaries of their given problem space, without exploring it entirely. Thus, robots lack this ability to detect and explore new behaviors and action sequences oriented toward a goal; i.e., what is called a task set (Harlow, 1949; Collins and Koechlin, 2012).
The ability to manipulate task sets dynamically is, however, a fundamental aspect of cognitive development (Johnson, 2012). Early in infancy, infants are capable of flexible decision-making and dynamic executive control, even at a simple level, in order to deal with the unexpected (Tenenbaum et al., 2011). Later on, when they are more mature, they learn to explore the task space, to select goals and to focus progressively on tasks of increasing complexity. One example in motor development is the learning of different postural configurations. Karen Adolph explains, for instance, how infants progressively differentiate their motor behaviors into task sets (i.e., the motor repertoire) and thoroughly explore the boundaries of each postural behavior until they become expert in what they discover (Adolph and Joh, 2005, 2009). Adolph further argues that the building of a motor repertoire is not preprogrammed with a specific developmental timeline but that each postural behavior (crawling, sitting, or standing) can be learned independently as a separate task, without pre-ordered dependencies on the other ones.
This viewpoint is also shared by neurobiologists, who conceive of the motor system as structuring the action repertoire into "internal models" for each goal to achieve (Wolpert and Flanagan, 2010; Wolpert et al., 2011). Each novel contextual cue (e.g., handling a novel object) promotes the acquisition and the use of a distinct internal model that does not modify the existing neural representations used to control the limb on its own (White and Diedrichsen, 2013). Moreover, each task set is evaluated depending on the current dynamics and on the current goal we want to achieve (Orban and Wolpert, 2011). For instance, we switch dynamically between different motor strategies to select the most appropriate one depending on the context; e.g., tilting the racket to the correct angle in order to give the desired effect to the ball, or handling objects properly with respect to their estimated masses (Cothros et al., 2006).
From a developmental viewpoint, the capability for flexible decision-making gradually improves in 18-month-old infants (Tenenbaum et al., 2011). Decision-making enables infants to evaluate the different alternatives they have for achieving one goal with respect to the ongoing sequence and to select the correct one(s) among them. It also gives them the possibility to inhibit some previously learned strategies in order to explore new ones never seen before (Yokoyama et al., 2005).
In AI, this craving to explore, to test and to embed new behaviors is known as intrinsic motivation (Kaplan and Oudeyer, 2007). In Kaplan and Oudeyer's words: "The idea is that a robot (...) would be able to autonomously explore its environment not to fulfill predefined tasks but driven by some form of intrinsic motivation that pushes it to search for situations where learning happens efficiently". In this paper, we focus more on the idea that the rewards are self-generated by the machine itself (Singh et al., 2010) and that the function of intrinsic motivation is mainly to regulate the exploration/exploitation problem, driving exploratory behavior and looking for different successful behaviors in pursuing a goal. In that context, we propose that the ability to choose, with regard to the current situation, whether to follow the same plan or to create a novel one out of nothing is an intrinsic motivation. We studied, for instance, the role of the neuromodulator acetylcholine in the hippocampus for novelty detection and memory formation (Pitti and Kuniyoshi, 2011).
Meanwhile, the capability to make decisions and to select between many options is one important aspect of intrinsic motivation, because otherwise the system would be only passive and would not be able to select or encourage one particular behavior. Making decisions in deadlock situations therefore requires some problem-solving capabilities like means-end reasoning (Koechlin et al., 2003) and error-based learning capabilities (Adolph and Joh, 2009). For instance, means-end reasoning and error-based learning are involved in some major psychological tests such as the Piagetian "A-not-B error test" (Diamond, 1985; Smith et al., 1999; Schöner and Dineva, 2007), Harlow's learning set test (Harlow, 1949) and tool-use (Lockman, 2000; Fagard et al., 2012; Vaesen, 2012; Guerin et al., 2013). The A-not-B error test describes a decision-making problem where a 9-month-old infant still persists in selecting an automatic wrong response (e.g., the location A) and cannot switch dynamically from this erroneous response to the correct one (e.g., the location B). Above this age, however, infants do not make the error and switch rapidly to the right location. A similar observation is found in Harlow's experiments on higher learning (Harlow, 1949), where rhesus monkeys and humans have to catch the pattern of the experiment in a series of learning experiences. Humans and monkeys learn to respond faster when facing a novel but similar situation by switching to the correct strategy, that is, by catching the pattern that stops them from making the error: they show therefore that they do not master isolated tasks but, instead, grasp the relation between the events. In one situation, if the animal guessed wrong on the first trial, then it should switch directly to the other solution. In another situation, if it guessed right on the first trial, then it should continue.
This performance seems to require that the monkey, the baby or the person use an abstract rule and solve the problem with an apparent inductive reasoning (Tenenbaum et al., 2011). In line with these observations on the development of flexible behaviors, researchers have focused on tool-use: when infants start to use an object as a means to an end, they serialize their actions toward a specific goal, for example reaching a toy with a stick (Rat-Fischer et al., 2012; Guerin et al., 2013). Tool-use also requires finding patterns such as the shape of the grasp and the order and sequentiality of actions (Cothros et al., 2006).
Considering the mechanisms it may involve, Karen Adolph emphasizes the ability of learning-to-learn (Adolph and Joh, 2005), a process akin to the one described by Harlow (1949). Harlow coined the expression to distinguish the means for finding solutions to novel problems from simple associative learning and stimulus generalization (Adolph, 2008). Adolph reinterprets this proposal and suggests that two different kinds of thinking and learning are at work in the infant brain, governing the aspects of exploration and of generalization (Adolph and Joh, 2009). On the one hand, one learning system is devoted to the learning of task sets from simple stimulus-response associations. For instance, when an infant recognizes the context, he selects his most familiar strategy and reinforces it within its delimiting parameter ranges. On the other hand, a second learning system is devoted to detecting a new situation as such and to finding a solution dynamically in a series of steps. Here, the acceptance of uncertainty gradually leads to making choices and decisions in situations never seen before. However, which brain regions and which neural mechanisms underlie this framework?
Among the different brain regions, we emphasize that the posterior parietal cortex (PPC) and the prefrontal cortex (PFC) are important (1) for learning context-dependent behavior and (2) for evaluating and selecting these behaviors relative to their uncertainty and error prediction. Regarding the PPC, different sensorimotor maps co-exist to represent structured information like spatial information or the reaching of a target, built on coordinate transform mechanisms (Stricanne et al., 1996; Andersen, 1997; Pouget and Snyder, 2000). Furthermore, recent studies acknowledge the existence of context-specific neurons in the parietomotor system for different grasp movements (Brozovic et al., 2007; Andersen and Cui, 2009; Baumann et al., 2009; Fluet et al., 2010). Regarding the PFC, Johnson identifies the early development of the prefrontal cortex as an important component for enabling executive functions (Johnson, 2012), while other studies have demonstrated difficulty in learning set formation following extensive damage to the prefrontal cortex (Warren and Harlow, 1952; Yokoyama et al., 2005). The PFC manipulates information on the basis of the current plan (Fuster, 2001), and it is active when new rules need to be learned and other ones rejected. Besides, its behavior is strongly modulated by the anterior cingulate cortex (ACC), which plays an active role in evaluating task sets and in detecting errors during the current episode (Botvinick et al., 2001; Holroyd and Coles, 2002; Khamassi et al., 2011). If we look now at the functional organization of these brain structures, many authors emphasize the interplay between an associative memory for action selection in the temporal and parietal cortices (i.e., an integrative model) and a working memory for action prediction and decision making in the frontal area (i.e., a serial model) (Fuster, 2001; Andersen and Cui, 2009; Holtmaat and Svoboda, 2009).
All in all, these considerations allow us to draw a scenario based on two complementary learning systems.
More precisely, we propose to model a dual system based on (1) the learning of task sets and on (2) the evaluation of these task sets relative to their uncertainty and error prediction. Accordingly, we design a two-level neural system for context-dependent behavior (PPC) and task exploration and prediction (ACC and PFC); see Figure 1. In our model, the task sets are learned separately by reinforcement learning in the posterior parietal cortex after their evaluation and selection in the prefrontal cortex and anterior cingulate cortex. On the one hand, the learner or agent stores and exploits its familiar knowledge through a reinforcement learning algorithm into contextual patterns collected from all its different modalities. On the other hand, the learner evaluates and compares the way it learns, and selects the useful strategies while it discards others or tests new ones on the fly if no relevant strategy is found. We use two different experimental setups to show the sensorimotor mapping and switching between tasks, one in a neural simulation for modeling cognitive tasks and another with an arm-robot for motor task learning and switching. We use neural networks to learn simple sensorimotor mappings for different tasks and compute their variance and error for estimating the sensorimotor prediction. Above a certain threshold, the error signal is used to select and to evaluate the current strategy. If no strategy is found pertinent for the current situation, this corresponds to a novel motor schema that is learned independently by a different map. In a cognitive experiment similar to Harlow (1949) and Diamond (1990), we employ this neural structure to learn multiple spatio-temporal sequences and to switch between different strategies if an error has occurred or if a reward has been received (error-learning).
In a psychophysics experiment similar to Wolpert and Flanagan (2010), we show how a robotic arm learns the visuomotor strategies for stabilizing the end-point of its own arm when it moves it alone and when it is holding a long stick. Here, the uncertainty on the spatial location of the end-point triggers the decision between the two strategies, selecting the best one given the proprioceptive and visual feedback and the error signal delivered.

MATERIALS AND METHODS
In this section, we present the neural architecture and the mechanisms that govern the dynamics of the neurons, of reinforcement learning and of decision-making. We first describe the bio-inspired mechanism of rank-order coding, from which we derive the activity of the parietal and of the prefrontal neurons. Second, we describe the reinforcement learning algorithm, the error prediction reward and the decision-making rules.

PPC-GAIN-FIELD MODULATION AND SENSORIMOTOR MAPPING
We employ rank-order coding neurons to model the sensorimotor mapping between input and output signals with an architecture that we have used in previous research (Pitti et al., 2012). This architecture implements multiplicative neurons, called gain-field neurons, that multiply unit by unit the values of two or more incoming neural populations, see Figure 2. Its organization is interesting because it transforms the incoming signals into a basis-function representation that could be used to simultaneously represent stimuli in various reference frames (Salinas and Thier, 2000). The multiplication between afferent sensory signals, in this case from two population codes $X_{m_1}$ and $X_{m_2}$, with $m_1 \in M_1$ and $m_2 \in M_2$, produces the signal activity $X_n$ of the $n$ gain-field neurons, $n \in N$:

$X_n = X_{m_1} \times X_{m_2}$ (1)

The key idea here is that the gain-field neurons encode two pieces of information at once and that the amplitude of the gain-field neurons relates the values of one modality conditionally to the other; see Figure 2A. The task is therefore encoded into a space of lower dimension (Braun et al., 2009, 2010). We exploit this feature to model the parietal circuits for different contextual cues and internal models, which means that, after the encoding, the output layers learn the receptive fields of the gain-field map and translate this information into various gain levels. In Figure 2B, we give a concrete example of one implementation, here limited to two modalities, with N gain-fields projecting to three task sets of different sizes. We explain hereafter (1) how the gain-field neurons learn the associations between various modalities and (2) how the neurons of the output map learn from the gain-field neurons for each desired task.
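The unit-by-unit multiplication described above can be illustrated with a minimal sketch. It assumes, for illustration only, that each gain-field neuron reads one unit from each of the two population codes; the pairing table `pairs`, the Gaussian population code and the function names are hypothetical and are not taken from the model.

```python
import numpy as np

def gaussian_population(value, size, width=1.5):
    """Hypothetical population code: a Gaussian bump centred on `value`."""
    idx = np.arange(size)
    return np.exp(-0.5 * ((idx - value) / width) ** 2)

def gain_field_activity(x1, x2, pairs):
    """Multiply, for each gain-field neuron n, one unit of each modality.

    pairs[n] = (m1, m2) is an assumed wiring: neuron n reads unit m1 of
    the first population and unit m2 of the second one.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    pairs = np.asarray(pairs)
    return x1[pairs[:, 0]] * x2[pairs[:, 1]]

rng = np.random.default_rng(0)
M1, M2, N = 10, 10, 20            # two modalities of ten units, twenty GF neurons
pairs = np.column_stack([rng.integers(0, M1, N), rng.integers(0, M2, N)])
x1 = gaussian_population(3, M1)   # first modality encodes the value 3
x2 = gaussian_population(7, M2)   # second modality encodes the value 7
x_gf = gain_field_activity(x1, x2, pairs)
```

A gain-field neuron is strongly active only when both of its afferent units are active, which is how its amplitude relates one modality conditionally to the other.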

RANK-ORDER CODING ALGORITHM
We implement a Hebbian-like learning algorithm proposed by Van Rullen et al. (1998), called the Rank-Order Coding (ROC) algorithm. The ROC algorithm has been proposed as a discrete and faster model of the derivative integrate-and-fire neuron (Van Rullen and Thorpe, 2002). ROC neurons are sensitive to the sequential order of the incoming signals; that is, to their rank code, see Figure 3A. The distance similarity to this code is transformed into an amplitude value. A scalar product between the input's rank code and the synaptic weights then furnishes a distance measure and the activity level of the neuron. More precisely, the ordinal rank code can be obtained by sorting the signals' vector relative to their amplitude levels or to their temporal order in a sequence. We use this property respectively for modeling the signal's amplitude for the parietal neurons and the spatio-temporal patterns for the prefrontal neurons. If the rank code of the input signal matches perfectly the one of the synaptic weights, then the neuron fully integrates this activity over time and fires, see Figure 3A. Conversely, if the rank order of the signal vector does not properly match the ordinal sequence of the synaptic weights, then integration is weak and the neuron discharges proportionally to it, see Figure 3B. The neurons' output X is computed by multiplying the rank order of the sensory signal vector I, rank(I), by the synaptic weights w, with w ∈ [0, 1]. For a vector signal of dimension M and for a population of N neurons (M afferent synapses), this product gives the activity of the GF neurons and of the output PPC neurons (Equation 2). The updating rule of the neurons' weights is similar to the winner-takes-all learning algorithm of Kohonen's self-organizing maps (Kohonen, 1982). For the best neuron s ∈ N and for all afferent signals m ∈ M, the weights of the neurons of the output layer are updated according to Equation 3; the equations are the same for GF neurons (not reproduced here).

FIGURE 1 | Framework for task set selection. The whole system is composed of three distinct neural networks, inspired by Khamassi et al. (2011). The PPC network conforms to an associative network. It binds the afferent sensory inputs to each other and maps them to different motor outputs with respect to a task set. The ACC system is an error-based working memory that processes the incoming PPC signals and feeds back an error to them with respect to the current task. This modulated signal is used to tune the population of neurons in PPC by reinforcement learning; it is also conveyed to the PFC map, which is a recurrent network that learns dynamically the spatio-temporal patterns of the ongoing episodes with respect to the task.

www.frontiersin.org
October 2013 | Volume 4 | Article 771 | 3
We note that the synaptic weights follow a power-scale density distribution that makes the rank-order coding neurons similar to basis functions. This attribute allows us to use them as receptive fields, so that the more distant the input signal is from the receptive field, the lower its activity level; e.g., Figure 3B.
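The activation and weight-update equations of this section do not appear in the running text. As a reading aid, one form that is consistent with the verbal description (a product of the rank code with the weights, followed by a Kohonen-like update of the winner $s$) is sketched below. The use of the inverse rank $1/\mathrm{rank}(I_m)$ is an assumption made here so that the first-ranked input contributes the most; it should not be read as the authors' exact notation.

```latex
% Assumed form, reconstructed from the verbal description only.
% Activity of neuron n for an input vector I of dimension M:
X_n = \sum_{m \in M} \frac{1}{\mathrm{rank}(I_m)} \, w_{n,m}

% Winner-takes-all update of the best neuron s, for all afferents m:
\Delta w_{s,m} = \frac{1}{\mathrm{rank}(I_m)} - w_{s,m},
\qquad
w_{s,m} \leftarrow w_{s,m} + \Delta w_{s,m}
```

With this form, a neuron whose weights already equal the inverse rank code of the input receives the largest possible activity and a zero update, which matches the receptive-field behavior described above.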

FIGURE 2 | Task sets mapping, the mechanism of gain-fields. (A) Gain-field neurons are units used for sensorimotor transformation. They transform the input activity into another basis, which is then fed forward to various outputs with respect to their task. Gain-fields can be seen as meta-parameters that reduce the complexity of the sensory-motor problem to a linear one. (B) Example of GF neurons' sensorimotor transformation for two modalities projecting to three different task sets; each GF neuron contributes to one particular feature of the tasks (Pouget and Snyder, 2000; Orban and Wolpert, 2011).

REINFORCEMENT LEARNING AND ERROR REWARD PROCESSING
The use of the rank-order coding algorithm provides an easy framework for reinforcement learning and error-based learning (Barto, 1995). For instance, the adaptation of the weights in Equation 3 can be modified simply with a variable $\alpha \in [0, 1]$ that weights $\Delta W$; see Equation 4. If $\alpha = 0$, then the weights are not reinforced: $W_{t+1} = W_t$. If $\alpha = 1$, then the weights are reinforced in the direction of $\Delta W$: $W_{t+1} = W_t + \alpha \Delta W$. In addition, conditional learning can be achieved simply by adding an external bias $\beta$ to the neurons' output X. By changing the amplitude of the neurons, we also change the rank order to be learned and therefore influence, in the long term, the overall organization of the network; see Equation 5.
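The role of the two modulation variables can be sketched in a few lines of Python. The sketch assumes an inverse-rank code (1/rank, with rank 1 for the largest input) and a winner-only update; both choices, as well as the function names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def inverse_rank(signal):
    """Return 1/rank for each entry; the largest entry gets rank 1 (assumed code)."""
    signal = np.asarray(signal, dtype=float)
    order = np.argsort(-signal, kind="stable")
    ranks = np.empty(signal.size)
    ranks[order] = np.arange(1, signal.size + 1)
    return 1.0 / ranks

def roc_activity(weights, signal, beta=0.0):
    """Activity of each neuron: rank code times weights, plus an external bias beta."""
    return weights @ inverse_rank(signal) + beta

def roc_update(weights, signal, alpha=1.0):
    """Move the winner's weights toward the rank code, scaled by alpha in [0, 1]."""
    s = int(np.argmax(roc_activity(weights, signal)))
    updated = weights.copy()
    updated[s] += alpha * (inverse_rank(signal) - updated[s])
    return updated, s

rng = np.random.default_rng(0)
w = rng.random((5, 4))                      # 5 neurons, 4 afferent synapses
signal = np.array([0.2, 0.9, 0.5, 0.1])
w_same, s = roc_update(w, signal, alpha=0.0)   # alpha = 0: no reinforcement
w_full, _ = roc_update(w, signal, alpha=1.0)   # alpha = 1: full step toward the code
```

With `alpha=0.0` the weights are left unchanged, and with `alpha=1.0` the winner's weights become the rank code of the input, which corresponds to the two limit cases discussed in the text. A negative `beta` applied to one neuron lowers its activity and can therefore change which neuron wins.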

Cortical plasticity in PPC
For modeling the cortical plasticity in the PPC output maps, we implement an experience-driven plasticity mechanism.
Observations in rats show that, during the learning of novel motor skills, the synapses rapidly spread in the neocortex as soon as the animal learns a new task (Xu et al., 2009; Ziv and Ahissar, 2009). Rougier and Boniface proposed a dynamic learning rule in self-organizing maps to combine both the stability of the synapses' population to familiar inputs and the plasticity of the synapses' population to novel patterns (Rougier and Boniface, 2011). In order to model this feature in our PPC map, we redefine the coefficient $\alpha$ in Equation 5 and we rearrange the formula proposed by Rougier and Boniface (Equation 6), where $\eta$ is the elasticity or plasticity parameter, which we set to 1, and $\max(X^{PPC})$ is the upper bound of the neural activity, its maximal value, whereas $\max(X^{PPC})$ is the current maximum value within the neural population, with $\alpha = 0$ when $X^{PPC}_s = \max(X^{PPC})$. In this equation, the winner neuron learns the data according to its own distance to the data. If the winner neuron is close enough to the data, it converges slowly to represent them. Conversely, if the winner neuron is far from the data, it converges rapidly to them.
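The qualitative behavior of this experience-driven rule (slow convergence for a winner close to the data, fast convergence for a distant one) can be illustrated with a deliberately simple coefficient. The linear form below, the clipping and the names `x_winner` and `x_upper` are assumptions chosen for illustration; this is neither the formula of Rougier and Boniface nor the one used in the model.

```python
def plasticity_coefficient(x_winner, x_upper, eta=1.0):
    """Illustrative learning rate: zero when the winner's activity reaches
    the upper bound, growing as the winner gets farther from the data.

    Assumed form: alpha = eta * (1 - x_winner / x_upper), clipped to [0, 1].
    """
    ratio = min(max(x_winner / x_upper, 0.0), 1.0)
    return min(max(eta * (1.0 - ratio), 0.0), 1.0)

alpha_familiar = plasticity_coefficient(0.95, 1.0)  # winner close to the data
alpha_novel = plasticity_coefficient(0.20, 1.0)     # winner far from the data
```

Under this assumption, a familiar input barely changes the winner's weights, whereas a novel input triggers a large update, which is the stability/plasticity trade-off targeted by the rule.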

Error-reward function in ACC
For modeling ACC, we implement an error-reward function similar to Khamassi et al. (2011) and to Q-learning-based algorithms. The neurons' value is updated only when an error occurs: an inhibitory feedback error signal is then sent to the winning neuron to diminish its activity $X_{win}$, with $ACC(X_{win}) = -1$, and the neurons' output X is updated by adding this feedback as an external bias. The neurons' activity in ACC is cleared every time the system responds correctly or provides a good answer. ACC can then be seen as a contextual working memory, a saliency buffer extracted from the current context when errors occur, which inhibits the wrong actions performed. Its activity may make it possible to establish an exploration-based type of learning by trial and error, and an attentional switch signal away from automatic responses, in order to deal with the unexpected when a novel situation occurs.
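A minimal sketch of this error memory is given below, assuming that the inhibitory feedback is simply added to the PPC activity before the winner is selected. The class name `ErrorMemory` and the helper `select_with_acc` are hypothetical and only illustrate the mechanism.

```python
import numpy as np

class ErrorMemory:
    """Sketch of an ACC-like buffer: -1 on wrongly winning units, cleared on success."""

    def __init__(self, n_units):
        self.bias = np.zeros(n_units)

    def feedback(self, winner, correct):
        if correct:
            self.bias[:] = 0.0          # cleared every time the answer is correct
        else:
            self.bias[winner] = -1.0    # inhibit the unit that produced the error
        return self.bias

def select_with_acc(x_ppc, acc):
    """Pick the most active unit after adding the inhibitory bias (assumed coupling)."""
    return int(np.argmax(np.asarray(x_ppc, dtype=float) + acc.bias))

x_ppc = [0.9, 0.8, 0.1]
acc = ErrorMemory(3)
first = select_with_acc(x_ppc, acc)     # automatic response
acc.feedback(first, correct=False)      # an error is signalled
second = select_with_acc(x_ppc, acc)    # the inhibited unit is avoided
acc.feedback(second, correct=True)      # success clears the memory
third = select_with_acc(x_ppc, acc)
```

The error signal only says that the chosen unit was wrong; the alternative is found by exploration, because the next most active unit wins once the previous winner is inhibited.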

PFC-SPATIO-TEMPORAL LEARNING IN A RECURRENT NETWORK
We can employ rank-order coding for modeling a spike-based recurrent neural network in which the amplitude values of the incoming input signals are replaced by the past spatio-temporal activity pattern of the network. Although the rank-order coding algorithm was first used to model the fast processing of the feed-forward neurons in V1, it has also been demonstrated to replicate the Hebbian learning mechanism of Spike Timing-Dependent Plasticity (STDP) in cortical neurons (Bi and Poo, 1998; Abbott and Nelson, 2000; Izhikevich et al., 2004). For a population of N neurons, we arbitrarily choose to connect each neuron to a buffer of size 20 × N so that it encodes the rank code of the neurons' amplitude values over the past 20 iterations. At each iteration, this buffer is shifted to accept the new values of the neurons.
Recurrent networks can generate novel patterns on the fly based on their previous activity pattern while, at each iteration, a winning neuron gets its links reinforced. Over time, the system regulates its own activity whereas coordinated dynamics can be observed. These behaviors can be used for anticipation and predictive control.
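The shifting buffer and the reinforcement of the winning neuron can be sketched as follows. The buffer depth of 20 iterations follows the text; the inverse-rank code, the learning rate `alpha`, the normalization of the activity and all names are assumptions introduced for illustration.

```python
import numpy as np

class TemporalBuffer:
    """Holds the activity of N neurons over the past `depth` iterations."""

    def __init__(self, n_neurons, depth=20):
        self.data = np.zeros((depth, n_neurons))

    def push(self, activity):
        self.data = np.roll(self.data, 1, axis=0)  # shift older values back
        self.data[0] = activity                    # newest values in the first row

    def rank_code(self):
        """Inverse-rank code (assumed) of the whole depth x N buffer, flattened."""
        flat = self.data.ravel()
        order = np.argsort(-flat, kind="stable")
        ranks = np.empty(flat.size)
        ranks[order] = np.arange(1, flat.size + 1)
        return 1.0 / ranks

def recurrent_step(weights, buffer, alpha=0.1):
    """One iteration: compute activities, reinforce the winner, update the buffer."""
    code = buffer.rank_code()
    x = weights @ code
    s = int(np.argmax(x))
    weights[s] += alpha * (code - weights[s])      # winner's links are reinforced
    buffer.push(x / x.max())                       # normalized activity fed back
    return x, s

N = 6
rng = np.random.default_rng(0)
weights = rng.random((N, 20 * N))   # each neuron sees the full 20 x N buffer
buf = TemporalBuffer(N)
winners = [recurrent_step(weights, buf)[1] for _ in range(3)]
```

Because each neuron's input is the network's own recent history, the sequence of winners depends on the past activity pattern, which is the property exploited for anticipation.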

RESULTS
We propose to study the overall behavior of each neural system during the learning of task sets and the dynamics of the ensemble working together. The first three experiments are performed in computer simulation only. They describe the behavior of the PPC maps working alone, working along with the ACC system, and working along with the ACC and PFC systems for learning and selecting context-dependent task sets. Experiment 4 is performed on a robot arm. This experiment describes the acquisition and the learning of two different task sets, with and without the manipulation of a tool.

EXPERIMENT 1-PLASTICITY vs STABILITY IN LEARNING TASK SETS
In this first experiment, we test the capabilities of our network to learn novel contexts incrementally without forgetting the older ones, which corresponds to the so-called plasticity/stability dilemma of a memory system: retaining the familiar inputs while flexibly incorporating the novel ones. Our protocol follows the diagram in Figure 4, in which we show gradually four different contexts for two input modalities with vectors of ten indices. The input patterns are randomly selected from an area in the current context, which is itself chosen randomly and for a variable period of time. In this experiment, the PPC output map has 50 neurons that receive the activity of twenty gain-field neurons, see Figure 2B. We display in Figure 5A the raster plot of the PPC neurons' dynamics with distinct colors with respect to the context. Contexts are given gradually, one at a time, so that some neurons have to unlearn their previous cluster first in order to fit the new context. It is important to note that categorization is unsupervised and results from the experience-driven plasticity rule in Equation 6. In order to demonstrate the plasticity of the PPC network during the presentation of a new context, we present context number four, plotted in magenta and never seen before, at t = 11500. Here, the new cluster is rapidly formed and stable over time, due again to the cortical plasticity mechanism from Equation 6. The graph therefore displays not only the plasticity of the clusters in the PPC network but also their robustness.

FIGURE 4 | Protocol setup in task sets learning. This simple protocol explains how the experimental setup is done for acquiring different contexts incrementally and for selecting them.
This property is also shown in Figure 5B, where the convergence rates of the PPC weights vary for each task. This result explains how the PPC self-organizes into different clusters that specialize flexibly with respect to the task. The ratio between stability and plasticity within the network is shown in Figure 5C with the histogram of the neurons' membership over a certain time interval. The stability of one neuron is computed as its probability distribution relative to each context. The higher values correspond to very stable neurons, which are set to one context only and do not deviate from it, whereas the lower values correspond to very flexible neurons that frequently change from one context to another.
The histogram shows two probability distributions within the system and therefore two behaviors. For the neurons corresponding to values near the strong peak at 1.0, the activity is very stable and strongly identified with one context. This shows that for one third of the neurons, the behavior of the neural population is very stable. Conversely, the power-law curve centered on 0.0 shows the high variability of certain neurons, which are very dynamic for one third of the neural population.
We now study the neurons' activity during a task switch in Figure 6. In graph (A), the blue lines correspond to the dynamics of the neurons belonging to the context before the switch and the red lines correspond to the dynamics of the neurons belonging to the context after the switch. The activity level in each cluster is very salient for each context. The probability distribution of the neurons' dynamics with respect to each context is plotted in Figure 6B. It shows a small overlap between the contexts before and after the switch.

EXPERIMENT 2-LEARNING TASK SETS WITH A REINFORCEMENT SIGNAL
In this second experiment, we reproduce a decision-making problem similar to those studied in monkeys and humans with multiple choices and rewards (Churchland and Ditterich, 2012). The rules are not given in advance and the tasks switch randomly after a certain period of time with no regular pattern. The goal of the experiment is to catch the input-output correspondence pattern in order to stop making the error. The patterns are learned dynamically by reinforcement learning within each map and should ideally be learned without interfering with each other. The error signal indicates when an input-output association is erroneous with respect to a hidden policy; however, we note that it does not provide any hint about how to minimize the error. To understand how the whole system works, we focus our experiment on the PPC network with the ACC error processing system first, then with the PFC network. We choose to perform a two-choice experiment, with two output PPC maps initialized with random connections from the PPC map. The PPC network therefore consists of the gain-field architecture with the two output maps for modeling the two contexts. The two maps are then bidirectionally linked to the ACC system; the input signals for modalities 1 and 2 are projected to the PPC input vectors of twenty units each; map1 has twelve output units and map2 has thirteen output units, and both project to the ACC, which has twenty-five units. The hidden context we want the PPC maps to learn is to have output signals activated for a specific interval range of the input signals; namely, the first output map has to be activated when input neurons of indices below ten are activated and, reciprocally, the second output map has to be activated when input neurons of indices above ten are activated; this corresponds to the first two contexts in Figure 4. The error prediction signal is updated any time a mistake has been made on the interval range to learn.
As explained in the previous section, the ACC error signal always resets its activity when the PPC maps start to behave correctly.
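The hidden policy of this two-choice task and the uninformative nature of the error signal can be made explicit with a short sketch. The handling of the boundary index (ten is assigned here to the second map) and the function names are assumptions, since the text only specifies indices below and above ten.

```python
def hidden_policy(input_index, threshold=10):
    """Return the map expected to respond: 0 for map1, 1 for map2.

    Indices below the threshold belong to map1; the boundary index itself
    is assigned to map2 here, which is an assumption of this sketch.
    """
    return 0 if input_index < threshold else 1

def error_signal(input_index, chosen_map):
    """True when the chosen map violates the hidden policy.

    The signal is binary: it tells that a mistake occurred, not how to fix it.
    """
    return chosen_map != hidden_policy(input_index)

mistakes = [error_signal(i, chosen_map=0) for i in (3, 9, 12, 18)]
```

A learner that always answers with map1 is thus penalized only for the inputs of the second interval, without being told which map should have answered.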
We analyze the performance of the PPC-ACC system in the following. We display in Figures 7A,B the raster plots of the PPC and ACC dynamics with respect to the context changes for different periods of time. The chart on the top displays the timing of the context switches, the chart in the middle plots the ACC system working memory and the chart below plots the output of the PPC units. Figure 7A focuses on the beginning of the learning phase and Figure 7B on the period when the system has converged. We observe from these graphs that the units of the output maps self-organize very rapidly to avoid the error. ACC modulates the PPC signals negatively. We note that the error signal does not explicitly inhibit one map or the other but only the wrongly activated neuron of the map. As can be observed, over time, each map specializes to its task. As a result, learning is not homogeneous and also depends on the dimension of the context; that is, each map learns with a different convergence rate. The ACC error rapidly reduces its overall activity for the learning of task1 with respect to map1, although the error persists for the learning of task2 with respect to map2, where some neurons still fire wrongly.
We now study the convergence of the two maps and the confidence level of the overall system for the two tasks. We define a confidence level index as the difference of amplitude between the most active neurons in map1 and map2. We plot it in Figure 8, where blue corresponds to the confidence level for task1, v s_map1 − v s_map2, and red corresponds to the confidence level for task2, v s_map2 − v s_map1, during the learning phase. The dynamics reproduce trends similar to those of Figure 7: the confidence level progresses steadily until it converges to a stable performance rate, with a threshold around 0.4 above which a contextual state is recognized. Before 1000 iterations, the maps are very plastic, so the confidence level fluctuates rapidly and continuously between different values; at the end of the learning phase, the maps are more static, so the confidence level appears more discrete. This state is clearly observable in the histogram of the confidence level plotted on the right in Figure 8B for the case where the ACC error signal is injected into the associative network. The graph presents a probability distribution with two bell-shaped modes centred on 0.1 and 0.7, which correspond to the cases where the task space is recognized or not. In comparison, the probability distribution for the associative learning without error feedback is uniform, irrespective of the task; see Figure 8B in blue.

FIGURE 6 | Cluster dynamics at the time to switch. (A) Neural dynamics of the active clusters before and after the switch, in blue and in red respectively. (B) Histogram of the neural population at the time to switch with respect to the active clusters before and after the switch.

www.frontiersin.org
October 2013 | Volume 4 | Article 771 | 7

FIGURE 7 | Experiment on two-choice decision making and task switching. (A) Neural dynamics of the PPC neurons and of the ACC error system during task switch. The top chart plots the temporal interval for each task, the bottom chart the neural dynamics of the PPC maps and the middle chart their erroneous activity retranscribed in the ACC system. ACC works as a working memory that keeps track of the erroneous outputs, which is used during the learning stage. ACC is reset each time the PPC system gives a correct answer. Through reinforcement learning, the PPC maps converge gradually to the correct probability distribution. (B) Snapshot of the PPC maps in blue modulated negatively by ACC in red.

FIGURE 8 | Confidence level of PPC maps during task switch, dynamics and histogram. (A) The confidence level is the difference between the amplitude of the most activated neuron and the second one within each map. After one thousand iterations, the two maps rapidly specialize their dynamics to their associated tasks. This behavior is due to the ACC error-based learning. (B) Histogram of the probability distribution of the confidence level with and without ACC. With ACC, we observe a clear separation into two distributions, which corresponds to a decrease of uncertainty with respect to the task. In comparison, the confidence level in an associative network without error feedback gives a uniform distribution.
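As an illustration, the confidence level index can be sketched as follows. The sketch follows the definition given in the text (difference between the peak activities of the two maps) and uses the threshold of about 0.4 reported above; the activity values and the function names are hypothetical.

```python
def confidence_level(map_a, map_b):
    """Difference between the peak activities of two maps."""
    return max(map_a) - max(map_b)


def is_recognized(confidence, threshold=0.4):
    """True when the confidence exceeds the recognition threshold."""
    return confidence > threshold


map1 = [0.1, 0.9, 0.2]    # hypothetical activities of map1
map2 = [0.2, 0.1, 0.15]   # hypothetical activities of map2

c_task1 = confidence_level(map1, map2)   # confidence for task1
c_task2 = confidence_level(map2, map1)   # confidence for task2
```

With these values, the confidence for task1 lies above the threshold and the confidence for task2 below it, so only the first context would be recognized.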

EXPERIMENT 3-ADAPTIVE LEARNING ON A TEMPORAL SEQUENCE BASED ON ERROR PREDICTION REWARD
We now attempt to replicate Harlow's experiments on adaptive learning; in contrast to the previous experiments, however, it is the temporal sequence of task sets that is taken into account for the reward. We employ our neural system in a cognitive experiment, first to learn multiple spatio-temporal sequences and then to predict when a change of strategy has occurred, based on the error or on the reward received. In addition to the components used in the previous section, we add the PFC-like recurrent neural network to learn the temporal sequence from the PPC and ACC signals; see Figure 1.
Frontiers in Psychology | Cognitive Science

The experiment is similar to the previous two-choice decision-making task, except that the inputs now follow a temporal sequence within each map. When the inputs reach a particular point in the sequence, a point to switch, we proceed to a random choice between the two trajectories. As in the previous section, the learning phase for the PPC rapidly converges to the specialization of the two maps thanks to the ACC error-learning processing. Meanwhile, the PFC learns the temporal organization of the PPC outputs based on their sequential order; see Figure 9A. We do not give the PFC any information about the length, the number of patterns or the order of the sequence. Besides, each firing neuron reinforces its links with the current pre-synaptic neurons; see the raster plot in Figure 9B. After the learning phase, each PFC neuron has learned to predict some portion of the sequence based on the past and current PFC activity.
Their saliency to the current sequence is retranscribed in their amplitude level. We plot the activity level of neurons #10 and #14, in black and red respectively, in the second chart. This graph shows that their activity level gradually increases over intervals of at least ten iterations until they fire. The points to switch are also learned by the network; they are observable when the variance of the neurons' activity level becomes low, which is also seen when the confidence level goes under 0.4, corresponding to the dashed black line in the first chart. For instance, we plot the dynamics of the PPC neurons and of the PFC neurons during such a situation in Figure 10A at time t = 1653. The neural dynamics of each map display different patterns and, therefore, different decisions. The PPC activates more strongly the neurons of the first map (the neurons with indices below thirteen, in blue) whereas the PFC activates more strongly the neurons of the second map (the neurons with indices above thirteen, in dashed red). This shows that the PFC is not a purely passive system driven by the current activity in PPC/ACC; it also learns to predict future events based on its past activity. The PFC fuses the two systems in its dynamics, which is why it generates a noisy output distribution here, due to the conflicting signals. We plot in Figure 10B the influence of PPC on the PFC dynamics. In 60% of the cases, the two systems agree in predicting the current dynamics. This corresponds to the case of an automatic response, when familiar dynamics are predicted. During conflicts, a prediction error is made by one of the two systems, and more often the PPC dynamics, modulated by ACC, overwrite the values of the PFC units (blue bar). This situation occurs during a task switch, for instance. Conversely, the PFC imposes its own values over those of the PPC (red bar) mostly when the sensory information is ambiguous and can be overridden.
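The agreement between the two systems can be quantified along the following lines. This is a sketch under stated assumptions, not the analysis code behind Figure 10B: we assume that the preferred map of each system is given by the index of its most active neuron, and that neurons with an index below a hypothetical `split` value belong to the first map. The toy data use four neurons and `split=2`.

```python
def preferred_map(activity, split=13):
    """Return 1 or 2 depending on which map holds the most active neuron.

    Assumption: neurons with index < split belong to map 1,
    the others to map 2.
    """
    winner = max(range(len(activity)), key=lambda i: activity[i])
    return 1 if winner < split else 2


def agreement_rate(ppc_steps, pfc_steps, split=13):
    """Fraction of time steps where PPC and PFC prefer the same map."""
    agree = sum(
        preferred_map(p, split) == preferred_map(f, split)
        for p, f in zip(ppc_steps, pfc_steps)
    )
    return agree / len(ppc_steps)


# Hypothetical activities for three time steps and four neurons.
ppc_steps = [[0.9, 0.1, 0.2, 0.1], [0.1, 0.2, 0.8, 0.3], [0.7, 0.1, 0.2, 0.3]]
pfc_steps = [[0.8, 0.2, 0.1, 0.1], [0.2, 0.1, 0.3, 0.9], [0.1, 0.2, 0.9, 0.3]]

rate = agreement_rate(ppc_steps, pfc_steps, split=2)
```

In this toy example the two systems agree on the first two steps and conflict on the third one, which is the kind of event counted in the blue and red bars.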
In order to better understand the decision-making process within the PFC map, we display in Figures 11A,B the temporal integration performed dynamically at each iteration within the network. Temporal integration means the process of summing the weights in Equation 2 at each iteration with respect to the current order. If the sequence order is well recognized, the neuron's value rises very rapidly; otherwise, it remains low. As explained in the previous paragraph, each neuron is sensitive to certain patterns in the current sequence based on the synaptic links within the recurrent network. This is reflected in the graph by the integration of larger values. The spatio-temporal sequences they correspond to are darkened proportionally to their activation level. The higher the activation level during the integration period, the faster the anticipation of the sequence. We present the case of an unambiguous pattern in Figure 11A and that of an ambiguous sequence activity in Figure 11B. The salient sequence recognition in Figure 11A indicates that the current part of the sequence is well estimated by at least one neuron, the winning neuron, which correctly predicts the sequence more than twenty steps in advance; see the chart below. In comparison, the dynamics in Figure 11B show a more uniform probability distribution. This situation arises when a bifurcation point is near in the sequence; it indicates that the system cannot correctly predict the next steps of the sequence.
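The idea of temporal integration can be sketched in a simplified form. Equation 2 is not reproduced here; as a stand-in, we assume that a neuron holds one weight per position of its learned sequence and adds that weight only when the observed unit matches the learned order. The function name, the weights and the sequences are hypothetical.

```python
def integrate(preferred, observed, weights):
    """Cumulative activation of one neuron over an observed sequence.

    preferred: the sequence of unit indices the neuron has learned.
    observed:  the sequence of unit indices actually presented.
    weights:   one hypothetical synaptic weight per position.
    A weight is added only when the observed unit matches the learned order.
    """
    total = 0.0
    trace = []
    for expected, seen, w in zip(preferred, observed, weights):
        if seen == expected:
            total += w
        trace.append(total)
    return trace


weights = [1.0, 0.5, 0.25, 0.125]   # assumed decreasing rank weights
learned = [3, 7, 2, 9]              # hypothetical learned sequence

matched = integrate(learned, [3, 7, 2, 9], weights)     # recognized order
scrambled = integrate(learned, [7, 3, 9, 2], weights)   # wrong order
```

When the order is recognized the cumulative value grows at every step, whereas it stays at zero for the scrambled sequence, which mirrors the contrast between a salient and an ambiguous pattern.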
Considering the decision-making process per se, there is no strict competition between the neurons; however, each neuron promotes one spatio-temporal sequence and one probability distribution. Therefore, 25 spatio-temporal trajectories are embedded within the system. Depending on the current situation, some neurons detect one portion of the sequence better than others, and the probability distribution is updated accordingly to chain the actions sequentially, whereas other portions collapse. Decision-making therefore resembles a self-organization process. At this point, no inhibitory system that would avoid a conflict in the sequence order has been implemented directly in the PFC. Instead, the PFC integrates the PPC signals with the ACC error signals. The temporal sequences produced in the PPC to avoid errors at the next moves are learned little by little by reinforcement in the PFC. These sequences become strategies for error avoidance and explorative search. Over time, they learn the prediction of reward and the prediction of errors (Schultz et al., 1997; Schultz and Dickinson, 2000).

FIGURE 9 | Raster plot for PFC neurons. In (A), the PFC learns the particular temporal sequence from the PPC outputs and is sensitive to the temporal order of each unit in the sequence. In (B), the top chart plots the confidence level on the incoming signals from the PPC. The middle chart displays the neural activity of two neurons from the two distinct clusters: neuron #10 in black (cluster #1) and neuron #14 in red (cluster #2). The bottom chart shows the raster plot of the whole system.
We perform a functional analysis of the PFC network in Figure 12. The connectivity circle in Figure 12A makes it possible to visualize the functional organization of the network at the level of the neurons. We subdivide the PFC network into two sub-maps corresponding to the task dynamics, in blue and red. We draw the strong intra-map connections between neurons in the color of their sub-map, as well as the strong inter-map connections between neurons of each map. Each neuron has a different connectivity in the network, and the more connections it has, the more central it is in the network. These neurons propagate information within and between the sub-maps; see Figure 12B. In complex-systems terms, they are hub-like neurons from which different trajectories can be elicited. In decision-making, they are critical points for changing task. The probability density distribution plotted in Figure 12C shows that the maximum number of connections per neuron with strong synaptic weights is four.

FIGURE 12 | PFC network analysis. (A) Connectivity circle for the neurons of the PFC map. The neurons belonging to cluster 1 are displayed in blue and those belonging to cluster 2 in red. The number of links within each cluster (intra-map connectivity) is higher than the number of links between them (inter-map connectivity). Moreover, the number of highly connected neurons is small. These characteristics replicate those of complex systems and of small-world networks in particular. (B) Task switch is done through these hub-like neurons, which can direct the trajectory toward one task or the other. (C) The connectivity level per neuron within the network follows a logarithmic curve typical of complex networks, where the most connected neurons are also the fewest and the most critical, with 4 distant connections. (D) The PFC network enhances the decision-making process in comparison to the PPC-ACC system, due to the learning of the temporal sequence and to its better organization.
The number of neurons drastically diminishes as the number of connections increases, and the trend follows a logarithmic curve. These characteristics correspond to the properties of small-world and scale-free networks.
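The connectivity analysis can be sketched as follows. This is an illustration on a hypothetical five-neuron weight matrix, not the analysis used for Figure 12: we assume that two neurons are strongly connected when the weight in either direction exceeds a hypothetical threshold, and we count such connections per neuron.

```python
from collections import Counter


def strong_degree(weights, threshold):
    """Number of strong connections per neuron.

    weights: square matrix (list of lists) of synaptic weights.
    A pair (i, j) counts once if the weight in either direction
    exceeds the threshold (an assumption of this sketch).
    """
    n = len(weights)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if max(weights[i][j], weights[j][i]) > threshold:
                degree[i] += 1
                degree[j] += 1
    return degree


def degree_histogram(degree):
    """Map each connection count to the number of neurons having it."""
    return dict(sorted(Counter(degree).items()))


# Hypothetical weights: neuron 0 acts as a hub.
W = [[0.0] * 5 for _ in range(5)]
W[0][1], W[0][2], W[3][0], W[0][4] = 0.9, 0.8, 0.7, 0.9
W[1][2] = 0.6
W[3][4] = 0.1   # weak link, ignored

degree = strong_degree(W, threshold=0.5)
histogram = degree_histogram(degree)
hubs = [i for i, d in enumerate(degree) if d == max(degree)]
```

In this toy network a single neuron carries the largest number of strong connections, while most neurons have only one or two, which is the kind of skewed distribution discussed above.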
In Figure 12D, we analyze the performance of the overall system when the PFC is added. The decision-making performed in the PFC reduces the error by a factor of two (ten percent error) in comparison to Experiment 2. The prediction performed in the recurrent map shows that the PFC is well organized to anticipate rewards as well as task switches.

EXPERIMENT 4-ROBOTIC EXPERIMENT ON SENSORIMOTOR MAPPING AND ACTION SELECTION
We now perform a robotic experiment on action selection and decision making in the motor domain, using a robotic arm with 6 degrees of freedom from the company Kinova; see Figure 13.
We draw inspiration, on the one hand, from Wolpert's experiments on structural learning and the representation of uncertainty in motor learning (Wolpert and Flanagan, 2010; Orban and Wolpert, 2011) and, on the other hand, from Iriki's experiments on spatial adaptation following active tool-use (Iriki et al., 1996; Maravita and Iriki, 2004). Here, we attempt to learn different relations between states and motor commands when the robot controls its own arm alone and when it handles a tool. The question arises whether the robot will learn the structural affordances of the tool as a distinct representation or, instead, as part of its limb's representation (Cothros et al., 2006; Kluzik et al., 2008). Iriki et al. (1996) reported that bimodal-cell visual receptive fields (vRFs) show spatial adaptation following active tool-use, but not passive holding. The spatial estimation of the body's own limits, that is, the body image, differs depending on the attention paid to the tool. The goal is therefore to estimate properly the situation the robot is currently in, that is, whether it is handling a stick or not, actively or passively. In our framework, we expect that the errors of spatial estimation of the end-point can be gradually learned and that the sensorimotor mapping will change with respect to the tasks the robot has to perform (Wolpert and Flanagan, 2010; Orban and Wolpert, 2011). Figures 13A,B display the robot arm when it holds a salient toy and when it handles a stick with the toy at its end-point. In this experiment, a fixed camera maps the x-y coordinates of the salient points (i.e., the toy) while the robot moves its arm around its elbow. Note that we restrict the problem to two modalities only, in order to control just one articulation with respect to the Y axis in the camera.
FIGURE 13 | Kinova robot arm for task-set selection. The two task sets correspond to (A) the situation when it moves its hand alone with the red target on its hand and (B) the situation when it moves the stick in its hand with the red target on the tip of the tool.

In the previous experiments, we did not specifically exploit the properties of the gain-field neurons for mapping sensorimotor transformations. Here instead, we use the gain-field mechanism to combine the visuomotor information in the PPC system for the two contexts. With respect to the task, the PPC output maps learn the specific amplitude of the gain-field neurons corresponding to the specific visuomotor relationships (Holmes et al., 2007).
For instance, we plot in Figures 14A-D the activity level of four different gain-field neurons relative to the motor angle θ0 of the robot arm. The blue dots represent the situation when the robot waves its hand in front of the camera and the red dots the situation when it is handling the tool. As the gain-field neurons learn the specific relationship between certain values of the XY coordinates of the end-point effector and the motor angle θ0, their activity is modulated when the robot arm uses the stick; see Figures 14A-D. The visuomotor translation in the XY plane when the robot is handling the tool produces a gain modulation that decreases or increases the neurons' activity level.
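The gain modulation described here can be illustrated with a minimal sketch. It assumes, for illustration only, that a gain-field neuron multiplies a Gaussian tuning curve on the visual Y coordinate by a Gaussian tuning curve on the motor angle; the tuning widths, the preferred values and the function names are hypothetical and are not taken from the model's equations.

```python
import math


def gaussian(x, center, width):
    """Gaussian tuning curve with a peak value of 1 at x == center."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


def gain_field(y_visual, theta, y_pref, theta_pref,
               width_y=0.2, width_theta=0.3):
    """Multiplicative combination of a visual and a motor tuning curve."""
    return (gaussian(y_visual, y_pref, width_y)
            * gaussian(theta, theta_pref, width_theta))


theta = 0.5   # same motor angle in both conditions (arbitrary units)

# Hand alone: the target is seen at the neuron's preferred position.
hand = gain_field(y_visual=0.4, theta=theta, y_pref=0.4, theta_pref=0.5)

# With the tool: the target is shifted in the image for the same angle.
tool = gain_field(y_visual=0.7, theta=theta, y_pref=0.4, theta_pref=0.5)
```

For the same motor angle, the visual shift introduced by the tool lowers the activity of this particular neuron; a neuron with a different preferred position would instead increase its activity.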
Hence, the visuomotor coordination instantaneously changes the GF neurons' activity level relative to the current task set, and the PPC is dynamically driven by the input activity (not displayed). The neural activity in the PFC map, instead, can evolve autonomously and independently of the input activity, even if the PPC dynamics are presented only for a short exposure; this behavior is displayed in the raster plot in Figure 15A.

FIGURE 14 | Dynamics of the gain-field neurons relative to the task. (A-D) In blue, the robot moves its hand freely. In red, the robot is handling the tool. Depending on what the GF neurons have learned, their peak level will diminish or increase when changing the task (i.e., using a tool).

When we expose the PFC neurons to the PPC dynamics for a short period of time, 20 iterations every 500 iterations (the segments on the top chart), the network is able to dynamically reconstruct the rest of the ongoing sequence; see Figure 15B. For instance, neuron #8 is selective to the particular hand-free context (blue lines). The contextual information is maintained as a stable pattern of neural activity in the working memory, and the contexts are accessible and available for influencing the ongoing processing. As a recurrent network, the PFC behaves similarly to a working memory. It embeds the two different strategies depending on the context, even in the presence of incomplete inputs, and can select whether or not to attend to the tool.

DISCUSSION
The ability to learn the structure of actions and to select on the fly the proper one given the current task is one great leap in infant cognition. During development, infants learn to differentiate their motor behaviors relative to various contexts by exploring and identifying, by trial and error, the correct structures of causes and effects that they can perform. This behavior corresponds to an intrinsic motivation, a mechanism that is argued to drive cognitive development. Besides, Karen Adolph emphasizes the idea of "learning-to-learn" in motor development, an expression akin to Harlow's that appears in line with that of intrinsic motivation. She proposes that two learning mechanisms embody this concept during the development of the motor system, namely an associative memory and a category-based memory, and that the combination of these two learning systems is involved in this capacity of learning-to-learn. Braun et al. (2010) foster a similar concept and suggest that motor categorization requires (1) a critic for learning the structure, i.e., an error-based system, and (2) a learning system that learns the conditional relationships between the incoming variables, that is, the parameters of the task. They argue that once these parameters are found, it is easier to transfer knowledge from one initial task to many others. All in all, we believe that these different concepts of structural learning are important to scaffold motor development and to have intrinsic motivation in one system. Thus the question arises: what are the neural mechanisms involved in structural learning and in flexible behaviors?
To investigate this question, we have modeled an architecture that attempts to replicate the functional organization of the fronto-parietal structures, namely, a sensorimotor mapping system, an error-processing system and a reward predictor (Platt and Glimcher, 1999; Westendorff et al., 2010). The fronto-parietal cortices are involved in activities related to the observation of alternatives and to action planning, and the anterior cingulate cortex is a part of this decision-making network. Each of these neural systems contributes to one functional part of it. The ACC system processes the error-negativity reward and conveys it to the PPC maps for specialization and to the PFC network for reward prediction. The PPC network organizes the sensorimotor mapping for different tasks, whereas the PFC learns the spatio-temporal patterns during the act.
In particular, the PPC is organized around the mechanism of gain modulation, where the gain-field neurons combine the sensory inputs with each other. We suggest that the mechanism of gain modulation can implement the idea of structural learning in motor tasks proposed by Wolpert and colleagues (Braun et al., 2009, 2010). In their framework, the gain-field neurons can be seen as basis functions and as the parameters of the learning problem. It is interesting to note that Braun et al. draw a parallel with the Bayesian framework, which has also been proposed to describe the gain-field mechanism. For instance, Denève explains the computational capabilities of gain fields in the context of the Bayesian framework to efficiently represent the joint distribution of a set of random variables (Denève and Pouget, 2004).
In parallel, we used three specific intrinsic mechanisms for enhancing structural learning: the rank-order coding algorithm, cortical plasticity and an error-based reward. For instance, the rank-order coding algorithm was used to efficiently emulate the so-called spike timing-dependent plasticity in order to learn spatio-temporal sequences in a recurrent network (Bi and Poo, 1998; Abbott and Nelson, 2000). The PFC system exploits its properties to self-organize by learning the sequences of each task as well as the switch points. PFC neurons learn specific trajectories and, at each iteration, a competition process is at work to promote the new steps of the ongoing sequence. Besides, cortical plasticity was modeled in the PPC maps with an activity-dependent learning mechanism that promotes the rapid learning of novel (experience-based) tasks and the stabilization of the old ones. An advantageous side effect of this mechanism is that PPC neurons become context-dependent, a behavior also observed in the reaching neurons of the parieto-motor system, the so-called mirror neurons (Gallese et al., 1996; Brozovic et al., 2007). The results found on cortical plasticity are in line with observations on the rapid adaptation of the body image and of motor control. Wolpert observed that the motor system incorporates a slow learning mechanism alongside a fast one for the rapid formation of task sets (Wolpert and Flanagan, 2010). Cortical plasticity is also influenced by an error-based system in ACC that reshapes the PPC dynamics with respect to the task. The negative reward makes it possible to inhibit the wrong dynamics but not to elicit the correct ones. The latter are gradually found by trial and error, which replicates an exploration process.
We believe that these different mechanisms are important for incremental learning and intrinsic motivation. However, many gaps remain. For instance, a truly adaptive system should show more flexibility during familiar situations than during unfamiliar ones. Following Adolph and Joh (2005), a key to flexibility is (1) to refrain from forming automatic responses and (2) to identify the critical features that allow online problem solving to occur. This ability is still missing in current robots. In the context of problem solving in tool-use, Fagard and O'Regan emphasize the similar difficulty infants have in using a stick to reach a toy. They also observe that, below a certain age, attention is limited to one object only, as infants cannot "hold in mind" the main goal in order to perform one subgoal (Rat-Fischer et al., 2012). Above this age, however, Fagard and O'Regan observe an abrupt transition in the infants' behaviors, when they become capable of relating two actions at a time, planning consecutive actions and using recursion. They hypothesize that after 16 months, infants are able to enlarge their focus of attention to two objects simultaneously and to "bufferize" the main goal. We draw a parallel with the work of Koechlin and colleagues (Koechlin et al., 2003; Collins and Koechlin, 2012), who attribute a monitoring role to the frontal cortex for maintaining the working memory relative to the current tasks and for prospecting the different action sequences or episodic memories (Koechlin and Summerfield, 2007); this will be our next step.