Learned parametrized dynamic movement primitives with shared synergies for controlling robotic and musculoskeletal systems

A salient feature of human motor skill learning is the ability to exploit similarities across related tasks. In biological motor control, it has been hypothesized that muscle synergies, coherent activations of groups of muscles, allow for exploiting shared knowledge. Recent studies have shown that a rich set of complex motor skills can be generated by a combination of a small number of muscle synergies. In robotics, dynamic movement primitives are commonly used for motor skill learning. This machine learning approach implements a stable attractor system that facilitates learning and it can be used in high-dimensional continuous spaces. However, it does not allow for reusing shared knowledge, i.e., for each task an individual set of parameters has to be learned. We propose a novel movement primitive representation that employs parametrized basis functions, which combines the benefits of muscle synergies and dynamic movement primitives. For each task a superposition of synergies modulates a stable attractor system. This approach leads to a compact representation of multiple motor skills and at the same time enables efficient learning in high-dimensional continuous systems. The movement representation supports discrete and rhythmic movements and in particular includes the dynamic movement primitive approach as a special case. We demonstrate the feasibility of the movement representation in three multi-task learning simulated scenarios. First, the characteristics of the proposed representation are illustrated in a point-mass task. Second, in complex humanoid walking experiments, multiple walking patterns with different step heights are learned robustly and efficiently. Finally, in a multi-directional reaching task simulated with a musculoskeletal model of the human arm, we show how the proposed movement primitives can be used to learn appropriate muscle excitation patterns and to generalize effectively to new reaching skills.


INTRODUCTION
Reinforcement Learning of motor skills in robotics is considered to be very challenging due to the high-dimensional continuous state and action spaces. In many studies it has been shown that learning can be facilitated by the use of movement primitives (Schaal et al., 2003;Rückert et al., 2013). Movement primitives are parametrized representations of elementary movements, where typically for each motor skill a small set of parameters is tuned or learned. However, many motor control tasks are related and could be learned more effectively by exploiting shared knowledge. This is a well-known concept in motor neuroscience, where muscle synergies or coherent activations of groups of muscles (d' Avella et al., 2003;d'Avella and Bizzi, 2005;Bizzi et al., 2008) have been proposed to simplify the control problem of complex musculoskeletal systems. In analyzing muscle activation recordings it has been demonstrated that by combining only few muscle activation patterns multiple task instances of natural motor behaviors, e.g., fast reaching movements of humans (d 'Avella et al., 2006), primate grasping movements (Overduin et al., 2008), or walking patterns of infants, toddlers, and adults (Dominici et al., 2011) could be efficiently modeled. One important finding of theses studies is that the dimensionality of the motor control problem can be drastically reduced by reusing common knowledge of related tasks, i.e., grasping objects at different locations using a linear combination of shared muscle synergies. While this has been demonstrated in biological data analysis, only few robotic applications exist that use this shared task knowledge (Chhabra and Jacobs, 2006;Alessandro et al., 2012). These methods demonstrate the advantages of shared synergies in learning robotic tasks. However, different procedures were applied to obtain a parametric description of synergies, i.e., in Chhabra and Jacobs (2006) a variant of non-negative matrix factorization (d' Avella et al., 2003) was used given a set of pre-computed trajectories and in Alessandro et al. (2012) the synergies were extracted from dynamic responses of a robot system with random initialization. In contrast, we propose to learn the synergies representation in a reinforcement learning framework, where task-specific and task-invariant parameters in a multi-task learning setting are learned simultaneously.
In robotics the most widely used approach for motor skill learning are Dynamic Movement Primitives (DMPs) (Schaal et al., 2003;Ijspeert et al., 2013). This approach uses parametrized dynamical systems to determine a movement trajectory and has several benefits. First, as it is a model-free approach, there is no need to learn the typically non-linear, high-dimensional dynamic forward model of a robot (However, this is not the case when inverse dynamics controller are used to compute the control commands). Second, it provides a linear policy parametrization which can be used for imitation learning and policy search (Kober and Peters, 2011). The complexity of the trajectory can be scaled by the number of parameters (Schaal et al., 2003) and one can adapt meta-parameters of the movement such as the movement speed or the goal state (Pastor et al., 2009;Kober et al., 2010). Finally, the dynamical system is constructed such that the system is stable. This simplifies learning since even without modulating the dynamical system the movement trajectory is always attracted by a known (or learned) goal state. However, this parametrization does not allow for reusing shared knowledge, as proposed by the experimental findings studying complex musculoskeletal systems (d' Avella et al., 2003;Bizzi et al., 2008;d'Avella and Pai, 2010). Thus, typically for each motor task an individual movement parametrization has to be learned.
In this paper we propose to use a superposition of learned basis functions or synergies to modulate the stable attractor system of DMPs. This allows for reusing shared knowledge for learning multiple related tasks simultaneously while preserving the benefits of the dynamical systems, i.e., the stability in learning complex motor behavior. The synergies and their activation in time are learned from scratch in a standard reinforcement learning setup. Note that imitation learning could also be applied to implement an initial guess for the synergies, e.g., by using decomposition strategies discussed in d' Avella and Tresch (2001). However, this is beyond the scope of this paper. Moreover, our approach is like the DMPs applicable to discrete and rhythmic movements and allows for modeling time-varying synergies (d' Avella et al., 2006). We therefore denote our approach DMPSynergies. By using for each task a combination of individual, temporally fixed, basis functions DMPs can be modeled as special case of this approach. The benefit of the common prior knowledge is even more drastic when generalizing to new motor tasks given the previously learned basis functions. Thus, for simpler synergies only the weights for the linear combination have to be acquired and for time-varying synergies additionally the time-shift parameters need to be learned. This is demonstrated on a complex walking task and on reaching task using an arm actuated by muscles.
As in previous studies on DMPs (Meier et al., 2011;Mülling et al., 2013) we want to go beyond basic motor skills learning. However, in contrast to those studies that use a library of primitives for sequencing elementary movements (Meier et al., 2011) or mixing basic skills (Mülling et al., 2013), we implement the common shared knowledge among multiple tasks as prior in a hierarchical structure. On the lower level task related parameters, i.e., amplitude scaling weights or time-shift parameters are used to modulate a linear superposition of learned basis functions, the shared higher level knowledge. This has the promising feature that by combining just a small number of synergies diverse motor skills can be generated.
In the Materials and Methods, we will first briefly introduce DMPs (Schaal et al., 2003;Ijspeert et al., 2013) as we build on this approach. We then extend DMPs to allow for reusing shared task knowledge in the form of parametrized synergies. The advantage of the shared knowledge is evaluated in the Results on three multi-task learning scenarios. First, a simple via-point task is used to demonstrate the characteristics of the proposed representation. Then, rhythmic movements are learned in a dynamic 5-link planar biped walker environment. Finally, a musculoskeletal model of a human arm is used to evaluate our primitives on a muscle actuated system learning discrete reaching movements to multiple targets.

DYNAMIC MOVEMENT PRIMITIVES
DMPs generate multi-dimensional trajectories by the use of non-linear differential equations (simple damped spring models) (Schaal et al., 2003). The basic idea is to use for each degree-offreedom (DoF), or more precisely for each actuator, a globally stable, linear dynamical system of the form which is modulated by a learnable non-linear function f . The final position of a movement is denoted by g and the variables y * andẏ * represent the desired state in i.e., joint angles and joint velocities. The time constants α and β are usually pre-defined. The temporal scaling factor τ can be used for de-or accelerating the movement execution as needed. Finally z denotes an internal variable of the dynamical system. For each DoF an individual function f is used which is different for discrete and rhythmic movements. For discrete movements the function f only depends on the phase s, which is an abstraction of time and was introduced to scale the movement duration (Schaal et al., 2003). The function f (s) is constructed of the weighted sum of N Gaussian basis functions n where for discrete movements these Gaussian basis functions are Only the weights w n are parameters of the primitive which can modulate the shape of the movement. The centers or means μ n ∈ [0, 1] specify at which phase of the movement the basis function becomes active. They are typically equally spaced in the range of s and not modified during learning. The bandwidth of the basis functions is given by h 2 n and is typically chosen such that the Gaussians overlap.
For rhythmic movements periodic activation functions are used (Ijspeert et al., 2002). The non-linear function f reads where the periodic phase angle is denoted by φ ∈ [0, 2π]. In Ijspeert et al. (2002) additionally a scalar variable was used to scale the amplitude of the oscillator, which was omitted for simplicity. The basis functions are given by which implement von Mises basis functions. Note that for the periodic basis functions the trajectory in Equation 1 oscillates around the attractor point or goal state g.
Integrating the dynamical systems in Equation 1 for each DoF results into a desired trajectory y * t ,ẏ * t of the joint angles. To follow this trajectory, in the most simple case a linear feedback controller is subsequently used to generate appropriate control commands denoted by u t : For each actuator the linear weights W = [w 1 , . . . , w D ] as well as the control gains k pos and k vel have to be specified, i.e., θ = [W, k pos , k vel ]. This results into ND + 2D parameters for the movement representation, where D denotes the number of actuators or muscles of a system. The simulated trajectory is denoted by y t ,ẏ t . In multi-task learning we want to learn k = 1..K tasks simultaneously. For very simply tasks, such as the via-point experiments described below, it could be sufficient to adapt the goal state g. However, this is usually not the case for more complex motor skill learning tasks in robotics. With DMPs typically for each motor skill an individual movement parametrization θ k has to be learned. However, if we assume similarities among these tasks the learning problem could potentially be simplified by reusing shared knowledge. Inspired by experimental findings in biology (d' Avella et al., 2003;Bizzi et al., 2008;d'Avella and Pai, 2010) we extend these DMPs. Only the parametrization for the non-linear function f (s) for discrete movements or f (φ) for rhythmic movement changes. The dynamical system in Equation 1 and the linear feedback controller in Equation 4 remains the same.

DYNAMIC MOVEMENT PRIMITIVES WITH SHARED SYNERGIES (DMPSynergies)
For learning the kth task, we propose to use a linear combination of temporal flexible basis functions or synergies to parametrize the non-linear function f (s) in Equation 2 or for rhythmic movements f (φ) in Equation 3: where s denotes the phase variable which is only used for discrete movements. As with DMPs ( n in Equation 2) the functions (.) and (.) are different for discrete and rhythmic movements. All K tasks share m = 1..M synergies which are parametrized via the vector θ m . Solely the weights β m, k and the time-shift s m, k are individual parameters for each task. The basic concept of the model is sketched in Figure 1 for a one-dimensional discrete movement.
The complexity of each synergy is controlled by the number of Gaussians for discrete movements or by the number of von Mises basis functions for rhythmic patterns. We denote this number by N, where we parametrize in both cases the amplitude, the mean and the bandwidth. DMPs (Schaal et al., 2003) can be modeled as a special case of this formulation. For DMPs using n = 1..N basis functions the mean μ m, n and the bandwidth h m, n of the basis functions are fixed as discussed in Section 2.1. Solely the n = 1..N amplitudes or weights a m, n are learned. By fixing these parameters and by modeling the non-linear function f (s) for discrete movements or f (φ) for rhythmic movements using a single (M = 1) synergy our representation can be used to implement DMPs.

Multi-dimensional systems
For multi-dimensional systems for each actuator d = 1..D an individual dynamical system in Equation 1 and hence an individual function f (s, k) in Equation 5 or f (φ, k) in Equation 6 is used (Schaal et al., 2003). The phase variable s or φ is shared among all DoF (Note that k = 1..K denotes the task.). Extending our notation for multi-dimensional systems the non-linear function f (s, k) in Equation 5 can be written as Depending on the dimension d different weights β m, k, d , policy vectors θ m,d and time-shift parameters s m, k, d are used. Note that the policy vector θ m,d is task-independent. Interestingly, when implementing additionally dimension-independent policy vectors, i.e., θ m anechoic mixing coefficients (Giese et al., 2009) can be modeled.
Here, we only discuss discrete movement representations, however, the reformulation procedure applies also for rhythmic movement parametrizations. Let us also define a vector notation of f(s, k) where the symbol • denotes the Hadamard product, the elementwise multiplication of vectors. This vector notation is used in the following to compare to existing synergies representations (d' Avella et al., 2003Avella et al., , 2006).

MUSCULOSKELETAL MODELS AND MUSCLE SYNERGIES
We also use the proposed movement representation, the DMPSynergies, to generate muscle excitation patters. These patterns are applied as input in a forward simulation of a musculoskeletal model. A schematic overview of such a simulation is shown in Figure 2. We briefly discuss all processes involved. Muscle synergies for generating muscle excitation patterns are used as input in forward dynamics simulations. In our simulation experiments we evaluate time-varying synergies (d' Avella et al., 2006), which are a particular instance of the DMPSynergies, i.e., the weights β m, k and time-shift parameters s m, k in Equation 9 are independent of the dimension d. Thus, for discrete movements in multi-dimensional systems f (s, k) reads where β m, k is a scalar and the time-shift parameter s m, k is directly added to the phase variable s. This allows for a comparison to e.g., the formulation of time-varying synergies given in d' Avella et al. (2006), where Shared synergies are represented by time-dependent vectors v m (t − t k m ), where in contrast to the proposed DMPSynergies a minor difference is the sign of the time-shift parameter t k m . In this formulation of time-varying synergies (d' Avella et al., 2006) only the time-invariant combination coefficients a k m are task-dependent, whereas the vector v m is task-independent. However, by using task, spatial or temporal (in)variant implementations of the mixing coefficients a or the basis vectors v other representations of synergies (d' Avella et al., 2003;Ivanenko et al., 2004;Giese et al., 2009) can be also implemented.
Activation dynamics model the effect of the delayed force generating process in muscles, as they are not capable of generating force instantaneously. Typically, for each muscle a first order differential equation is used, i.e., (Zajac, 1989).

FIGURE 2 | Forward simulation of musculoskeletal models.
Muscle excitation patterns are used as input, which result in a delayed muscle activity response (activation dynamics). Muscle forces are the result of simulated muscle tendon dynamics, which are typically approximated by a Hill-type contractile element in series with tendon. These muscle forces are used to compute moments of force considering the musculoskeletal geometry. A physics engine is finally used to simulate multibody dynamics which are numerically integrated to generate movement trajectories. Here, f (s, k) denotes the generated muscle excitation signal using e.g., the proposed DMPSynergies. The actual muscle activation is denoted by a, which is a function of the rise time constant τ rise and the fall time constant τ fall . For our evaluations we implemented τ rise = 10 ms and τ fall = 40 ms (Winters and Stark, 1985). Muscle tendon dynamics describe the complex and non-linear force generation properties of muscles. For an approximation a variety of models exist (Zajac, 1989). In these models a muscle is approximated by a number of musculotendinous units, each of which is implemented by a Hill-type contractile element in series with tendon. Characteristic properties of muscles are the optimal fiber length L M 0 , the maximum isometric force F M 0 , and the muscle pennation angle α, which are shown in Table A5 in the appendix for the investigated model of a human arm. The tendon dynamics in this musculoskeletal model were approximated by the muscle model proposed in Schutte et al. (1993).

Frontiers in
Musculoskeletal geometry represents the path of a muscle from its origin to its insertion that can be implemented as a series of straight-line path segments, which pass through a series of via points (Delp et al., 2007). To simulate how muscles wrap over underlying bone and musculature wrapping surfaces i.e., cylinders, spheres and ellipsoids are implemented, where this model is based on the upper extremity model discussed in Holzbaur et al. (2005). A detailed description of the implemented musculoskeletal geometry is given in the supplement (in form of a simulation model file, .osim).
Multibody dynamics are simulated by the physics simulation application OpenSim (Delp et al., 2007;Seth et al., 2011). It is an open source software that already implements a variety of muscle models (Zajac, 1989) and a large number musculoskeletal models are freely available. In our experiments the computational time needed to simulate a movement with a duration of e.g., 500 ms takes between 10 and 20 s (OpenSim implements numerical integrators with an adaptive time step) on a standard computer (3 GHz and 4 GB memory). However, we exploited parallel computing techniques for policy search, which resulted in a gain of factor 10. Alternatively, the muscle dynamics could be approximated via regression methods to speed-up the simulations (Chadwick et al., 2009).

LEARNING WITH MOVEMENT PRIMITIVES
We denote the parametrization of a movement primitive by a policy vector θ. A widely used approach in robotics to learn these parameters is episodic reinforcement learning (Kober and Peters, 2011), which is outlined in Figure 3A. A policy search method is used to improve the movement primitive's representation θ assuming a given objective or reward function C(τ) ∈ R 1 . Throughout this manuscript C(τ) denotes a cost value that is equivalent to the negative reward in classical reinforcement learning (Sutton and Barto, 1998). It indicates the quality of an executed movement. A trajectory τ = y 1:T , u 1:T − 1 is specified by the simulated joint angles y and the applied controls (torques) u, where T denotes the number of time steps. We want to find a movement primitive's parameter vector θ * = argmin θ J(θ) which minimizes the expected costs J(θ) = E [C(τ)|θ]. We assume that we can evaluate the expected costs J(θ) for a given parameter vector θ by performing roll-outs (samples) on the real or simulated system. In other words each movement trajectory is quantified by a single scalar reward C(τ), which can be used by an optimization method to improve the best guess of the movement policy θ.
For learning or optimizing the policy parameters θ a variety of policy search algorithms exist in the motor control literature. Examples are the REINFORCE (Williams, 1992), the episodic Natural Actor Critic (Peters and Schaal, 2008), the Power (Kober and Peters, 2011) or the PI 2 (Theodorou et al., 2010) algorithm, which are reviewed in Kober and Peters (2011). Alternatively, standard optimization tools such as the 2nd order stochastic search methods (Hansen et al., 2003;Wierstra et al.,

FIGURE 3 | Overview of the learning framework. (A)
A parametrized policy θ modulates the output of a movement primitive that is used to generate a movement trajectory τ. The quality of the movement trajectory is indicated by a sparse reward signal C(τ) which is used for policy search to improve the parameters of the movement primitive. For a single iteration the implemented policy search method -Covariance Matrix Adaptation (CMA) (Hansen et al., 2003) is sketched in (B). The parameter space is approximated using a multivariate Gaussian distribution denoted by the ellipses, which is updated (from left to right) using second order statistics of roll-outs or samples that are denoted by the large dots (see text for details).

Frontiers in Computational Neuroscience
www.frontiersin.org October 2013 | Volume 7 | Article 138 | 5 2008; Sehnke et al., 2010) can be used for policy search. These machine learning tools make no assumptions on a specific form of a policy and typically have just a single parameter to tune, the initial exploration rate. We therefore use the stochastic search method Covariance Matrix Adaptation (CMA) (Hansen et al., 2003) for learning the policy parameters in our experiments. Roughly, CMA is an iterative procedure that locally approximates the function C(τ(θ)) by a multivariate Gaussian distribution, which is denoted by the ellipse in the sketch in Figure 3B. From left to right a single optimization step for a two-dimensional policy vector θ = [w 1 , w 2 ] is shown. The colored regions denote the unknown optimization landscape, where solid lines depict equal C(τ) values. From the current Gaussian distribution, denoted by the ellipse in the left panel, CMA generates a number of samples, denoted by the black dots, evaluates the samples (the size of the dots in the center panel is proportional to their C(τ) values), computes second order statistics of those samples that reduced C(τ) and uses these to update the Gaussian search distribution, which is shown in right panel. For algorithmic details we refer to Hansen et al. (2003).
Note that for most interesting robotic tasks the unknown optimization landscape that is also sketched in Figure 3B is multimodal and policy search might converge to a local optimum. Thus, the result of learning is sensitive to the initial policy parameters θ and for evaluating the convergence rate of different policy search methods multiple initial configurations should be considered (Kober and Peters, 2011). However, in this manuscript we evaluate the characteristics of movement primitive representations and put less emphasis on a particular policy search method. As we will demonstrate in our experiments CMA is robust in terms of converging to "good" solutions given the initial values of the evaluated movement primitive representations listed in the appendix.
In our experiments we compare single task learning with DMPs to learning multiple tasks simultaneously with DMPSynergies. With DMPs for each task k = 1..K an individual policy vector θ k is learned, where the objective function used in policy search takes the task index as additional argument, i.e., C(τ, k). For learning multiple tasks simultaneously with DMPSynergies the policy vector θ encodes all K task specific parameters β m, k and s m, k , and all shared parameters denoted by θ m in Equation 5 or Equation 6. The objective function is the sum of the individual task dependent costs

RESULTS
We evaluated the proposed movement representation, the DMPSynergies, with simulations using three multi-task learning scenarios. A simple via-point task was used to illustrate the characteristics of the proposed movement representation. A challenging robotic learning task was used to generate rhythmic walking patterns for multiple step heights. Discrete reaching movements were learned using a musculoskeletal model of a human arm with eleven muscles.

VIA-POINT REACHING TASK WITH A SIMPLE TOY MODEL
The goal of this simple multi-task learning problem is to pass through k = 1..5 via-points (vp k ∈ {0.2, 0.1, 0, −0.1, −0.2}), denoted by the large dots in Figure 4A and navigate to the goal state g at 1. We used a point mass system (1 kg), where the state at time t is given by the position y t and the velocityẏ t . The applied controls u t shown in Figure 4B are computed using the linear feedback control law with k pos = 400 and k vel = 15 specified in Equation 4. The finite time horizon is given by T = 50. For the dynamical system in Equation 1 we used the parameters α z = 2, β z = 0.9 and τ = 0.1. Further parameter settings used for policy search are summarized in Table A1 in the appendix.
This task is specified by the objective function The first two terms punish deviations from the via-point vp k and the goal state g, where y t vp k denotes the position of the state at the time index of the kth via-point. The last term punishes high energy consumption, where u t denotes the applied acceleration.
Note that for simplicity we did not introduce the variable τ denoting the movement trajectory in C(τ, k) in Subsection 2.4. We always add a Gaussian noise term with a standard deviation of σ = 0.5 to the control action to simulate motor noise. We used a single synergy (M = 1) with N = 2 Gaussians to model the shared prior knowledge. The learning curve is shown in Figure 4C, where we compare to single-task learning using DMPs with N = 8 Gaussian basis functions. For the via-point task 8 Gaussians were optimal with respect to the convergence rate, where we evaluated representations using N = 2..20 Gaussians (not shown). Additionally, we compare to an incremental learning setup (DMP inc.) in Figure 4C, where the DMP representation is always initialized with the best learned solution from the previous task. On the x-axis the number of samples or trajectory evaluations on the point mass system is plotted. As it can be seen the proposed approach can benefit from the shared knowledge and has a faster overall learning performance.
In this experiment, for each task we fixed the time-shift s k = 0 and only learned the k = 1..5 weights β k in Equation 5 (Note that the synergy index m was omitted as only a single synergy was used). For each of the N = 2 Gaussians we learned the mean μ, the bandwidth h and the amplitude a in Equation 7. Thus, in total 5 + 2 × 3 = 11 parameters were learned. In contrast with DMPs 8 Gaussian amplitudes were optimized.
The β values of the DMPSynergies representation for the five via-points are shown in Figure 4D for 10 runs. New motor skills can be generated without re-learning via a simple linear interpolation. The resulting trajectories are shown in Figure 4E. However, this is only the case in the simple via-point task. For more complex tasks these β values have to be learned.

DYNAMIC BIPED WALKER TASK
To evaluate the DMPSynergies on a multi-dimensional robotic task we learned multiple walking patterns using a 5 degree-of-freedom (DoF) dynamic biped robot model, which is shown in Figure 5A. We demonstrate that by exploiting the shared knowledge among multiple walking gaits, solutions could be found more robustly and more efficiently in terms of learning speed compared to single task learning. Further, the shared synergies could be used to generalize new skills. The model is only as complex as required to study difficulties like limb coordination, effective underactuation, hybrid dynamics or static instabilities. More details on the design and challenges can be found in Westervelt et al. (2004). The 10-dimensional state q t = [q 1:5 ,q 1:5 ] of the robot is given by the hip angles (q 1 and q 2 ), the knee angles (q 3 and q 4 ), a reference angle to the ground (q 5 ), and the corresponding velocitiesq 1:5 . Only the hip and the knee angles are actuated. Thus, 4 dynamical systems in Equation 1 are used to generate desired trajectories for the linear feedback controller in Equation 4. A phase resetting strategy is implemented to facilitate learning (Nakanishi et al., 2004). At each impact of the swing leg the phase φ in Equation 8 is set to zero. This increases the stability of the robot as the gait cycle duration is implicitly given by the impact time.
The initial state q 1 ∈ R 10 , the goal state g and the control gains in Equation 4 were optimized in advance for a desired step height of r * = 0.2 m to simplify learning. The resulting values are shown in Table A2 in the appendix. For rhythmic movements the goal state g ∈ R 5 models an attractor point which is only specified for joint angles and not for velocities in Equation 1. As for the viapoint reaching task, Gaussian noise with σ = 1 is added to the   f (φ, k) for the walker task.
The learned non-linear functions f (φ, k) are illustrated in the first four rows. The four task instances, i.e., the desired step heights are shown from left to right. For each actuator (left hip, right hip, left knee, and right knee) an individual function f (φ, k) is used that is generated by combining two learned synergies shown in the last two rows. These synergies are shared among multiple task instances and can be scaled and shifted in time (via β m, k and s m, k ). This is indicated by the enclosing rectangles, where the color of the synergies is used to distinguish the four actuators of the walker model.
bars denote the maximum step heights during a single step (0.19, 0.24 and 0.31 m). The objective function for a single walking task is given by the distance travelled in the sagittal plane, the duration of the simulation and deviations from the desired step height r * k with k = 1..4: where x denotes the x-coordinate of the hip, S the number of steps and r i the maximal step height during the ith step. We used a time step t = 2 ms. The time horizon T ∈ [1, 5000] is given by the last valid state of the robot, where the biped does not violate the joint angle constraints specified by q min and q max in Table A2 in the appendix.
With the proposed DMPSynergies the non-linear function f (φ, k) in Equation 6 is generated by combining a set of learned synergies that are shared among multiple task instances, i.e., the four (k = 1..4) desired step heights. This combination mechanism is illustrated for a representation using M = 2 synergies modeled by N = 3 Gaussians in Figure 6. For each actuator (left hip, right hip, left knee, and right knee) an individual function f (φ, k) is generated, which is subsequently used to modulate an attractor system shown in Equation 1 to compute the desired movement trajectories. The shared synergies shown in the last two rows in Figure 6 can be scaled and shifted in time. This is indicated by the enclosing rectangles. Note that the color of the synergies is used to distinguish the four actuators of the walker model.
We evaluated different movement primitive representations with increasing complexity compared to single-task learning using DMPs with N = 4 and N = 8 Gaussians. The average final costs C mean after learning over 10 runs are shown in Table 1. In the most simple representation we used M = 2 synergies modeled by N = 2 Gaussians. More complex representations implementing time-varying synergies are denoted by the symbol = 1 in Table 1. Here, additionally the time-shifts s 1:M were learned for all synergies and all actuators. However, the final learning performance did not outperform the representation with fixed time-shifts (i.e., M = 2, N = 3 and = 0: −21.4 ± 0.4 compared to M = 2, N = 3 and = 1: −20.5 ± 1.4). This can be also seen in Figure 7, where we plot the learning curve for synergies with s 1:M = 0 in Figure 7A and the results for time-varying synergies in Figure 7B. The average final cost value of the DMP representation is higher (i.e., DMP N = 8 : −14.8 ± 1.4) compared to the best costs achieved with shared synergies (M = 2, N = 3 and = 0: −21.4 ± 0.4). This holds also for an incremental learning setup (e.g., DMP inc. N = 4 : −19.2 ± 0.6), where DMPs were initialized with the best result from the previous task.
The joint angle trajectories of the left hip and knee joint for the DMPSynergy representation using M = 2 synergies modeled by N = 3 Gaussians and = 1 are illustrated in Figure 8. The average step heights were r ∈ {0.22, 0.22, 0.26, 0.28}, which do not match the desired step heights r * ∈ {0.15, 0.2, 0.25, 0.3}. The reason for this is that the objective function in Equation 11 is designed to prefer correct multi-step walking movements over exact matches of the step heights since learning to walk is already a complex learning problem (approximately 90% of the costs are determined by the travelled distance and only 5% are caused by the distance to the desired step heights). However, for the different desired step heights the shape of the trajectories as well as the moment of the impact vary. The moments of impact are denoted by arrows in denotes an incremental learning setup, where DMPs were initialized with the best result from the previous task. Generalizing to a new step height (r * = 0.1 m) is shown in (C), where we applied the best learned policy for DMPSynergies from (B) and only optimized the weights β 1:2 for the two (fixed) synergies. The corresponding average step height over all steps is shown in (D). We compare to DMPs with N = 8 Gaussians. While generalizing to new motor skills was straightforward for the simple via-point task, for the walking tasks a linear interpolation turns out to be ineffective. We therefore demonstrate in Figures 7C,D how a new walking pattern for a desired step height of r * = 0.1 m can be learned be reusing the previously learned prior knowledge (taking the best solution for r * = 0.25 m) for M = 2 synergies modeled by N = 3 Gaussians and = 1. Only the weights β 1:M are optimized in this experiment, keeping the learned time-shifts fixed. The costs in Figure 7C and the average step height r in Figure 7D demonstrate the advantage of using a fixed prior, where we compare to DMPs with N = 8 Gaussians.

MULTI-DIRECTIONAL REACHING TASK WITH A MUSCULOSKELETAL MODEL OF THE HUMAN ARM
A simplified model of a human arm based on the model by Holzbaur et al. (2005) was used to learn six reaching tasks simultaneously. The shoulder and the elbow joint were modeled by hinge joints. Thus, only movements in the sagittal plane were possible. The initial arm configuration and the six target locations (with a distance of 15cm to a marker placed on the radial stylion) are shown in Figure 9A. A learned example movement is illustrated in Figure 9B, where the cylinders, the spheres and the ellipsoids denote wrapping surfaces discussed in Subsection 2.3. We focused on fast reaching movements of 500 ms duration (T = 500 and t = 1 ms) that can be implemented in an open-loop control scheme. Note that with our approach also closed-loop systems with feedback could be implemented, as discussed below. Thus, the learnable non-linear function f(s, k) in Equation 10 is directly used as input to the system in form of muscle excitation patterns. The parameter settings for learning are shown in Table A4 in the appendix.
For learning the reaching tasks we evaluated the Euclidean distance of a marker v k (t) placed on the radial stylion to a given target g k , where k = 1..6 denotes the task index. Additionally, large muscle excitations signals are punished: where . denotes the Euclidean distance between the marker v k (t) and the target g k at time t. We evaluated five movement representations, defined in Equation 10, with an increasing number of shared synergies, i.e., M = {1, 2, 3, 4, 5}. Each synergy is represented by a single (N = 1) Gaussian. For each target and for each synergy the task-specific parameters β k,m and s k,m are learned. The number of taskspecific and the number of task-invariant or shared parameters is shown in Table 2.

Frontiers in Computational Neuroscience
www.frontiersin.org October 2013 | Volume 7 | Article 138 | 9 FIGURE 9 | Musculoskeletal model for learning reaching tasks. A model of a human arm with eleven muscles shown in Table A5 in the appendix was used to learn six reaching skills in the sagittal plane (A).
As reward signal we encoded the distance to a marker placed on the radial stylion (denoted by the plus symbol) and punished large muscle excitation signals. Targets are denoted by large dots. We focused on fast reaching skills of 500 ms duration, where an example movement is shown in (B). To simulate how muscles wrap over underlying bone and musculature wrapping surfaces are implemented as cylinders, spheres and ellipsoids (Holzbaur et al., 2005). We hypothesized that a muscle excitation signal can be generated by combining a small number of learned synergies. An example for the anterior deltoid muscle (DeltA) is shown in Figure 10 for two movement directions. Here, DMPSynergies with M = 4 synergies were used to generate the muscle excitation patterns. The muscle excitation patterns for all six movement directions and all eleven muscles are shown in Figure 11. Two observations can be made: first, as our objective function in Equation 12 punishes large muscle excitation signals a sparse representation of multiple motor skills is learned. Second, the learned muscle patterns partially show the typical triphasic behavior of human movement (Angel, 1975;Hallett et al., 1975;Berardelli et al., 1996;Chiovetto et al., 2010), where individual muscles (e.g., DeltA, PectClav and BRA in the first column in Figure 11) become activated at the onset of the movement, shortly before the velocity peak to decelerate, and finally, multiple muscles cocontract at the target location. These three events are denoted by the labels 1, 2, and 3 in the last row in Figure 11, where a threshold of 2 cm s −1 was used to determine the movement onset and the termination of the movement.
Comparing all five movement representations (M = {1, 2, 3, 4, 5}), we found that at least three synergies were necessary to accomplish all reaching tasks. This is shown in Figure 12A, where with only one (M = 1) or two synergies (M = 2) not all targets can be reached. Shown are the marker trajectories of three independent learning sessions (out of ten FIGURE 10 | Synergy combination mechanism. We hypothesize that a muscle excitation signal can be generated by combining a small number of learned synergies. Here, we illustrate this combination process for the deltoid anterior (DeltA) with four synergies for two movement directions. For the two movement directions different combination coefficients β m, k and different time-shift parameters s m, k were learned. The synergies are represented by a single parametrized Gaussian, where the corresponding basis function for DeltA is denoted by a bold line in the enclosing rectangles. runs). Note that similar findings were obtained in analyzing human arm reaching movements, where four to five synergies were observed (d' Avella et al., 2006). The corresponding learning curves for all five movement representations are shown in Figure 12B, where the parametrizations with M = 3..5 synergies perform equal. This is also reflected in the final costs shown in Table 2 (rows 3..5). As an example the marker trajectories and the tangential velocity profiles for the representation using M = 4 synergies are illustrated in Figure 12C. As we evaluated an open-loop control scheme these marker trajectories did not exactly terminate at the target location (after the limited number of episodes for learning). However, by increasing the number of episodes or by adding feedback the terminal accuracy could be improved.
FIGURE 11 | Learned muscle excitation patterns. Shown are the muscle excitation patterns for all six targets (from left to right) and for all muscles (rows one to eleven). In the last row the tangential velocity profiles of the marker placed on the radial stylion is illustrated (see text for details).
For testing the generalization ability of DMPSynergies we rotated all six targets by 30 degrees and only re-learned the task-specific coefficients, i.e., the mixing coefficients β m, k and the time-shift parameters s m, k . Interim solutions with a movement representation implementing M = 4 synergies are shown in Figure 13A. Note that, as we evaluated an open-loop controller, the rotated targets were unknown to the controller. Solely the objective function in Equation 12 quantifies deviations from the targets. After 15 episodes a first trend toward the new targets was visible, however, most of the trajectories (three learned solutions are illustrated) ended at the original targets. The corresponding learning curves for DMPSynergies with three (M = 3) and four (M = 4) synergies are shown in Figure 13B. The learning curve for the unperturbed scenario from the previous experiment is denoted by the dashed line (M = 4 orig.). Note that in boththe unperturbed and the perturbed experiments K = 6 reaching movements were learned, which demonstrates the benefit of the shared learned knowledge when generalizing new skills. For a comparison the blue line denoted by DMP N = 4 illustrates the convergence rate of single task learning with DMPs, where DMPSynergies (M = 4 orig.) can compete in terms of learning speed.

DISCUSSION
We proposed a movement representation based on learned parametrized synergies (DMPSynergies) that can be linearly combined and shifted in time. These learned synergies are shared among multiple task instances significantly facilitating learning of motor control policies. This was demonstrated on simulated robotic and on musculoskeletal systems. Below we discuss the significance and the implication of our findings with respect to robotics and biological motor control.

EXPLOITING SHARED SYNERGIES FOR MOTOR SKILL LEARNING IN ROBOTICS
For motor skill learning in robotics a common strategy is to use parametrized elementary movements or movement primitives (Kober and Peters, 2011). In this paper we proposed  a generalization of the most widely used movement primitive representation in robotics, dynamic movement primitives (DMPs) (Schaal et al., 2003;Ijspeert et al., 2013). DMPs evaluate parametrized dynamical systems to generate trajectories.
The dynamical system is constructed such that the system is stable. This movement representation has many advantages. It is a model-free approach, partially explaining its popularity in robotics as model learning in high-dimensional stochastic robotics systems is challenging. Further, its stable attractor system facilitates learning and DMPs can represent both rhythmic and discrete movements. Meta parameters can be used for adapting the movement speed or the goal state. Finally, the movement representation depends linearly on the policy parametrization, i.e., the learnable function f depends linearly on the parameters θ of the movement primitive: f (s) = (s) T θ, where s is the time or phase variable. As a result, imitation learning for DMPs is straightforward, as this can simply be done by performing linear regression (Schaal et al., 2003). However, for each task k an individual set of parameters θ k has to be learned, which unnecessarily complicates learning of a large number of related motor skills.
In contrast we proposed a generalization that allows for reusing shared knowledge among multiple related motor skills, i.e., the parameter vector θ is task-invariant.
In particular, we replaced the non-linear modulation function f (.) in DMPs by a hierarchical function approximator. On the lower level task related parameters (amplitude scaling weights and time-shift parameters) are used to modulate a linear superposition of basis functions. These basis functions encode shared higher level knowledge and are modeled by a mixture of Gaussians. With the proposed DMPSynergies representation discrete and rhythmic movements can be generated. By using Gaussians at the higher level DMPs can be implemented as special case. However, the DMPSynergies can compete with DMPs in terms of learning efficiency while allowing for learning multiple motor skills simultaneously.
This was demonstrated in two robotic multi-task learning scenarios, where we showed that, with the DMPSynergies, good policies could be found more reliably, i.e., local minima with high cost values were more often avoided, more efficiently (fewer samples were needed), and new skills could be generalized by exploiting the previously learned shared knowledge. A simple via-point task was used to demonstrate the characteristics of the approach, where the proposed movement representation could be used to generalize new movement trajectories by applying a linear interpolation on the synergy's weights β. In a second robotic task, a biped walker task, the hierarchical representation was used to learn walking patterns with multiple step heights. In this complex reinforcement learning task, it was shown that better solutions were found more reliably by exploiting the learned shared knowledge, which is a strong feature of a movement representation. While also with the classical DMP approach high quality movement gaits were learned, on average the achieved costs were higher compared to the proposed hierarchical synergies representation, i.e., −19.2 ± 0.6 for DMPs with 4 Gaussians (and with incremental learning) compared to −21.4 ± 0.4 when using M = 2 synergies with N = 3 Gaussians (where the time-shift parameters were fixed and set to zero, = 0). In this experiment 10, 000 samples were needed to learn 4 walking gaits simultaneously, where the DMPSynergies approach can compete with DMPs (15, 000 samples). Additionally, we demonstrated in a generalization experiment that walking patterns for an unknown step height (r * = 0.1 m) could be learned with 100 samples by exploiting the previously learned prior knowledge.
While DMPs (Schaal et al., 2003;Ijspeert et al., 2013) are most closely related to our shared synergies approach, there exist a few other approaches (Chhabra and Jacobs, 2006;Alessandro et al., 2012) also implement shared knowledge. In Chhabra and Jacobs (2006) a variant of non-negative matrix factorization (d' Avella et al., 2003) was used to compute the synergies given a set of trajectories created by applying stochastic optimal control methods (Li and Todorov, 2004). In Alessandro et al. (2012) an exploration phase was introduced to compute the dynamic responses of a robot system with random initialization. After a reduction phase, where a small number of proto-tasks were executed, a reduced set of dynamic responses was used to compute the synergies matrix by solving a linear system of equations. We proposed an alternative for learning the synergies and their combination parameters, where all unknowns are learned in a reinforcement learning setting from a single sparse reward signal. Moreover, for robotic tasks we embed the synergies approach in stable dynamical systems like in DMPs. This combines the benefits of DMPs and muscle synergies, namely the efficient learning ability of DMPs in high-dimensional systems and the hierarchical representation of movements that can be used for multi-task learning.
As with DMPs the complexity of the DMPSynergies representation can be scaled by the number of combined synergies or the number of implemented Gaussians modeling these synergies. However, as the trajectories generated with our representation depend non-linearly on the policy parameters (in contrast to DMPs) more sophisticated decomposition strategies like for example d' Avella and Tresch (2001); Chiovetto et al. (2013) are needed for imitation learning. With such approaches the extracted synergies could be implemented as initial solutions in our learning framework.

LEARNED SHARED SYNERGIES FOR BIOLOGICAL MOVEMENT GENERATION
The idea of reusing shared knowledge for movement generation is a well-known concept in biological motor control. Muscle activation patterns recorded during multiple task instances of natural motor behavior, i.e., fast reaching movements of humans (d' Avella et al., 2006), primate grasping movements (Overduin et al., 2008), or walking patterns (Dominici et al., 2011), could be efficiently modeled by combining only few muscle activation patterns. In particular, time-varying muscle synergies (d' Avella et al., 2003;Bizzi et al., 2008) were proposed to be a compact representation of muscle activation patterns. The key idea of this approach is that muscle activation patterns are linear sums of simpler, elemental functions or synergies. Each muscle synergy can be shifted in time and scaled with a linear factor to construct a large variety of activation patterns. In this manuscript we proposed a generative model to represent and learn time-varying synergies (d' Avella et al., 2006). The proposed framework allows for studying the concept of muscle synergies from a generative perspective in contrast to the analytical approach, where muscle synergies are identified from observed data. Applying such a generative approach to a musculoskeletal model, we could provide a proof-of-concept of the feasibility of a low-dimensional controller based on shared synergies and a demonstration of its learning efficiency. Moreover, we could ask different question, i.e., how does performance scale with the complexity of the movement representation, how sparse is the encoding of the muscle patterns to solve particular tasks, and how well does the learned representation generalize to new movements? We addressed these questions in a multidirectional reaching task, where we investigated a musculoskeletal model of the upper limb with 11 muscles. Motor skills for 6 reaching directions were learned within 3000 episodes and by exploiting the learned shared synergies movements for rotated target directions can be generalized 3 times faster (Figure 13). We found that a minimum of three synergies were necessary to solve the task (Figure 12B). In our objective function large muscle excitation signals were punished, which resulted in a sparse representation of muscle excitation patterns. This sparse representation illustrated in Figure 11 shows similarities to observed electromyographic activity recorded in related human reaching tasks (d' Avella et al., 2006), i.e., triphasic muscle patterns, where some of the muscles contributed at the movement onset, some at point of the maximum tangential velocity, and some at the end of the movement to co-contract. However, sensor feedback might be an important modulation signal to make this effect more pronounced.
The model was designed to capture salient features of human musculoskeletal system, such as muscle activation dynamics, Hilltype musculotendinous units, realistic geometry. However, to reduce the computational effort needed to simulate a movement we made a few simplifying assumptions. First, a limited number of muscles (11) were implemented, where simplified wrapping objects and muscle paths were modeled. Further, we implemented the shoulder and the elbow joint as hinge joints. Thus, only reaching movements in the sagittal plane could be performed. Finally, we focused on fast reaching movements in an open-loop control scheme. This was a valid assumption for comparing to human data for fast reaching movements (d' Avella et al., 2006). However, our proposed learning and control framework also allows for implementing closed-loop controllers, i.e., when introducing an inverse kinematics model ∈ R Dx3 in Equation 1, i.e., τż = (α z (β z (g − y * ) − z)) + f, where D denotes the number of actuators and we assumed that the goal state g lives in a threedimensional Cartesian space. The inverse kinematics model maps the feedback error signal into the muscle pattern space and modulates the learned muscle excitation basis f ∈ R D . With such closed-loop systems we might better understand the contribution of feedback to muscle control in biological movement generation (Lockhart and Ting, 2007).
Musculoskeletal models have been used before to investigate movement generation with muscle synergies (Berniker et al., 2009;Neptune et al., 2009;McKay and Ting, 2012). Berniker and colleagues used model-order reduction techniques to identify synergies as a low-dimensional representation of a non-linear system's input/output dynamics and optimal control to find the activations of these synergies necessary to produce a range of movements. They found that such a set of synergies was capable of producing effective control of reaching movements with a musculoskeletal model of a frog limb and that it was possible to build a relatively simple controller whose overall performance was close to that of the system's full-dimensional non-linear controller. Neptune and colleagues generated muscle-actuated forward dynamics simulations of normal walking using muscle synergies identified from human experimental data using nonnegative matrix factorization as the muscle control inputs. The simulation indicated that a simple neural control strategy involving five muscle synergies was sufficient to perform the basic sub-tasks of walking. McKay and Ting, studying an unrestrained balance task in cats, used a static quadrupedal musculoskeletal model of standing balance to identify patterns of muscle activity that produced forces and moments at the center of mass (CoM) necessary to maintain balance in response to postural perturbations. CoM control could be accomplished with a small number of muscle synergies identified from experimental data, suggesting that muscle synergies can achieve similar kinetics to the optimal solution, but with increased control effort compared to individual muscle control. In line with these simulation studies, we also found that a small number of muscle synergies was sufficient to perform multiple reaching tasks in a forward dynamic simulation of a musculoskeletal model. However, we did not use experimental data or model-order reduction techniques to identify muscle synergies. In our framework, both synergy structural parameters and synergy combination parameters were found with reinforcement learning, supporting the generality of the solutions identified. Moreover, we were able to test the generalization ability of the synergies in the same framework by optimizing only the task-specific synergy combination parameters. The proposed reinforcement learning framework with movement primitives relates to optimal control approaches in the biological motor control literature (Delp et al., 2007;Erdemir et al., 2007). In these simulation studies muscle patterns are parametrized by e.g., bang-bang (on-off) controls, constant control values, or control vectors approximated with polynomials [see Table 2 in Erdemir et al. (2007) for an overview of different control strategies]. However, to the best of our knowledge non of these approaches implemented shared synergies as control signal representation for learning multiple task instances simultaneously. Even with complex representations, e.g., with M = 5 synergies learning 225 parameters converged within 3000 episodes, which is a promising feature of the proposed approach for studies on more complex musculoskeletal models.
In this manuscript we demonstrated how time-varying synergies (d' Avella et al., 2006) can be implemented and learned from scratch. Interestingly, by adding an additional constraint on the movement representation, i.e., by using a single policy vector for all actuators anechoic mixing coefficients (Giese et al., 2009) can be implemented. However, in general any synergy representation such as synchronous synergies (Ivanenko et al., 2004;Dominici et al., 2011) used for locomotion can be learned. Thus, we do not argue for a particular synergy representation. Our representation was motivated to extend the widely used DMPs (Schaal et al., 2003) for exploiting shared task-invariant knowledge for motor skill learning in robotics.

CONCLUSION
We proposed a movement primitive representation implementing shared knowledge in form of learned synergies. The representation competes with the state-of-the-art, it can implement DMPs (Schaal et al., 2003) as a special case, and it allows for an efficient generalization to new skills. Importantly, shared knowledge simplifies policy search in high-dimensional spaces, which was demonstrated in a dynamic biped walking task. Further, the proposed learned synergies are a compact representation of high-dimensional muscle excitation patterns, which allows us to implement reinforcement learning in musculoskeletal systems. In such frameworks muscle patterns are learned from scratch using a sparse reward signal, where we could investigate how muscles and muscle synergies contribute to a specific task, how complex a task-invariant representation must be, and how well the learned synergies generalize to changes in the environment. In a multi-directional arm reaching experiment we provided first insights to these questions. In future research the proposed movement generation and learning framework will be used to study feedback signals and feedback delays, imitation learning from biological data and the effect of simulated muscle surgeries.