Bridging Reinforcement Learning and Iterative Learning Control: Autonomous Motion Learning for Unknown, Nonlinear Dynamics

This work addresses the problem of reference tracking in autonomously learning robots with unknown, nonlinear dynamics. Existing solutions require model information or extensive parameter tuning, and have rarely been validated in real-world experiments. We propose a learning control scheme that learns to approximate the unknown dynamics by a Gaussian Process (GP), which is used to optimize and apply a feedforward control input on each trial. Unlike existing approaches, the proposed method requires neither knowledge of the system states and their dynamics nor knowledge of an effective feedback control structure. All algorithm parameters are chosen automatically, i.e., the learning method works plug and play. The proposed method is validated in extensive simulations and real-world experiments. In contrast to most existing work, we study learning dynamics for more than one motion task as well as the robustness of performance across a large range of learning parameters. The method's plug-and-play applicability is demonstrated by experiments with a balancing robot, in which the proposed method rapidly learns to track the desired output. Due to its model-agnostic and plug-and-play properties, the proposed method is expected to have high potential for application to a large class of reference tracking problems in systems with unknown, nonlinear dynamics.
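To make the trial-by-trial scheme concrete, the following is a minimal, hypothetical Python sketch of such a learning loop. It assumes, purely for illustration, that the GP maps the applied input trajectory to the resulting output trajectory, that a generic numerical optimizer computes the next feedforward input, and that the helper apply_trial stands in for one rollout on the real system; the actual model structure and optimizer of the proposed method may differ.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def apply_trial(u):
    """Stand-in for one rollout on the real, unknown system (toy nonlinear plant)."""
    y = np.zeros_like(u)
    for n in range(1, len(u)):
        y[n] = 0.9 * y[n - 1] + 0.1 * np.tanh(u[n - 1])
    return y


N = 100                                      # horizon length (samples)
r = np.sin(np.linspace(0, 2 * np.pi, N))     # example reference trajectory
u = np.zeros(N)                              # initial feedforward input
X, Y = [], []                                # recorded input/output trajectories

for trial in range(5):
    y = apply_trial(u)                       # apply the current input, record the output
    X.append(u.copy())
    Y.append(y)

    # Fit a GP model that maps input trajectories to output trajectories.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(np.array(X), np.array(Y))

    # Optimize the next feedforward input against the GP's predicted output.
    cost = lambda v: np.sum((r - gp.predict(v.reshape(1, -1))[0]) ** 2)
    u = minimize(cost, u, method="L-BFGS-B", options={"maxiter": 10}).x
```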

The task consists of having the output y follow the reference r, with r(n) ∈ R, over a finite horizon of N = 100 samples. The feedforward control strategy consists of applying an input trajectory u_FF ∈ R^N, whose values are determined by optimization such that the squared tracking error is minimized, i.e.,

u_FF* = argmin_{u_FF} Σ_{n=1}^{N} (r(n) − y(n))².

The feedback control strategy consists of a generic, nonlinear function to ensure that performance is not limited by the structure of the feedback law. In particular, the input values u_FB are computed as the sum of ten polynomials of tenth order, to which the current and nine previous error samples e(n) = r(n) − y(n) serve as inputs, i.e.,

∀n ∈ [1, N],  u_FB(n) = Σ_{i=1}^{10} Σ_{j=1}^{10} k_ij · e(n − i + 1)^j.

The set of feedback parameters K = {k_ij | i, j ∈ [1, 10]} is determined via optimization such that the squared tracking error is minimized, i.e.,

K* = argmin_K Σ_{n=1}^{N} (r(n) − y(n))².
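For illustration, the block below gives a minimal Python sketch of this feedback law; the function name u_fb and its arguments are hypothetical, and the polynomial terms are assumed to range over powers one to ten without a constant term. The parameters k would then be handed to a generic optimizer that minimizes the squared tracking error over a simulated run.

```python
import numpy as np

def u_fb(e_hist, k):
    """Generic nonlinear feedback: sum of ten polynomials of tenth order.

    e_hist : array of shape (10,) with the current and nine previous error
             samples e(n), e(n-1), ..., e(n-9)
    k      : array of shape (10, 10) with the feedback parameters k_ij
    """
    # u_FB(n) = sum_i sum_j k[i, j] * e_hist[i] ** (j + 1), i.e., powers 1..10
    powers = np.power.outer(e_hist, np.arange(1, 11))   # shape (10, 10)
    return float(np.sum(k * powers))
```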

A.3 Feedback Control of the TWIPR
Consider the dynamics of the TWIPR moving along a straight line. The robot has two degrees of freedom, namely, the pitch angle Θ ∈ R and the position s ∈ R. The state vector x ∈ R^4 comprises both degrees of freedom and their time derivatives. The motor torque serves as the input variable and is denoted by u ∈ R. To stabilize the TWIPR in its upright equilibrium, the nonlinear dynamics are approximated by a linear, discrete-time model of the form

∀n ∈ N,  x(n + 1) = Ax(n) + Bu(n),

using a sampling period of T = 0.02 s. The stabilizing control input u_C ∈ R is computed by linear state feedback of the form

u_C(n) = −Kx(n),

where the feedback matrix K is designed by LQR [1].
To track the desired reference maneuvers, the feedback input u_C is superposed with a learned feedforward input u_L, leading to the overall input

u(n) = u_C(n) + u_L(n).
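A minimal Python sketch of such a discrete-time LQR design and of the superposition of the two inputs is given below; the model matrices and the weights Q and R are placeholders chosen for illustration and are not the identified TWIPR model.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder discrete-time model x(n+1) = A x(n) + B u(n) with T = 0.02 s.
# A generic fourth-order integrator chain is used here purely for illustration;
# it is not the identified TWIPR model.
T = 0.02
A = np.eye(4) + T * np.eye(4, k=1)
B = np.array([[0.0], [0.0], [0.0], [T]])

# LQR design with hypothetical weights Q and R: solve the discrete-time
# algebraic Riccati equation and compute the state-feedback gain K.
Q = np.eye(4)
R = np.array([[1.0]])
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # shape (1, 4)

def overall_input(x, u_L):
    """Overall input: stabilizing feedback u_C(n) = -K x(n) plus learned feedforward u_L(n)."""
    u_C = -(K @ x).item()
    return u_C + u_L
```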

A.4 Policy Gradient Implementation
In this section, we briefly outline the implementation details of the finite-difference policy gradient method that was used as a baseline comparison in Section 4.2. For a detailed discussion of the method and its implementation, see [2]. The finite-difference gradient estimation was chosen because this method is expected to be highly efficient due to the deterministic nature of the simulations, see [2]. In order to apply the policy gradient scheme to the learning task of Section 4.2, the policy is defined as the input trajectory u_j, and the reward of a trial is defined as the negative squared tracking error,

R_j = −Σ_{n=1}^{N} (r(n) − y_j(n))².

On each iteration, the policy is updated by

u_{j+1} = u_j + α ∇R_j,

where ∇R_j is an estimate of the reward's gradient with respect to the input trajectory, and α ∈ R is a step size. To estimate the gradient, W ∈ N roll-out trials with the perturbed policies u_j + ∆_w are performed, and the gradient is determined by least-squares estimation, as detailed in [2]. In the simulations, the step size was chosen as α = 50, one roll-out per trial, i.e., W = 1, was used, and the policy perturbations were drawn according to ∆_w ∼ N(0, 0.001 I).
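For reference, a minimal Python sketch of one such finite-difference update is shown below, using the parameter values quoted above; run_trial and reward are placeholders for the simulation environment and the reward computation (e.g., the negative squared tracking error), and the derivation in [2] is simplified.

```python
import numpy as np

def fd_policy_gradient_step(u, run_trial, reward, alpha=50.0, W=1, sigma2=0.001):
    """One finite-difference policy-gradient update of the input trajectory u.

    u         : current policy, i.e., the feedforward input trajectory of length N
    run_trial : callable mapping an input trajectory to an output trajectory
    reward    : callable mapping an output trajectory to a scalar reward
    """
    R_ref = reward(run_trial(u))                   # reward of the unperturbed policy
    deltas, dR = [], []
    for _ in range(W):                             # W perturbed roll-out trials
        d = np.random.normal(0.0, np.sqrt(sigma2), size=len(u))   # Δ_w ~ N(0, 0.001·I)
        deltas.append(d)
        dR.append(reward(run_trial(u + d)) - R_ref)
    # Least-squares estimate of the reward gradient from the perturbations.
    grad, *_ = np.linalg.lstsq(np.array(deltas), np.array(dR), rcond=None)
    return u + alpha * grad                        # gradient-ascent policy update
```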
In contrast to the method proposed in this paper, the parameters of the policy gradient scheme had to be tuned manually and were chosen to yield a satisfactory trade-off between fast learning and robust convergence for all three reference trajectories.