Time and Action Co-Training in Reinforcement Learning Agents

In formation control, a robot (or an agent) learns to align itself in a particular spatial alignment. However, in a few scenarios, it is also vital to learn temporal alignment along with spatial alignment. An effective control system encompasses flexibility, precision, and timeliness. Existing reinforcement learning algorithms excel at learning to select an action given a state. However, executing an optimal action at an appropriate time remains challenging. Building a reinforcement learning agent which can learn an optimal time to act along with an optimal action can address this challenge. Neural networks in which timing relies on dynamic changes in the activity of population neurons have been shown to be a more effective representation of time. In this work, we trained a reinforcement learning agent to create its representation of time using a neural network with a population of recurrently connected nonlinear firing rate neurons. Trained using a reward-based recursive least square algorithm, the agent learned to produce a neural trajectory that peaks at the “time-to-act”; thus, it learns “when” to act. A few control system applications also require the agent to temporally scale its action. We trained the agent so that it could temporally scale its action for different speed inputs. Furthermore, given one state, the agent could learn to plan multiple future actions, that is, multiple times to act without needing to observe a new state.


INTRODUCTION
A powerful formation control system requires continuously monitoring the current state, comparing the performance, and deciding whether to take necessary actions. This process does not only need to understand the system's state and optimal actions but also needs to learn the appropriate time to perform an action. Deep reinforcement learning algorithms which have achieved remarkable success in the field of robotics, games, and board games have also been shown to perform well in adaptive control system problems Li et al. (2019); Oh et al. (2015); Xue et al. (2013). However, the challenge of learning the precise time to act has not been directly addressed.
The ability to measure time from the start of a state change and use it accordingly is an essential part of applications such as adaptive control systems. In general, the environment is encoded in four dimensions: the three dimensions of space and the dimension of time. The representation of time affects the decision-making process along with the spatial aspects of the environment Klapproth (2008). However, in the field of reinforcement learning (RL), the essential role of time is not explicitly acknowledged, and existing RL research mainly focuses on the spatial dimensions. The lack of a sense of time might not be an issue in a simple behavioral task, but many tasks in control systems require precisely timed actions, for which an artificial agent must learn a representation of time and experience the passage of time.
Research on time representation has yielded several different supervised learning models such as the ramping firing rate Durstewitz (2003), multiple oscillator models Matell et al. (2003); Miall (1989), diffusion models Simen et al. (2011), and the population clock model Buonomano and Laje (2011). In some of these models, such as the two presented in the studies by Hardy et al. (2018) and Laje and Buonomano (2013), timing relies on dynamic changes in the activity patterns of neuron populations. More specifically, it relies on nonlinear firing rate neurons connected recurrently, and research has shown that these models are the most effective Buonomano and Laje (2011) and the best at accounting for timing and temporal scaling compared to other available models. Extending this work on a rote sense of time for agents, we used a population clock model recurrent neural network (RNN) consisting of nonlinear firing rate neurons as our timing module and trained a reinforcement learning agent to create its own representation of time.
It is arguable that a traditional artificial neural network, such as a multilayer perceptron, which was proven to learn complex spatial patterns, could also be used to learn time representation. However, these networks might not be well suited to perform a simple interval-discrimination task, due to the lack of the implicit representation of time Buonomano and Maass (2009). One argument is that a traditional artificial neural network processes inputs and outputs as a static spatial pattern. However, to achieve an effective control system, the agent needs to continuously process the state of the system. For instance, if we want an agent to process continuous-time input, such as a video in a game, we divide the input into multiple time-bins. Similarly, deep neural network (DNN) models with long short-term memory (LSTM) units Hochreiter and Schmidhuber (1997) or gated recurrent units (GRUs) Chung et al. (2014) can implicitly represent time by allowing the state of the previous time step to interact with the state of the current time step. These networks still treat time as a spatial dimension because they expect the input to be discretised into multiple time bins Bakker (2002) Buonomano and Maass (2009). Because these networks treat time as a spatial dimension, they might lack explicit time representation.
Through the lens of RL algorithms, the problem of discretising input into multiple time bins can be explained as follows. Given the current state of the environment $S_t$, a DNN function approximator (for example, a policy network) outputs an action $A_t$ at every time step $t$. If an action $A_t$ is more valuable when executed at time $t + \delta x$ or $t - \delta y$, then to effectively maximize the sum of future rewards, we should further divide the input into smaller time steps. By dividing these time steps more finely, an agent could learn the true value of the state, although at the expense of a higher computation cost and increased state value variance Petter et al. (2018). A few studies Carrara et al. (2019); Tallec et al. (2019); Doya (2000) have elegantly extended reinforcement learning algorithms to continuous time and state space, which generalizes the value function approximators over time. However, if an agent has developed a representation of time, it could learn to explicitly encode the optimal time intervals itself and, in turn, learn to decide when to act. In this study, we present a model of how the time representation is learned and how the subsequent encoding process could take place.
In this research, we have developed a new scenario called "task switching," where an agent is presented with multiple circles to click (tasks), and each circle should be clicked within a specific time window in a specific order. This scenario attempts to encapsulate both spatial and timing decisions. The task was built to be analogous to a multi-input multi-output (MIMO) system in process control, where the system should compare the state of the current system and decide when to make parameter changes. This research aims to investigate the co-learning of decision making and the development of timing by an artificial agent using a reinforcement learning framework. We achieve this by disentangling the process of learning the optimal action (which circle to click) from the time representation (when to click a circle). We designed a novel architecture that contains two modules: 1. a timing module that uses a population clock model, a recurrent neural network (RNN) consisting of nonlinear firing rate neurons, and 2. an action module that employs a deep Q-network (DQN) Mnih et al. (2015) to learn the optimal action given a specific state. The RNN and DQN are co-trained to learn the time to act and the action. The RNN was trained using a reward-based recursive least squares algorithm, and the DQN was trained using the Bellman equation. The results of a series of task-switching scenarios show that the agent learned to produce a neural trajectory reflecting its own sense of time that peaked at the correct time-to-act. Furthermore, the agent was able to temporally scale its time-to-act more quickly or more slowly according to the input speed.
FIGURE 1 | Task-switching scenario with four circles. Circle 1 (in blue) must be clicked at a time point between 800 and 900 ms from the start of the experiment. Circles 2 (in green), 3 (in orange), and 4 (in yellow) must be clicked in the 2,300-2,400, 3,300-3,400, and 1,500-1,600 ms intervals, respectively.
We also compared the performance of the proposed architecture with DNN models such as LSTM, which can implicitly represent time. We observed that for tasks involving precisely timed action, neural network models such as the population clock model perform better than the LSTM. This article first presents the task-switching scenario and describes the proposed architecture and training methodology used in the work. Section 3 presents the performance of the trained RL agent on six different experiments. In Section 4, we present the performance of LSTM in comparison with the proposed model. Finally, Section 5 presents an extensive discussion about the learned time representation with respect to prior electrophysiology studies.

Task-Switching Scenario
In the scenario, there are n different circles, and the agent must learn to click on each circle within a specific time interval and in a specific order. This task involves learning to decide which circle to click and when that circle should be clicked. Figure 1 shows an example scenario with four circles. Circle 1 must be clicked at some point between 800 and 900 ms. Similarly, circles 2, 3, and 4 must be clicked at 1,500-1,600, 2,300-2,400, and 3,300-3,400 ms, respectively. If the agent clicks the correct circle in the correct time period, it receives a positive reward. If it clicks a circle at the incorrect time, it receives a negative reward (refer to Table 1 for the exact reward values). Each circle becomes inactive once its time interval has passed. For example, circle 1 in Figure 1 becomes inactive at 901 ms, meaning that the agent cannot click it after 900 ms and receives a reward of 0 if it attempts to click the inactive circle. Each circle can only be clicked once during an episode.
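The reward rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the time windows come from the four-circle example in the text, while the reward magnitudes (`R_CORRECT`, `R_WRONG_TIME`) are placeholders because the actual values are listed in Table 1, and the function name `click_reward` is hypothetical.

```python
# Time windows (ms) for each circle, taken from the four-circle example above.
WINDOWS_MS = {1: (800, 900), 2: (1500, 1600), 3: (2300, 2400), 4: (3300, 3400)}

# Placeholder reward magnitudes (assumption); the actual values are in Table 1.
R_CORRECT = 1.0
R_WRONG_TIME = -1.0
R_INACTIVE = 0.0


def click_reward(circle, t_ms, already_clicked):
    """Return the reward for clicking `circle` at time `t_ms` (milliseconds)."""
    start, end = WINDOWS_MS[circle]
    # A circle is inactive once its window has passed; treating an
    # already-clicked circle the same way is an assumption of this sketch.
    if t_ms > end or circle in already_clicked:
        return R_INACTIVE
    if start <= t_ms <= end:
        return R_CORRECT
    return R_WRONG_TIME  # clicked before the window opened
```

For example, `click_reward(1, 850, set())` falls inside the first window, whereas the same click at 901 ms targets an inactive circle.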
The same scenario was modified to conduct the following experiments:
• Co-training time and action in a reinforcement learning agent on a simple task-switching scenario.
• Temporal scaling: the time intervals of each circle occur at different speeds. For instance, at Speed 2, circle 1 in Figure 1 must be clicked between 750 and 850 ms; similarly, circles 2, 3, and 4 must be clicked at 1,450-1,550, 2,250-2,350, and 3,250-3,350 ms, respectively.
• Multiple clicks: one circle should be clicked multiple times without any external cue. For instance, after circle 1 is clicked and without any further stimulus input, the agent should learn to click the same circle after a fixed time interval.
• Twenty circles: to understand whether the agent can handle a large number of tasks, we trained the agent on a scenario containing 20 circles.
• Skip state: in the task-switching scenario, the learned time-to-act should be a state-dependent action. In other words, when the state input is eliminated, the agent should not perform an action. For instance, if circle 4 in Figure 1 is removed from the state input, the agent should skip clicking on circle 4.

Framework
To disentangle the learning of temporal and spatial aspects of the action space, the temporal aspect being when to act and the spatial being what to act on, we used two different networks: a DQN to learn which action to take and an RNN which learns to produce a neural trajectory that peaks at the time-to-act.

Deep Q-Network
In recent years, RL algorithms have given rise to tremendous achievements Vinyals et al. (2019); Mnih et al. (2013); Silver et al. (2017). RL is formalized as a Markov decision process (MDP) defined by the state space $S$, the action space $A$, and the reward function $R: S \times A \to \mathbb{R}$. At any given time step $t$, the agent receives a state $s_t \in S$, which it uses to select an action $a_t \in A$ and execute that action on the environment. Next, the agent receives a reward $r_{t+\delta t} \in \mathbb{R}$, and the environment changes from state $s_t$ to $s_{t+\delta t} \in S$. For each action the agent performs on the environment, it collects $(s_t, a_t, r_{t+\delta t}, s_{t+\delta t})$, also called an experience tuple. An agent learns to take actions that maximize the accumulated future rewards, which can be expressed as $R_t$ as follows:
$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \tag{1}$$
where $\gamma \in [0, 1]$ is the discount factor that determines the relative importance of the immediate reward and future rewards. If $\gamma = 0$, the agent will learn to choose actions that produce an immediate reward. If $\gamma = 1$, the agent will evaluate its actions based on the sum of all its future rewards. To learn the sequence of actions that leads to the maximum discounted sum of future rewards, an agent estimates optimal values for all possible actions in a given state. These estimated values are defined by the expected sum of future rewards under a given policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s, a_t = a\right] \tag{2}$$
where $\mathbb{E}_{\pi}$ is the expectation under the policy $\pi$, and $Q^{\pi}(s, a)$ is the expected sum of discounted rewards when the action $a$ is chosen by the agent in the state $s$ under a policy $\pi$. Q-learning Watkins and Dayan (1992) is a widely used reinforcement learning algorithm that enables the agent to update its $Q^{\pi}(s, a)$ estimate iteratively by using the following formula:
$$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q^{\pi}(s_{t+1}, a) - Q^{\pi}(s_t, a_t)\right] \tag{3}$$
where $\alpha$ is the learning rate, and $Q^{\pi}(s_{t+1}, a)$ is the future value estimate. By iteratively updating the Q values based on the agent's experience, the Q function converges to the optimal Q function, which satisfies the following Bellman optimality equation:
$$Q^{*}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a} Q^{*}(s_{t+1}, a)\right] \tag{4}$$
where $Q^{*} \equiv Q^{\pi^{*}}$ and $\pi^{*}$ is the optimal policy. Action $a$ can be determined as follows:
$$a = \arg\max_{a} Q^{*}(s, a) \tag{5}$$
When the state space and the action space are discrete and finite, the Q function can be a table that contains all possible state-action values. However, when the state and action spaces are large or continuous, a neural network is commonly used as a Q-function approximator Mnih et al. (2015); Lillicrap et al. (2015). In this work, we model a reinforcement learning agent that uses a fully connected DNN as a Q-function approximator to select one of the four circles.
Frontiers in Control Engineering | www.frontiersin.org August 2021 | Volume 2 | Article 722092
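As a concrete illustration of the tabular case mentioned above, the following minimal sketch applies one standard Q-learning update and a greedy action choice. It is not the authors' implementation: the dictionary-based table, the function names, and the default values of `alpha` and `gamma` are assumptions chosen only for the example.

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to the dict `q` and return the new value."""
    # Best estimated value of the next state (unseen pairs default to 0).
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (td_target - old)
    return q[(s, a)]


def greedy_action(q, s, actions):
    """Pick the action with the highest estimated value in state `s`."""
    return max(actions, key=lambda b: q.get((s, b), 0.0))
```

Starting from an empty table, a single rewarded transition moves the corresponding entry a fraction `alpha` of the way toward the target, and `greedy_action` then prefers that action in the same state.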

Recurrent Neural Network
In this study, we used the population clock model for training the RL agent to learn the representation of time. In previous studies, this model has been shown to robustly learn and generate simple-to-complex temporal patterns Laje and Buonomano (2013); Hardy et al. (2018). The population clock model (i.e., RNN) contains a pool of recurrently connected nonlinear firing rate neurons with random initial weights, as shown at the top of Figure 2. To achieve "time-to-act" and temporal scaling of timing behavior, we trained the weights of both recurrent neurons and output neurons. The network we used in this study contained 300 recurrent neurons, as indicated by the blue neurons inside the green circle, plus one input and one output neuron. The dynamics of the network Sompolinsky et al. (1988) are governed by Eqs 6-8. Learning performance was similar with a larger number of neurons and started to decline when 200 neurons were used.
Given a network that contains $N$ recurrent neurons, $fr_i$ represents the firing rate of the $i$th ($i = 1, 2, \ldots, N$) recurrent neuron. $W^{Rec}$, which is an $N \times N$ weight matrix, defines the connectivity of the recurrent neurons and is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of $1/g\sqrt{N}$, where $g$ represents the gain of the network. Each input neuron is connected to every recurrent neuron in the network with $W^{In}$, which is an $N \times 1$ input weight matrix. $W^{In}$ is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of 1 and is fixed during training. Similarly, every recurrent neuron is connected to each output neuron with $W^{Out}$, which is a $1 \times N$ output weight matrix. In this study, we trained $W^{Rec}$ and $W^{Out}$ using a reward-based recursive least squares method. The variable $y$ represents the activity level of the input neurons (states), and $z$ represents the output. $x_i(t)$ represents the state of the $i$th recurrent neuron, which is initially zero, and $\tau$ is the neuron time constant.
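To make the roles of these variables concrete, here is a small sketch of one forward-Euler step of a firing-rate network. It assumes the form commonly used in the population clock literature, $\tau \, dx_i/dt = -x_i + \sum_j W^{Rec}_{ij} fr_j + W^{In}_i y$ with $fr_i = \tanh(x_i)$ and $z = \sum_j W^{Out}_j fr_j$; whether this matches Eqs 6-8 term by term is an assumption. The function names, the default `dt` and `tau`, and the output-weight initialization are hypothetical, and the recurrent standard deviation is left as a parameter.

```python
import math
import random


def init_network(n, rec_std, seed=0):
    """Create random weights and a zero initial state for n recurrent neurons."""
    rng = random.Random(seed)
    w_rec = [[rng.gauss(0.0, rec_std) for _ in range(n)] for _ in range(n)]
    w_in = [rng.gauss(0.0, 1.0) for _ in range(n)]  # mean 0, std 1, kept fixed
    # Output-weight scale is an assumption of this sketch.
    w_out = [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]
    x = [0.0] * n  # neuron states start at zero
    return w_rec, w_in, w_out, x


def euler_step(x, w_rec, w_in, w_out, y, dt=1.0, tau=10.0):
    """Advance the network by one time step dt for scalar input y."""
    fr = [math.tanh(v) for v in x]  # firing rates from current states
    new_x = []
    for i, xi in enumerate(x):
        drive = sum(w * f for w, f in zip(w_rec[i], fr)) + w_in[i] * y
        new_x.append(xi + (dt / tau) * (-xi + drive))
    new_fr = [math.tanh(v) for v in new_x]
    z = sum(w * f for w, f in zip(w_out, new_fr))  # scalar output activity
    return new_x, new_fr, z
```

Calling `euler_step` repeatedly with a brief nonzero `y` followed by zeros produces an output trajectory `z(t)` of the kind that the training procedure then shapes.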
Initially, due to the high gain caused by $W^{Rec}$ (when $g = 1.6$), the network produces chaotic dynamics, which in theory can encode time for a long time Hardy et al. (2018). In practice, the recurrent weights need to be tuned to reduce this chaos and locally stabilize the output activity. The parameters, such as connection probability, $\Delta t$, $g$ (gain of the network), and $\tau$, were chosen based on the existing population clock model research Buonomano and Maass (2009); Laje and Buonomano (2013). In this work, we trained both recurrent and output weights using a reward-based recursive least squares algorithm. During an episode, the agent chooses to act when the output activity exceeds a threshold (in this study, 0.5). We experimented with other threshold values between 0.4 and 1, but each produced results similar to those for 0.5. If the activity never exceeds the threshold, then the agent chooses a random time point to act. This ensures that the agent tries different time points and acts before it learns the temporal nature of the task. As illustrated in Figure 2 (left side), a sequence of state inputs is given to an agent during an episode lasting 3,600 ms, where each state is presented to the RNN as a 20 ms input signal and to the DQN as a single value. The agent receives state $s_1$ at 0 ms. At this point, all circles are active. At 900 ms, the first circle turns inactive, and the agent receives state $s_2$. In other words, the agent only receives the next state after the previous state has changed. In this case, the changes are caused by the circle turning inactive due to time constraints preset in the task. The final state, $s_5$, is a terminal state where all the circles are inactive. Note that each action given by the Q network is only executed at the time points defined by the RNN.
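The thresholding rule described above can be sketched as follows. The function name `time_to_act` and the representation of the output activity as a plain list sampled at fixed time steps are assumptions for illustration; only the threshold value of 0.5 and the random fallback come from the text.

```python
import random


def time_to_act(trajectory, threshold=0.5, rng=random):
    """Return the index of the first sample whose activity exceeds the threshold.

    If the activity never exceeds the threshold, fall back to a random time
    index so that the agent still explores different time points.
    """
    for t, z in enumerate(trajectory):
        if z > threshold:
            return t
    return rng.randrange(len(trajectory))
```

Passing a seeded `random.Random` instance as `rng` makes the exploratory fallback reproducible.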

Time and Action Co-Training in Reinforcement Learning Agent
At the start of an episode, an agent explores the environment by selecting random circles to click. At the end of the episode, the agent collects a set of different experience tuples (s t , a t , r t+δt , s t+δt ) that are used to train the DQN and RNN.

DQN
The parameters of the Q network $\theta$ are iteratively updated using Eqs. 9, 10 for action $a_t$ taken in state $s_t$, which results in reward $r_{t+\delta t}$.
$$\theta_{t+1} = \theta_t + \alpha\left(y - Q(s_t, a_t; \theta_t)\right)\nabla_{\theta_t} Q(s_t, a_t; \theta_t) \tag{9}$$
$$y = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta_t) \tag{10}$$
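The update in Eqs. 9, 10 can be illustrated with a deliberately simplified approximator. The sketch below swaps the deep network for a linear Q function, $Q(s, a; \theta) = \theta_a \cdot \phi(s)$, so that the gradient with respect to $\theta_a$ is simply the feature vector $\phi(s)$. The feature vectors, function names, default hyperparameters, and the handling of terminal states are assumptions of this example.

```python
def q_values(theta, phi):
    """Q(s, a; theta) for every action a, with one weight row per action."""
    return [sum(w * f for w, f in zip(row, phi)) for row in theta]


def dqn_step(theta, phi, a, r, phi_next, alpha=0.01, gamma=0.9, terminal=False):
    """One semi-gradient update of theta in place; returns the TD error."""
    # Target y: reward plus discounted best next value (no bootstrap if terminal).
    y = r if terminal else r + gamma * max(q_values(theta, phi_next))
    td_error = y - q_values(theta, phi)[a]
    # For a linear Q, the gradient w.r.t. theta[a] is phi, so only that row changes.
    theta[a] = [w + alpha * td_error * f for w, f in zip(theta[a], phi)]
    return td_error
```

With a deep network, the same target and error would be used, but the gradient would be obtained by backpropagation rather than read off directly.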

Recurrent Neural Network
In the RNN, both the recurrent weights and output weights were updated at every $\Delta t = 10$ ms, using the collected experiences. The recursive least squares (RLS) algorithm Åström and Wittenmark (2013) is a basic recursive application of the least squares algorithm. Given an input signal $x_1, x_2, \ldots, x_n$ and the set of desired responses $y_1, y_2, \ldots, y_n$, the RLS updates the parameters $W^{Rec}$ and $W^{Out}$ to minimize the mean difference between the desired and the actual output of the RNN (which is the firing rate $fr_i$ of the recurrent neuron). In the proposed architecture, we generate the desired response of the recurrent neurons by adding a reward to the firing rate $fr_i(t)$ of neuron $i$ at time $t$, such that the desired firing rate decreases at time $t$ if $r_t < 0$ and increases if $r_t > 0$. The desired response of the output neurons was generated by adding a reward to the output activity $z$, as defined in Eq 7.
The error $e^{rec}_i(t)$ of recurrent neurons is computed using Eq 12, where $fr_i(t)$ is the firing rate of neuron $i$ at time $t$, and $r_t$ is the reward received at time $t$. The desired signal $fr_i(t) + reward(t)$ is clipped between $R_{min}$ and $R_{max}$ due to the high variance of the firing rate. The update of parameters $W^{Rec}$ is dictated by Eq 11, where $W^{Rec}_{ij}$ is the recurrent weight between the $i$th neuron and the $j$th neuron. The exact values of $Z_{min}$, $Z_{max}$, $R_{min}$, and $R_{max}$ are shown in Table 1. $Z_{min}$ and $Z_{max}$ act as clamping values of the desired output activity. In this study, the value of $Z_{max}$ was chosen to be close to the positive threshold (+0.5), and the value of $Z_{min}$ was chosen to be close to the negative threshold (−0.5). The parameter $\Delta t$ was set based on the existing population clock model research Buonomano and Maass (2009); Laje and Buonomano (2013).
In this study, we trained only a subset of recurrent neurons, which were randomly selected at the start of training. SubRec is a subset of randomly selected neurons from the population. For the experiments in this study, we selected 30% of the recurrent neurons for training. The square matrix P i governs the learning rate of the recurrent neuron i, which is updated at every Δt using Eq 13.
The output weights $W^{Out}_{ij}$ (weight between recurrent neuron $j$ and output neuron $i$) are also updated in a similar way; the error is calculated using Eq 14 as follows:
$$e^{out}_j(t) = z(t) - \max\left(Z_{min}, \min\left(z(t) + reward(t), Z_{max}\right)\right) \tag{14}$$
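The reward-based error of Eq 14 can be combined with a textbook recursive least squares step as sketched below. The clipped target follows Eq 14 directly; the weight and $P$-matrix updates use the standard RLS form, $P \leftarrow P - \frac{P\,fr\,fr^{T} P}{1 + fr^{T} P\,fr}$ and $w \leftarrow w - e\,P\,fr$, which is assumed here rather than taken from Eqs 11 and 13. All function names are hypothetical, and `z_min`/`z_max` are passed in because their values are given in Table 1.

```python
def output_error(z, reward, z_min, z_max):
    """Eq 14: actual output minus the reward-shifted, clipped target."""
    target = max(z_min, min(z + reward, z_max))
    return z - target


def rls_update(w, p, fr, error):
    """One standard RLS step for weights w given firing rates fr.

    p is the running inverse-correlation matrix (symmetric, list of lists).
    Returns the updated (w, p) without modifying the inputs.
    """
    n = len(fr)
    p_fr = [sum(p[i][j] * fr[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(fr[i] * p_fr[i] for i in range(n))
    new_p = [[p[i][j] - p_fr[i] * p_fr[j] / denom for j in range(n)]
             for i in range(n)]
    # For symmetric p, (updated P) @ fr equals p_fr / denom.
    gain = [v / denom for v in p_fr]
    new_w = [w[i] - error * gain[i] for i in range(n)]
    return new_w, new_p
```

A positive reward makes the error negative, so the update pushes the output upward at that time point; a negative reward does the opposite, within the clamping range.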

Different Scenarios
To understand the proficiency of this model, we trained and tested the agent on multiple different scenarios with different time intervals and different numbers of circles. We observed that the agent learned to produce a neural trajectory that peaked at the time-to-act intervals with near-perfect accuracy. Figure 3 demonstrates the learned neural trajectory of a few of the scenarios we trained. The colored bars in Figure 3 indicate the correct time-to-act interval. The proposed RNN training method exhibited some notable behavioral features, such as the following: 1) the agent learned to subdue its activity as soon as it observed a new state, analogous to restarting a clock, and 2) depending on the observed state, the agent learned to ramp its activity to peak at the time-to-act. We also observed that the agent could learn to do the same without training the recurrent weights (i.e., by only training the output weights W Out ). However, by training a percentage of the recurrent neurons, we observed that the agent could learn to produce the desired activity in relatively fewer episodes of training.

Temporal Scaling
It is interesting how humans can execute their actions, such as speaking, writing, or playing music at different speeds. Temporal scaling is another feature we observed in our proposed method. A few studies have explored temporal scaling in humans Diedrichsen et al. (2007); Collier and Wright (1995), particularly the study by Hardy et al. (2018), which modeled temporal scaling using an RNN and a supervised learning method. Their approach involved training recurrent neurons using a second RNN that generates a target output for each of the recurrent neurons in the population. Unfortunately, this approach is not feasible with an online learning algorithm such as reinforcement learning. So, to explore the possibility of temporal scaling with our method, we trained the model using an additional speed input (shown in Figure 4), using the same approach as is outlined in Eqs. 11, 12, 14. In this set-up, the RNN receives both a state input and a speed input. The speed input is a constant value given only when there is a state input; for the rest of the time, the speed input is zero. We trained the model only with one speed (speed 1) and tested it at three different speeds: speed 1.3, speed 0.01, and speed 0.8. Figure 5 shows the results. We observed that the shift in click time with respect to speed could be defined using Eq 15. We used a similar procedure to that described in Section 2.3.2 to train for temporal scaling.
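The two-channel input described above (a state input plus a speed input that is nonzero only while a state is presented) can be sketched as follows. The 20 ms pulse length and 3,600 ms episode length come from the earlier description of the task; the 1 ms sampling resolution, the state amplitude, the use of the speed value itself as the constant, and the function name `build_inputs` are assumptions of this example.

```python
def build_inputs(state_onsets_ms, speed, episode_ms=3600, pulse_ms=20,
                 state_amp=1.0):
    """Return (state_channel, speed_channel), one sample per millisecond."""
    state_channel = [0.0] * episode_ms
    speed_channel = [0.0] * episode_ms
    for onset in state_onsets_ms:
        for t in range(onset, min(onset + pulse_ms, episode_ms)):
            state_channel[t] = state_amp
            speed_channel[t] = speed  # constant, present only with a state input
    return state_channel, speed_channel
```

For instance, `build_inputs([0, 900], speed=1.3)` yields two 20 ms pulses on both channels and zeros everywhere else.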

Learning to Plan Multiple Future Times-to-Act
One of the inherent properties of an RNN is that it can produce multiple peaks at different time points, even with only one input at the start of the trial. Results of the study by Hardy et al. (2018) showed that the output of the RNN (trained using supervised learning) peaked at multiple time points given a single input of 250 ms at the start of the trial. To understand whether an agent could learn to plan such multiple future times-to-act given one state using the proposed training, we trained an agent on a slightly modified task-switching scenario. Here, the agent needed to click on the first circle at three different time intervals, 400-500 ms, 1,000-1,100 ms, and 1,700-1,800 ms, and on the second circle at 2,300-2,400 ms. The first circle was set to deactivate at 1,801 ms. At the first state s1, the agent learned to produce a neural trajectory that peaked at three intervals, followed by state s2, which peaked at 2,300-2,400 ms, as shown in Figure 6.
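When a single state yields several times-to-act, each upward threshold crossing of the output activity can be read out as a separate action time. The helper below is a hypothetical sketch of that read-out, assuming the activity is available as a list with one sample per millisecond and using the 0.5 threshold mentioned earlier.

```python
def crossing_times(trajectory, threshold=0.5):
    """Return the indices where the activity rises above the threshold."""
    times = []
    above = False
    for t, z in enumerate(trajectory):
        if z > threshold and not above:
            times.append(t)  # start of a new supra-threshold episode
        above = z > threshold
    return times
```

Applied to a trajectory with three bumps centered inside the 400-500, 1,000-1,100, and 1,700-1,800 ms windows, the function returns one crossing per window.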

Skip State Test
As seen in experiment 3, the multiple peaks (multiple times-to-act) that the agent was producing could be based on an inherent property of the RNN. In reinforcement learning, however, the peak at the time-to-act should be truly dependent on each input state and also leverage the temporal properties of the RNN.
Hence, to evaluate whether the learned network was truly dependent on the state, we tested it by skipping one of the input states. As Figure 7 shows, when the agent did not receive a state at 2,400 ms, it did not choose to act during the 3,200-3,300 ms interval, indicating that the learned time-to-act is truly state dependent.

Task Switching With 20 Tasks
To investigate the scalability of the proposed method to a relatively large state space, we trained and tested the model in a scenario consisting of 20 circles with 20 different times-to-act. Figure 8 demonstrates that the agent could indeed still learn the time-to-act with near-perfect accuracy.

Memory Task
The above experiments show that the agent was able to learn and employ its time representation in multiple ways. However, we were also interested in how long an agent can remember a given input. To investigate this, we delayed the time-to-act by 2,000 ms after the offset of the input and trained the agent. The trained agent remembered a state seen at 0-20 ms until 2,000 ms (see Figure 9), as indicated by the peak in the output activity. We also trained the agent to remember a state until 3,000 ms. With the current number of recurrent neurons (i.e., 300 neurons), the agent was not able to remember for 3,000 ms from the offset of an input.

Shooting a Moving Target
Similar to the task-switching experiment, we trained the RL agent to learn "when to act" in a different scenario. In this scenario, the agent is rewarded for shooting a moving target. The target is the blob of a moving damped pendulum. The length of the pendulum is 1 m, and the mass of the blob is 1 kg. We trained the DQN to select the direction of shooting and the RNN to learn the exact time to release the trigger. The agent was rewarded positively for hitting the blob within an error of 0.1 m and negatively if it missed the target. The learned activity is shown in Figure 10; the left panel shows the motion of the pendulum, and the right panel shows the learned RNN activity. The threshold in this experiment was 0.05, and the agent was able to hit the blob 5 times in 3,000 ms. Although it is still not clear why the agent did not peak its activity from 0 to 1,500 ms, the agent showed better performance after 1,500 ms.
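A moving target of this kind can be sketched with a simple damped-pendulum simulation. The length (1 m), mass (1 kg), hit tolerance (0.1 m), and 3,000 ms duration come from the text; the damping coefficient, initial angle, integration scheme (semi-implicit Euler with a 1 ms step), and function names are assumptions made only for illustration.

```python
import math


def simulate_pendulum(theta0=1.0, length=1.0, mass=1.0, damping=0.5,
                      g=9.81, dt=0.001, steps=3000):
    """Return the (x, y) position of the pendulum blob at each time step."""
    theta, omega = theta0, 0.0
    positions = []
    for _ in range(steps):
        # Angular acceleration from gravity plus viscous damping.
        accel = (-(g / length) * math.sin(theta)
                 - damping / (mass * length ** 2) * omega)
        omega += accel * dt
        theta += omega * dt
        positions.append((length * math.sin(theta), -length * math.cos(theta)))
    return positions


def is_hit(shot_xy, blob_xy, tol=0.1):
    """A shot counts as a hit if it lands within `tol` meters of the blob."""
    return math.dist(shot_xy, blob_xy) <= tol
```

In such a setup, the reward at a given time step would depend on whether the position targeted by the chosen shooting direction satisfies `is_hit` for the blob position at that step.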

COMPARISON WITH LONG SHORT-TERM MEMORY (LSTM) NETWORK
A recent study by Deverett et al. (2019) investigated the interval timing abilities in a reinforcement learning agent. In the study, an RL agent was trained to reproduce a given temporal interval. However, the time representation in the study was in the form of movement (or velocity) control. In other words, the agent had to move from one point to the goal point within the same interval as presented at the start of the experiment. The agent which used the LSTM network in this study by Deverett et al. (2019) performed the task with near-perfect accuracy, indicating the ability to learn temporal properties using LSTM networks. Following these findings, our study endeavors to understand if an agent can learn a direct representation of time (instead of an indirect representation of time, such as velocity or acceleration) using LSTM. In order to investigate in this direction, we trained an RL agent with only one LSTM network as its DQN network (no RNN was used in this test) on the same task-switching scenario. The input sequence for an RNN works in terms of dt (as shown in Eq 6), whereas input for LSTM works in terms of sequence length, as shown in Figure 11. For example, an input signal with a length of 3,000 ms can be given as 1 ms at a time to an RNN, and for LSTM, the same input should be divided into a fixed length to effectively capture the temporal properties in the input. We used an LSTM with 100 input nodes and gave an input signal of 100 ms to the network, followed by the next 100 ms. Indeed, the sequence length can be smaller than 100 ms. In our experiments, we trained the agent with different sequence lengths (50, 100, 200, and 300 ms), and the agent showed better performance for 300 ms (results for 50, 100, and 200 ms are given in the Appendix). The architecture of the LSTM we used contained one LSTM layer with 256 hidden units, 300 input nodes, and two linear layers with 100 nodes each. The output size of the network was 300, which resulted in an activity of n points for a given input signal of n ms. The hidden states of the LSTM network were carried on throughout the episode. The trained activity of the LSTM network is shown in Figure 12 (bottom), where the light blue region shows the output activity of the network. The colored bars in Figure 12 show the output activity of the LSTM network and the correct time-to-act intervals for clicking each circle. The LSTM network did learn to exceed the threshold indicating when to act at a few time-to-act intervals. However, there is periodicity learned by the network, meaning that for every 300 ms, the network learned to produce similar activity.
FIGURE 12 | Output activity of the trained LSTM network for a task-switching scenario containing four circles, with time-to-act intervals shown in colored bars.
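The fixed-length division of the input described for the LSTM can be sketched as below. The helper name `chunk_signal` and the representation of the input as a list with one sample per millisecond are assumptions for illustration; the sequence lengths are those mentioned in the text.

```python
def chunk_signal(signal, seq_len):
    """Split a 1 ms-resolution signal into consecutive chunks of seq_len samples.

    The last chunk is shorter if len(signal) is not a multiple of seq_len.
    """
    return [signal[i:i + seq_len] for i in range(0, len(signal), seq_len)]
```

A 3,600 ms episode split with `seq_len=300` gives twelve chunks, each of which would be fed to the network in turn while the hidden state is carried over between chunks.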

DISCUSSION
In this study, we trained a reinforcement learning agent to learn "when to act" using an RNN and "what to act" using a DQN. We introduced a reward-based recursive least square algorithm to train the RNN. By disentangling the process of learning the temporal and spatial aspects of action into independent tasks, we intend to understand explicit time representation in an RL agent. Through this strategy, the agent learned to create its representation of time. Our experiments, which employed a peak-interval style, show that the agent could learn to produce a neural trajectory that peaked at the time-to-act with near-perfect accuracy. We also observed several other intriguing behaviors.
• The agent learned to subdue its activity immediately after observing a new state. We interpreted this as the agent restarting its clock.
• The agent was able to temporally scale its actions in our proposed learning method. Even though we trained the agent with a single-speed value (speed 1), it learned to temporally scale its action to speeds that were both lower (speed 0.01) and higher (speed 1.3) than the trained speed. Notably, the agent was not able to scale its actions beyond speed 1.3.
• We observed that neural networks such as the LSTM might not be able to learn an explicit representation of time when compared with population clock models. Deverett et al. (2019) showed that an RL agent can scale its actions (increase or decrease the velocity) using the LSTM network. However, when we trained the LSTM network to learn a direct representation of the time, it learned periodic activity.
• In this research study, we trained an RL agent in an environment similar to task switching: shooting a moving target. The target in our experiment is a blob of a damped pendulum with a length of 1 m and a mass of 1 kg. The agent was able to shoot the fast-moving blob by learning to shoot at a few near-accurate time points.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.