<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Control. Eng.</journal-id>
<journal-title>Frontiers in Control Engineering</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Control. Eng.</abbrev-journal-title>
<issn pub-type="epub">2673-6268</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">722092</article-id>
<article-id pub-id-type="doi">10.3389/fcteg.2021.722092</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Control Engineering</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Time and Action Co-Training in Reinforcement Learning Agents</article-title>
<alt-title alt-title-type="left-running-head">Akella and Lin</alt-title>
<alt-title alt-title-type="right-running-head">Time and Action Co-Training</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Akella</surname>
<given-names>Ashlesha</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1363996/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Lin</surname>
<given-names>Chin-Teng</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/75218/overview"/>
</contrib>
</contrib-group>
<aff>Faculty of Engineering and Information Technology (FEIT), School of Computer Science, Australian Artificial Intelligence Institute, University of Technology Sydney, <addr-line>Sydney</addr-line>, <addr-line>NSW</addr-line>, <country>Australia</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1003547/overview">Qin Wang</ext-link>, Yangzhou University, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1004005/overview">Peng Liu</ext-link>, North University of China, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1379063/overview">Tianhong Liu</ext-link>, Yangzhou University, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Chin-Teng Lin, <email>chin-teng.lin@uts.edu.au</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Nonlinear Control, a section of the journal Frontiers in Control Engineering</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>08</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>2</volume>
<elocation-id>722092</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>06</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>07</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Akella and Lin.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Akella and Lin</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>In formation control, a robot (or an agent) learns to align itself in a particular spatial alignment. However, in a few scenarios, it is also vital to learn temporal alignment along with spatial alignment. An effective control system encompasses flexibility, precision, and timeliness. Existing reinforcement learning algorithms excel at learning to select an action given a state. However, executing an optimal action at an appropriate time remains challenging. Building a reinforcement learning agent which can learn an optimal time to act along with an optimal action can address this challenge. Neural networks in which timing relies on dynamic changes in the activity of population neurons have been shown to be a more effective representation of time. In this work, we trained a reinforcement learning agent to create its representation of time using a neural network with a population of recurrently connected nonlinear firing rate neurons. Trained using a reward-based recursive least square algorithm, the agent learned to produce a neural trajectory that peaks at the &#x201c;time-to-act&#x201d;; thus, it learns &#x201c;when&#x201d; to act. A few control system applications also require the agent to temporally scale its action. We trained the agent so that it could temporally scale its action for different speed inputs. Furthermore, given one state, the agent could learn to plan multiple future actions, that is, multiple times to act without needing to observe a new&#x20;state.</p>
</abstract>
<kwd-group>
<kwd>reinforcement learning</kwd>
<kwd>recurrent neural network</kwd>
<kwd>time perception</kwd>
<kwd>formation control</kwd>
<kwd>temporal scaling</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>A powerful formation control system requires continuously monitoring the current state, evaluating performance, and deciding whether to take the necessary actions. This process requires not only understanding the system&#x2019;s state and the optimal actions but also learning the appropriate time to perform each action. Deep reinforcement learning algorithms, which have achieved remarkable success in robotics, video games, and board games, have also been shown to perform well in adaptive control problems <xref ref-type="bibr" rid="B16">Li et&#x20;al. (2019)</xref>; <xref ref-type="bibr" rid="B22">Oh et&#x20;al. (2015)</xref>; <xref ref-type="bibr" rid="B30">Xue et&#x20;al. (2013)</xref>. However, the challenge of learning the precise time to act has not been directly addressed.</p>
<p>The ability to measure time from the start of a state change and to act on that measurement is essential in applications such as adaptive control systems. In general, the environment can be encoded in four dimensions: the three dimensions of space and the dimension of time. The representation of time affects the decision-making process along with the spatial aspects of the environment <xref ref-type="bibr" rid="B14">Klapproth (2008)</xref>. However, in the field of reinforcement learning (RL), this essential role of time is not explicitly acknowledged, and existing RL research focuses mainly on the spatial dimensions. The lack of a sense of time might not be an issue in a simple behavioral task, but many tasks in control systems require precisely timed actions, for which an artificial agent must learn a representation of time and experience its passage.</p>
<p>Research on time representation has yielded several different supervised learning models, such as the ramping firing rate model <xref ref-type="bibr" rid="B11">Durstewitz (2003)</xref>, multiple oscillator models <xref ref-type="bibr" rid="B18">Matell et&#x20;al. (2003)</xref>; <xref ref-type="bibr" rid="B19">Miall (1989)</xref>, diffusion models <xref ref-type="bibr" rid="B25">Simen et&#x20;al. (2011)</xref>, and the population clock model <xref ref-type="bibr" rid="B3">Buonomano and Laje (2011)</xref>. In some of these models, such as the two presented in the studies by <xref ref-type="bibr" rid="B12">Hardy et&#x20;al. (2018)</xref> and <xref ref-type="bibr" rid="B15">Laje and Buonomano (2013)</xref>, timing relies on dynamic changes in the activity patterns of neuron populations; more specifically, it relies on nonlinear firing rate neurons connected recurrently. Research has shown that these models are the most effective <xref ref-type="bibr" rid="B3">Buonomano and Laje (2011)</xref> and that they best account for timing and temporal scaling among the available models. Extending this work toward a sense of time for artificial agents, we used a population clock model, a recurrent neural network (RNN) consisting of nonlinear firing rate neurons, as our timing module and trained a reinforcement learning agent to create its own representation of&#x20;time.</p>
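<p>The population clock dynamics described above can be made concrete with a short simulation. The sketch below Euler-integrates a standard firing-rate network, &#x3c4; dx/dt = &#x2212;x + g W r + w<sub>in</sub> u with r = tanh(x); the parameter values anticipate Table 1 (300 neurons, g = 1.6, &#x3c4; = 25 ms, &#x394;t = 10 ms, connection probability 0.2), but the function names and the input pulse are illustrative assumptions, not the implementation used in this study.</p>

```python
import numpy as np

def simulate_firing_rate_rnn(steps=200, n=300, g=1.6, tau=25.0, dt=10.0,
                             p_conn=0.2, seed=0):
    """Euler simulation of a recurrently connected firing-rate network:
    tau * dx/dt = -x + g * (W_rec @ r) + w_in * u(t), with r = tanh(x).
    Parameter values mirror Table 1; the equations follow the standard
    population clock / firing-rate model (Sompolinsky et al., 1988)."""
    rng = np.random.default_rng(seed)
    # Sparse random recurrent weights with connection probability p_conn,
    # scaled so the network sits in the high-gain regime for g > 1.
    mask = rng.random((n, n)) < p_conn
    w_rec = rng.normal(0.0, 1.0 / np.sqrt(p_conn * n), (n, n)) * mask
    w_in = rng.normal(0.0, 1.0, n)                 # one input neuron
    w_out = rng.normal(0.0, 1.0 / np.sqrt(n), n)   # one output neuron
    x = rng.normal(0.0, 0.5, n)                    # membrane-like state
    outputs = []
    for t in range(steps):
        u = 5.0 if t < 5 else 0.0                  # brief pulse starts the clock
        r = np.tanh(x)
        x = x + (dt / tau) * (-x + g * (w_rec @ r) + w_in * u)
        outputs.append(float(w_out @ np.tanh(x)))
    return np.array(outputs)

traj = simulate_firing_rate_rnn()
```

<p>Reading out a weighted sum of the rates yields the kind of reproducible neural trajectory that the timing module shapes during training.</p>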
<p>It is arguable that a traditional artificial neural network, such as a multilayer perceptron, which has been proven to learn complex spatial patterns, could also be used to learn a representation of time. However, such networks might not be well suited even to a simple interval-discrimination task, owing to their lack of an implicit representation of time <xref ref-type="bibr" rid="B4">Buonomano and Maass (2009)</xref>. One argument is that a traditional artificial neural network processes inputs and outputs as static spatial patterns, whereas an effective control agent must process the state of the system continuously. For instance, if we want an agent to process a continuous-time input, such as video from a game, we divide the input into multiple time bins. Similarly, deep neural network (DNN) models with long short-term memory (LSTM) units <xref ref-type="bibr" rid="B13">Hochreiter and Schmidhuber (1997)</xref> or gated recurrent units (GRUs) <xref ref-type="bibr" rid="B6">Chung et&#x20;al. (2014)</xref> can implicitly represent time by allowing the state of the previous time step to interact with the state of the current time step. However, because these networks expect the input to be discretised into multiple time bins <xref ref-type="bibr" rid="B2">Bakker (2002)</xref>; <xref ref-type="bibr" rid="B4">Buonomano and Maass (2009)</xref>, they still treat time as a spatial dimension and thus lack an explicit representation of time.</p>
<p>Through the lens of RL algorithms, the problem of discretising input into multiple time bins can be explained as follows. Given the current state of the environment <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, a DNN function approximator (for example, a policy network) outputs an action at <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> at every time step <italic>t</italic>. If an action <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is more valuable when executed at time <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> or <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, then to maximize the summation of future rewards effectively, we would need to divide the input into still smaller time steps. Dividing time more finely allows an agent to learn the true value of a state, although at the expense of higher computational cost and increased variance in the state-value estimates <xref ref-type="bibr" rid="B23">Petter et&#x20;al. (2018)</xref>. A few studies <xref ref-type="bibr" rid="B5">Carrara et&#x20;al. (2019)</xref>; <xref ref-type="bibr" rid="B27">Tallec et&#x20;al. (2019)</xref>; <xref ref-type="bibr" rid="B10">Doya (2000)</xref> have elegantly extended reinforcement learning algorithms to continuous time and state spaces, which generalizes value function approximators over time. However, an agent that has developed its own representation of time could learn to encode the optimal time intervals explicitly and, in turn, decide when to act. In this study, we present a model of how such a time representation can be learned and how the subsequent encoding can take&#x20;place.</p>
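<p>As a toy illustration of why the choice of time step matters, an agent that decides once per time step can only collect a reward whose window contains one of its decision points. The window and horizon values below are hypothetical and chosen purely for illustration:</p>

```python
def reachable_reward_windows(step_ms, window_ms=(800, 900), horizon_ms=4000):
    """Return the decision times (in ms) at which an agent acting once per
    `step_ms` lands inside the reward window. With a coarse grid the window
    may contain no decision point at all. Illustrative numbers only."""
    lo, hi = window_ms
    return [t for t in range(0, horizon_ms + 1, step_ms) if lo <= t <= hi]

# A 1000 ms grid skips the 800-900 ms window entirely; a 50 ms grid hits it.
assert reachable_reward_windows(1000) == []
assert reachable_reward_windows(50) == [800, 850, 900]
```

<p>Shrinking the step recovers the reward, but multiplies the number of decisions the value function must evaluate per episode.</p>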
<p>In this research, we developed a new scenario called &#x201c;task switching,&#x201d; in which an agent is presented with multiple circles to click (tasks), and each circle must be clicked within a specific time window and in a specific order. This scenario encapsulates both spatial and timing decisions. The task was designed to be analogous to a multi-input multi-output (MIMO) system in process control, where the controller must monitor the current state of the system and decide when to make parameter changes.</p>
<p>This research investigates the co-learning of decision making and the development of timing by an artificial agent within a reinforcement learning framework. We achieve this by disentangling the learning of the optimal action (which circle to click) from the learning of time representation (when to click a circle). We designed a novel architecture that contains two modules: 1. a timing module that uses a population clock model, a recurrent neural network (RNN) consisting of nonlinear firing rate neurons, and 2. an action module that employs a deep Q-network (DQN) <xref ref-type="bibr" rid="B21">Mnih et&#x20;al. (2015)</xref> to learn the optimal action given a specific state. The RNN and the DQN are co-trained so that the agent learns both when to act and which action to take. The RNN was trained using a reward-based recursive least squares algorithm, and the DQN was trained using the Bellman equation. The results of a series of task-switching experiments show that the agent learned to produce a neural trajectory, reflecting its own sense of time, that peaked at the correct time-to-act. Furthermore, the agent was able to temporally scale its time-to-act, acting more quickly or more slowly according to the input speed. We also compared the performance of the proposed architecture with that of DNN models such as the LSTM, which can implicitly represent time. We observed that for tasks involving precisely timed actions, models such as the population clock model perform better than the&#x20;LSTM.</p>
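<p>The reward-based recursive least squares training mentioned here can be sketched, schematically, as a FORCE-style RLS readout update in which the weight change is gated by the outcome. The gating flag and function names below are our own simplifications, not the authors&#x2019; exact algorithm:</p>

```python
import numpy as np

def rls_step(P, w, r, error, apply_update=True):
    """One recursive least squares (FORCE-style) readout update.
    P: running inverse-correlation estimate (n x n); w: readout weights (n,);
    r: firing-rate vector (n,); error: network output minus target.
    `apply_update` stands in for the reward gate: in a reward-based variant,
    the weight change is applied only when the outcome warrants it."""
    Pr = P @ r
    k = Pr / (1.0 + r @ Pr)        # RLS gain vector
    P_new = P - np.outer(k, Pr)    # rank-1 downdate of P
    w_new = w - error * k if apply_update else w
    return P_new, w_new
```

<p>Iterating this update over successive rate vectors drives the readout weights toward the least-squares solution; the initialization P = I&#x2219;1e&#x2212;3 listed in Table 1 corresponds to a strongly regularized starting estimate.</p>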
<p>This article first presents the task-switching scenario and describes the proposed architecture and training methodology used in the work. <xref ref-type="sec" rid="s3">Section 3</xref> presents the performance of the trained RL agent on six different experiments. In <xref ref-type="sec" rid="s4">Section 4</xref>, we present the performance of LSTM in comparison with the proposed model. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> presents an extensive discussion about the learned time representation with respect to prior electrophysiology studies.</p>
</sec>
<sec id="s2">
<title>2 Methods</title>
<sec id="s2-1">
<title>2.1&#x20;Task-Switching Scenario</title>
<p>In the scenario, there are <italic>n</italic> different circles, and the agent must learn to click on each circle within a specific time interval and in a specific order. This task involves learning to decide which circle to click and when that circle should be clicked. <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> shows an example scenario with four circles. Circle 1 must be clicked at some point between 800 and 900&#xa0;ms. Similarly, circles 2, 3, and 4 must be clicked at 1,500&#x2013;1,600, 2,300&#x2013;2,400, and 3,300&#x2013;3,400&#xa0;ms, respectively. If the agent clicks the correct circle in the correct time period, it receives a positive reward. If it clicks a circle at the incorrect time, it receives a negative reward (refer to <xref ref-type="table" rid="T1">Table&#x20;1</xref> for the exact reward values). Each circle becomes inactive once its time interval has passed. For example, circle 1 in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> becomes inactive at 901&#xa0;ms, meaning that the agent cannot click it after 900&#xa0;ms and receives a reward of 0 if it attempts to click the inactive circle. Each circle can only be clicked once during an episode.</p>
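<p>The reward structure of the scenario can be summarized in a minimal environment sketch. The windows, rewards (+3 inside a circle&#x2019;s window, &#x2212;0.05 at an incorrect time, 0 for an inactive circle), and the 10 ms step follow the text and Table 1, but the class and its interface are hypothetical simplifications rather than the authors&#x2019; code:</p>

```python
class TaskSwitchingEnv:
    """Minimal sketch of the four-circle task-switching scenario."""

    WINDOWS_MS = {1: (800, 900), 2: (1500, 1600),
                  3: (2300, 2400), 4: (3300, 3400)}

    def __init__(self, dt=10):
        self.dt = dt          # 10 ms step, as in Table 1
        self.t = 0
        self.clicked = set()

    def step(self, action=None):
        """Advance time by dt; `action` is a circle id or None (no click)."""
        self.t += self.dt
        if action is None:
            return 0.0
        lo, hi = self.WINDOWS_MS[action]
        if action in self.clicked or self.t > hi:
            return 0.0        # inactive circle: already clicked or expired
        if lo <= self.t <= hi:
            self.clicked.add(action)
            return 3.0        # correct circle in the correct window
        return -0.05          # active circle clicked too early
```

<p>An episode is then a sequence of timed clicks; for example, waiting until 850&#xa0;ms and clicking circle 1 yields the positive reward, while clicking circle 2 at that time is penalized.</p>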
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Task-switching scenario with four circles. Circle 1 (in blue) must be clicked at a time point between 800 and 900&#xa0;ms from the start of the experiment. Circles 2 (in green), 3 (in orange), and 4 (in yellow) must be clicked in the 2,300&#x2013;2,400, 3,300&#x2013;3,400, and 1,500&#x2013;1,600&#xa0;ms intervals, respectively.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g001.tif"/>
</fig>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Model parameters.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Parameter</th>
<th align="left">Values</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Number of recurrent neurons</td>
<td align="center">300</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">10&#xa0;ms</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">&#x2212;0.3</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">0.6</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">&#x2212;0.4</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">0.6</td>
</tr>
<tr>
<td align="left">
<italic>P</italic>
</td>
<td align="center">
<inline-formula id="inf11">
<mml:math id="m11">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x2217;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mi>e</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">Positive reward</td>
<td align="center">3</td>
</tr>
<tr>
<td align="left">Negative reward</td>
<td align="center">&#x2212;0.05</td>
</tr>
<tr>
<td align="left">Recurrent neuron connection probability</td>
<td align="center">0.2</td>
</tr>
<tr>
<td align="left">g (gain of the network)</td>
<td align="center">1.6</td>
</tr>
<tr>
<td align="left">
<italic>&#x03C4;</italic>
</td>
<td align="center">25&#xa0;ms</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The same scenario was modified to conduct the following experiments:<list list-type="simple">
<list-item>
<p>&#x2022; Co-training time and action in a reinforcement learning agent on a simple task-switching scenario.</p>
</list-item>
<list-item>
<p>&#x2022; Temporal scaling: the time intervals of each circle occur at different speeds. For instance, at Speed 2, circle 1 in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> must be clicked between 750 and 850&#xa0;ms; similarly, circles 2, 3, and 4 must be clicked at 1,450&#x2013;1,550, 2,250&#x2013;2,350, and 3,250&#x2013;3,350&#xa0;ms, respectively.</p>
</list-item>
<list-item>
<p>&#x2022; Multiple clicks: one circle should be clicked multiple times without any external cue. For instance, after circle 1 is clicked and without any further stimulus input, the agent should learn to click the same circle after a fixed time interval.</p>
</list-item>
<list-item>
<p>&#x2022; Twenty circles: To understand if the agent can handle a large number of tasks, we trained the agent on a scenario containing 20 circles.</p>
</list-item>
<list-item>
<p>&#x2022; Skip state: in the task-switching scenario, the learned time-to-act should be a state-dependent action. In other words, when the state input is eliminated, the agent should not perform an action. For instance, if circle 4 in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> is removed from the state input, the agent should skip clicking on circle&#x20;4.</p>
</list-item>
</list>
</p>
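<p>For reference, the temporal scaling variant can be sketched as a mapping from a speed level to shifted reward windows. The 50&#xa0;ms shift per speed level reproduces the Speed 2 values quoted above; the linear generalization to other speeds is our assumption:</p>

```python
BASE_WINDOWS_MS = [(800, 900), (1500, 1600), (2300, 2400), (3300, 3400)]

def windows_for_speed(speed, shift_per_level_ms=50):
    """Reward windows under temporal scaling. Speed 1 is the base task; each
    additional speed level moves every window 50 ms earlier, matching the
    Speed 2 values given in the text (linear extrapolation is an assumption)."""
    d = (speed - 1) * shift_per_level_ms
    return [(lo - d, hi - d) for lo, hi in BASE_WINDOWS_MS]

assert windows_for_speed(2) == [(750, 850), (1450, 1550),
                                (2250, 2350), (3250, 3350)]
```
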
</sec>
<sec id="s2-2">
<title>2.2 Framework</title>
<p>To disentangle the learning of temporal and spatial aspects of the action space, the temporal aspect being when to act and the spatial being what to act on, we used two different networks: a DQN to learn which action to take and an RNN which learns to produce a neural trajectory that peaks at the time-to-act.</p>
<sec id="s2-2-1">
<title>2.2.1 Deep Q-Network</title>
<p>In recent years, RL algorithms have achieved tremendous successes <xref ref-type="bibr" rid="B28">Vinyals et&#x20;al. (2019)</xref>; <xref ref-type="bibr" rid="B20">Mnih et&#x20;al. (2013)</xref>; <xref ref-type="bibr" rid="B24">Silver et&#x20;al. (2017)</xref>. RL is typically formalized as a Markov decision process (MDP) defined by the state space <inline-formula id="inf12">
<mml:math id="m12">
<mml:mi mathvariant="script">S</mml:mi>
</mml:math>
</inline-formula>, the action space <inline-formula id="inf13">
<mml:math id="m13">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
</inline-formula>, and the reward function <inline-formula id="inf14">
<mml:math id="m14">
<mml:mrow>
<mml:mi mathvariant="normal">&#x211b;</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. At any given time step <italic>t</italic>, the agent receives a state <inline-formula id="inf15">
<mml:math id="m15">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, which it uses to select an action <inline-formula id="inf16">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and execute that action on the environment. Next, the agent receives a reward <inline-formula id="inf17">
<mml:math id="m17">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="normal">&#x211b;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and the environment changes from state <inline-formula id="inf18">
<mml:math id="m18">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to <inline-formula id="inf19">
<mml:math id="m19">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. For each action the agent performs on the environment, it collects <inline-formula id="inf20">
<mml:math id="m20">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, also called an experience tuple. An agent learns to take actions that maximize the accumulated future rewards, which can be expressed as <inline-formula id="inf21">
<mml:math id="m21">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as follows:<disp-formula id="e1">
<mml:math id="m22">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x221e;</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msup>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msup>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf22">
<mml:math id="m23">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi mathvariant="italic">&#x3f5;</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the discount factor that determines the importance of the immediate reward and the future reward. If <inline-formula id="inf23">
<mml:math id="m24">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, the agent will learn to choose actions that produce an immediate reward. If <inline-formula id="inf24">
<mml:math id="m25">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, the agent will evaluate its actions based on the sum of all its future rewards. To learn the sequence of actions that lead to the maximum discounted sum of future rewards, an agent estimates optimal values for all possible actions in a given state. These estimated values are defined by the expected sum of future rewards under a given policy &#x3c0;.<disp-formula id="e2">
<mml:math id="m26">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <inline-formula id="inf25">
<mml:math id="m27">
<mml:mrow>
<mml:msub>
<mml:mi>E</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the expectation under the policy &#x3c0;, and <inline-formula id="inf26">
<mml:math id="m28">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the expected sum of discounted rewards when the action <italic>a</italic> is chosen by the agent in the state <italic>s</italic> under a policy &#x3c0;. Q-learning <xref ref-type="bibr" rid="B29">Watkins and Dayan (1992)</xref> is a widely used reinforcement learning algorithm that enables the agent to update its <inline-formula id="inf27">
<mml:math id="m29">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> estimation iteratively by using the following formula:<disp-formula id="e3">
<mml:math id="m30">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>max</mml:mi>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where &#x3b1; is the learning rate, and <inline-formula id="inf28">
<mml:math id="m31">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the future value estimate. By iteratively updating the Q values based on the agent&#x2019;s experience, the Q function converges to the optimal Q function, which satisfies the following Bellman optimality equation:<disp-formula id="e4">
<mml:math id="m32">
<mml:mrow>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>max</mml:mtext>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2032;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>where <inline-formula id="inf29">
<mml:math id="m33">
<mml:mrow>
<mml:msup>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the optimal policy. Action <italic>a</italic> can be determined as follows:<disp-formula id="e5">
<mml:math id="m34">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext>argmax</mml:mtext>
</mml:mrow>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:msup>
<mml:mi>Q</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>When the state space and the action space are discrete and finite, the Q function can be a table that contains all possible state-action values. However, when the state and action spaces are large or continuous, a neural network is commonly used as a Q-function approximator <xref ref-type="bibr" rid="B21">Mnih et&#x20;al. (2015)</xref>; <xref ref-type="bibr" rid="B17">Lillicrap et&#x20;al. (2015)</xref>. In this work, we model a reinforcement learning agent that uses a fully connected DNN as a Q-function approximator to select one of the four circles.</p>
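The update rule of Eq. 3 and the greedy selection of Eq. 5 can be illustrated with a minimal tabular sketch before moving to the DNN approximator. The function names and the toy state/action sizes below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (cf. Eq. 3): move Q(s, a) toward
    the bootstrapped target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_action(Q, s):
    """Greedy action selection (cf. Eq. 5): a = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))

# Toy example: 5 states, 4 actions (one per circle).
Q = np.zeros((5, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
print(greedy_action(Q, 0))  # 2: the rewarded action now has the highest value
```

With a neural approximator, the table lookup `Q[s]` is replaced by a forward pass of the network, but the target and update have the same form.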
</sec>
<sec id="s2-2-2">
<title>2.2.2 Recurrent Neural Network</title>
<p>In this study, we used the population clock model for training the RL agent to learn the representation of time. In previous studies, this model has been shown to robustly learn and generate simple-to-complex temporal patterns <xref ref-type="bibr" rid="B15">Laje and Buonomano (2013)</xref>; <xref ref-type="bibr" rid="B12">Hardy et&#x20;al. (2018)</xref>. The population clock model (i.e.,&#x20;RNN) contains a pool of recurrently connected nonlinear firing rate neurons with random initial weights as shown at the top of <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>. To achieve &#x201c;time-to-act&#x201d; and temporal scaling of timing behavior, we trained the weights of both recurrent neurons and output neurons. The network we used in this study contained 300 recurrent neurons, as indicated by the blue neurons inside the green circle, plus one input and one output neuron. The dynamics of the network <xref ref-type="bibr" rid="B26">Sompolinsky et&#x20;al. (1988)</xref> are governed by <xref ref-type="disp-formula" rid="e6">Eqs 6</xref>&#x2013;<xref ref-type="disp-formula" rid="e8">8</xref>. Learning performance was similar with larger numbers of neurons and began to decline when only 200 neurons were used.<disp-formula id="e6">
<mml:math id="m35">
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mfrac>
<mml:mrow>
<mml:mtext>d</mml:mtext>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mtext>d</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo>&#x2b;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>I</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>In</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
<disp-formula id="e7">
<mml:math id="m36">
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>
<disp-formula id="e8">
<mml:math id="m37">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Proposed reinforcement learning architecture. <bold>(A)</bold> State input is received by the agent over an episode with a length of 3,600&#xa0;ms. The agent contains an RNN <bold>(B)</bold> and a deep Q-network <bold>(C)</bold>. The RNN receives a continuous input signal with state values for 20&#xa0;ms and zeros for the remaining time. The state values shown here are <inline-formula id="inf30">
<mml:math id="m38">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf31">
<mml:math id="m39">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf32">
<mml:math id="m40">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>3</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2.0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf33">
<mml:math id="m41">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>4</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf34">
<mml:math id="m42">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>5</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The weights <inline-formula id="inf35">
<mml:math id="m43">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>In</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> (the orange connections) are initialized randomly and held constant throughout the experiment. The weights <inline-formula id="inf36">
<mml:math id="m44">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf37">
<mml:math id="m45">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> (the blue connections) are initialized randomly and trained over the episodes. The DQN with one input and four output nodes receives the state value as its input and outputs the Q-value for each circle.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g002.tif"/>
</fig>
<p>Given a network that contains <italic>N</italic> recurrent neurons, <inline-formula id="inf38">
<mml:math id="m46">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the firing of the <inline-formula id="inf39">
<mml:math id="m47">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> recurrent neuron. <inline-formula id="inf40">
<mml:math id="m48">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which is an <inline-formula id="inf41">
<mml:math id="m49">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> weight matrix, defines the connectivity of the recurrent neurons; it is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of <inline-formula id="inf42">
<mml:math id="m50">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mo>&#x2217;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</inline-formula>, where <italic>g</italic> represents the gain of the network. Each input neuron is connected to every recurrent neuron in the network via <inline-formula id="inf43">
<mml:math id="m51">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>In</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which is an <inline-formula id="inf44">
<mml:math id="m52">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> input weight matrix. <inline-formula id="inf45">
<mml:math id="m53">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>In</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is initialized randomly from a normal distribution with a mean of 0 and a standard deviation of 1 and is fixed during training. Similarly, every recurrent neuron is connected to each output neuron via <inline-formula id="inf46">
<mml:math id="m54">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which is a <inline-formula id="inf47">
<mml:math id="m55">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> output weight matrix. In this study, we trained <inline-formula id="inf48">
<mml:math id="m56">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf49">
<mml:math id="m57">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> using a reward-based recursive least squares method. The variable <italic>y</italic> represents the activity level of the input neurons (states), and <italic>z</italic> represents the output. <inline-formula id="inf50">
<mml:math id="m58">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the state of the <inline-formula id="inf51">
<mml:math id="m59">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> recurrent neuron, which is initially zero, and &#x3c4; is the neuron time constant.</p>
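The dynamics of Eqs. 6–8 can be sketched with simple Euler integration. This is an illustrative simulation under stated assumptions, not the paper's code: the time constant `tau` and the initialization of the output weights are our own choices, while the gain, network size, and the 1/√(g·N) recurrent-weight scale follow the text:

```python
import numpy as np

def simulate_rnn(W_rec, W_in, W_out, y, tau=50.0, dt=10.0):
    """Euler-integrate the rate network of Eqs. 6-8.
    y: (T, I) input time series; returns the output z at each step."""
    N = W_rec.shape[0]
    x = np.zeros(N)                             # neuron states, initially zero
    zs = []
    for t in range(y.shape[0]):
        fr = np.tanh(x)                         # firing rates (Eq. 8)
        dx = (-x + W_rec @ fr + W_in @ y[t]) / tau
        x = x + dt * dx                         # Eq. 6, one Euler step
        zs.append(W_out @ np.tanh(x))           # linear readout (Eq. 7)
    return np.array(zs)

# Initialization as described in the text (W_out scale is assumed).
rng = np.random.default_rng(0)
N, g = 300, 1.6
W_rec = rng.normal(0.0, 1.0 / np.sqrt(g * N), size=(N, N))
W_in = rng.normal(0.0, 1.0, size=(N, 1))
W_out = rng.normal(0.0, 1.0 / np.sqrt(N), size=(1, N))
y = np.zeros((360, 1))
y[:2] = 1.0    # 20 ms input pulse at dt = 10 ms; zeros afterwards
z = simulate_rnn(W_rec, W_in, W_out, y)
print(z.shape)  # (360, 1)
```

The 360 steps of 10 ms correspond to one 3,600 ms episode.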
<p>Initially, due to the high gain caused by <inline-formula id="inf52">
<mml:math id="m60">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> (when <inline-formula id="inf53">
<mml:math id="m61">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.6</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), the network produces chaotic dynamics, which in theory can encode time over long durations <xref ref-type="bibr" rid="B12">Hardy et&#x20;al. (2018)</xref>. In practice, the recurrent weights need to be tuned to reduce this chaos and locally stabilize the output activity. The parameters, such as connection probability, <inline-formula id="inf54">
<mml:math id="m62">
<mml:mrow>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, g (gain of the network), and &#x3c4;, were chosen based on the existing population clock model research <xref ref-type="bibr" rid="B4">Buonomano and Maass (2009)</xref>; <xref ref-type="bibr" rid="B15">Laje and Buonomano (2013)</xref>. In this work, we trained both recurrent and output weights using a reward-based recursive least squares algorithm. During an episode, the agent chooses to act when the output activity exceeds a threshold (in this study, 0.5). We experimented with other threshold values between 0.4 and 1, but each produced results similar to those with 0.5. If the activity never exceeds the threshold, the agent chooses a random time point to act. This ensures that the agent tries different time points and acts before it learns the temporal nature of the&#x20;task.</p>
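The threshold rule with its random fallback can be sketched as follows; the function name is our own, and the fallback simply samples a uniform time step, which is one plausible reading of "a random time point":

```python
import numpy as np

def choose_act_time(z, threshold=0.5, rng=None):
    """Return the first time step where the RNN output exceeds the
    threshold; if it never does, fall back to a random time point so
    the agent still explores different action times early in training."""
    if rng is None:
        rng = np.random.default_rng()
    above = np.flatnonzero(z > threshold)
    if above.size > 0:
        return int(above[0])
    return int(rng.integers(len(z)))

z = np.array([0.1, 0.3, 0.6, 0.7])
print(choose_act_time(z))  # 2: the first crossing of the 0.5 threshold
```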
<p>As illustrated in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref> (left side), a sequence of state inputs is given to the agent during an episode lasting 3,600&#xa0;ms, where each state is a 20-ms input signal for the RNN network and a single value for the DQN. The agent receives state <inline-formula id="inf55">
<mml:math id="m63">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> at <inline-formula id="inf56">
<mml:math id="m64">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mtext>&#x2009;ms</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>. At this point, all circles are active. At <inline-formula id="inf57">
<mml:math id="m65">
<mml:mrow>
<mml:mn>900</mml:mn>
<mml:mtext>&#x2009;ms</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, the first circle turns inactive, and the agent receives state <inline-formula id="inf58">
<mml:math id="m66">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. In other words, the agent only receives the next state after the previous state has changed. In this case, the changes are caused by the circle turning inactive due to time constraints preset in the task. The final state, <inline-formula id="inf59">
<mml:math id="m67">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, is a terminal state where all the circles are inactive. Note that each action given by the Q network is only executed at the time points defined by the RNN network.</p>
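The state schedule above can be sketched as a simple lookup from elapsed episode time to state value. Note the assumption: the text specifies s1 at 0 ms, s2 at 900 ms, and a 3,600 ms episode ending in s5, so equal 900 ms intervals for the remaining transitions are inferred, not stated:

```python
def state_at(t_ms, state_values=(1.0, 1.5, 2.0, 2.5, 3.0), interval=900):
    """Map elapsed episode time (ms) to the current state value.
    Equal 900 ms intervals between transitions are an assumption."""
    idx = min(t_ms // interval, len(state_values) - 1)
    return state_values[idx]

print(state_at(0), state_at(900), state_at(3600))  # 1.0 1.5 3.0
```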
</sec>
</sec>
<sec id="s2-3">
<title>2.3 Time and Action Co-Training in Reinforcement Learning Agent</title>
<p>At the start of an episode, an agent explores the environment by selecting random circles to click. At the end of the episode, the agent collects a set of different experience tuples <inline-formula id="inf60">
<mml:math id="m68">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> that are used to train the DQN and&#x20;RNN.</p>
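A minimal container for these per-episode experience tuples might look like the sketch below; the class and field names are ours, chosen to mirror the (s_t, a_t, r_{t+δt}, s_{t+δt}) tuple in the text:

```python
from collections import namedtuple

# One experience tuple (s_t, a_t, r_{t+dt}, s_{t+dt}).
Experience = namedtuple("Experience", ["s", "a", "r", "s_next"])

class EpisodeBuffer:
    """Collects the experiences of one episode, to be replayed when
    co-training the DQN and the RNN (an illustrative sketch)."""
    def __init__(self):
        self.data = []

    def add(self, s, a, r, s_next):
        self.data.append(Experience(s, a, r, s_next))

    def __len__(self):
        return len(self.data)

buf = EpisodeBuffer()
buf.add(s=1.0, a=2, r=1.0, s_next=1.5)
print(len(buf))  # 1
```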
<sec id="s2-3-1">
<title>2.3.1 DQN</title>
<p>The parameters of the Q network &#x3b8; are iteratively updated using <xref ref-type="disp-formula" rid="e9">Eqs. 9</xref>, <xref ref-type="disp-formula" rid="e10">10</xref> for action <inline-formula id="inf61">
<mml:math id="m69">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> taken in state <inline-formula id="inf62">
<mml:math id="m70">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, which results in reward <inline-formula id="inf63">
<mml:math id="m71">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="e9">
<mml:math id="m72">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mo>&#x2207;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
<disp-formula id="e10">
<mml:math id="m73">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:munder>
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mi>a</mml:mi>
</mml:munder>
<mml:mi>Q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
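A minimal numerical sketch of this update, with a linear Q-function standing in for the paper's fully connected DNN (the feature map, step size, and toy values below are our own illustrative choices):

```python
import numpy as np

def dqn_step(theta, s, a, r, s_next, features, alpha=0.01, gamma=0.9):
    """One semi-gradient update in the spirit of Eqs. 9-10, using a
    linear Q-function Q(s, a; theta) = theta[a] . phi(s)."""
    phi, phi_next = features(s), features(s_next)
    y = r + gamma * np.max(theta @ phi_next)   # target value (Eq. 10)
    td_error = y - theta[a] @ phi              # y - Q(s_t, a_t; theta_t)
    theta[a] += alpha * td_error * phi         # gradient step on Q (Eq. 9)
    return theta

# Hypothetical feature map and a four-action (four-circle) Q function.
features = lambda s: np.array([1.0, s])
theta = np.zeros((4, 2))
theta = dqn_step(theta, s=1.0, a=2, r=1.0, s_next=1.5, features=features)
print(theta[2])  # [0.01 0.01]
```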
</sec>
<sec id="s2-3-2">
<title>2.3.2 Recurrent Neural Network</title>
<p>In the RNN, both the recurrent weights and output weights were updated at every <inline-formula id="inf64">
<mml:math id="m74">
<mml:mrow>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>10</mml:mn>
<mml:mtext>&#x2009;ms</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, using the collected experiences. The recursive least squares (RLS) algorithm <xref ref-type="bibr" rid="B1">&#xc5;str&#xf6;m and Wittenmark (2013)</xref> is a recursive formulation of the least squares method. Given an input signal <inline-formula id="inf65">
<mml:math id="m75">
<mml:mrow>
<mml:msub>
<mml:mtext>x</mml:mtext>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>x</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>x</mml:mtext>
<mml:mtext>n</mml:mtext>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and the set of desired responses <inline-formula id="inf66">
<mml:math id="m76">
<mml:mrow>
<mml:msub>
<mml:mtext>y</mml:mtext>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>y</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>y</mml:mtext>
<mml:mtext>n</mml:mtext>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the RLS updates the parameters <inline-formula id="inf67">
<mml:math id="m77">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf68">
<mml:math id="m78">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> to minimize the squared error between the desired and the actual output of the RNN (which is the firing rate <inline-formula id="inf69">
<mml:math id="m79">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of the recurrent neuron). In the proposed architecture, we generate the desired response of recurrent neurons by adding a reward to the firing rate <inline-formula id="inf70">
<mml:math id="m80">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of neuron <italic>i</italic> at time <italic>t</italic> such that the desired firing rate decreases at time <italic>t</italic> if <inline-formula id="inf71">
<mml:math id="m81">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> and increases if <inline-formula id="inf72">
<mml:math id="m82">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3e;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The desired response of output neurons was generated by adding a reward to output activity <italic>z</italic>, as defined in <xref ref-type="disp-formula" rid="e7">Eq&#x20;7</xref>.</p>
<p>The error <inline-formula id="inf73">
<mml:math id="m83">
<mml:mrow>
<mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of recurrent neurons is computed using <xref ref-type="disp-formula" rid="e12">Eq 12</xref>, where <inline-formula id="inf74">
<mml:math id="m84">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the firing rate of neuron <italic>i</italic> at time <italic>t</italic>, and <inline-formula id="inf75">
<mml:math id="m85">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the reward received at time <italic>t</italic>. The desired signal <inline-formula id="inf76">
<mml:math id="m86">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is clipped between <inline-formula id="inf77">
<mml:math id="m87">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf78">
<mml:math id="m88">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> due to the high variance of the firing rate. The update of parameters <inline-formula id="inf79">
<mml:math id="m89">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is dictated by <xref ref-type="disp-formula" rid="e11">Eq 11</xref>, where <inline-formula id="inf80">
<mml:math id="m90">
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is the recurrent weight between the <inline-formula id="inf81">
<mml:math id="m91">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> neuron and the <inline-formula id="inf82">
<mml:math id="m92">
<mml:mrow>
<mml:msup>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> neuron. The exact values of <inline-formula id="inf83">
<mml:math id="m93">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mtext>&#x2009;and&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are shown in <xref ref-type="table" rid="T1">Table&#x20;1</xref>. <inline-formula id="inf84">
<mml:math id="m94">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf85">
<mml:math id="m95">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> act as clamping values for the desired output activity. Therefore, in this study, the value of <inline-formula id="inf86">
<mml:math id="m96">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> was chosen to be close to the positive threshold (&#x2b;0.5), and the value of <inline-formula id="inf87">
<mml:math id="m97">
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> was chosen to be close to the negative threshold (&#x2212;0.5). The parameter <inline-formula id="inf88">
<mml:math id="m98">
<mml:mrow>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> was set based on the existing population clock model research <xref ref-type="bibr" rid="B4">Buonomano and Maass (2009)</xref>; <xref ref-type="bibr" rid="B15">Laje and Buonomano (2013)</xref>.</p>
<p>In this study, we trained only a subset of recurrent neurons, which were randomly selected at the start of training. <inline-formula id="inf89">
<mml:math id="m99">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denotes this subset. For the experiments in this study, we selected <inline-formula id="inf90">
<mml:math id="m100">
<mml:mrow>
<mml:mn>30</mml:mn>
<mml:mtext>%</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> of the recurrent neurons for training. The square matrix <inline-formula id="inf91">
<mml:math id="m101">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> governs the learning rate of the recurrent neuron <italic>i</italic>, which is updated at every <inline-formula id="inf92">
<mml:math id="m102">
<mml:mrow>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> using <xref ref-type="disp-formula" rid="e13">Eq 13</xref>.<disp-formula id="e11">
<mml:math id="m103">
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>Rec</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msubsup>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>
<disp-formula id="e12">
<mml:math id="m104">
<mml:mrow>
<mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>max</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>min</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
<disp-formula id="e13">
<mml:math id="m105">
<mml:mrow>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>f</mml:mi>
<mml:msup>
<mml:mi>r</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:msup>
<mml:mi>r</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>&#x394;</mml:mtext>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>
</p>
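The inverse-correlation matrix update in Eq. 13 is the standard recursive least squares step used in FORCE-style training. The following numpy sketch is a minimal illustration, assuming fr(t) is a 1-D vector of presynaptic firing rates and that P is initialised to the identity (the initialisation scale is an assumption, not a value taken from this article):

```python
import numpy as np

def rls_update(P, fr):
    """One recursive least squares update of the inverse-correlation
    matrix (Eq. 13): P <- P - (P fr fr' P) / (1 + fr' P fr)."""
    Pfr = P @ fr                      # P(t - dt) fr(t)
    denom = 1.0 + fr @ Pfr            # 1 + fr'(t) P(t - dt) fr(t)
    return P - np.outer(Pfr, Pfr) / denom

# toy usage: 5 recurrent rates, P initialised to the identity
rng = np.random.default_rng(0)
P = np.eye(5)
fr = rng.standard_normal(5)
P = rls_update(P, fr)
```

Because the denominator is 1 plus a quadratic form in a positive-semidefinite P, the update never divides by zero, and P remains symmetric across updates.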
<p>The output weights <inline-formula id="inf93">
<mml:math id="m106">
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>Out</mml:mtext>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> (weight between recurrent neuron <italic>j</italic> and output neuron <italic>i</italic>) are also updated in a similar way; the error is calculated using <xref ref-type="disp-formula" rid="e14">Eq 14</xref> as follows:<disp-formula id="e14">
<mml:math id="m107">
<mml:mrow>
<mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>w</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>Z</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>
</p>
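Eqs. 12 and 14 share one pattern: the error is the current activity minus a reward-nudged target, clipped to a fixed range ([R_min, R_max] for recurrent units, [Z_min, Z_max] for the output). A minimal numpy sketch of that shared pattern (the function name and the bound values are illustrative, not from the article):

```python
import numpy as np

def reward_error(activity, reward, lo, hi):
    """Error of Eqs. 12 and 14: activity minus the clipped target
    max(lo, min(activity + reward, hi))."""
    target = np.clip(activity + reward, lo, hi)
    return activity - target

# recurrent-unit error (Eq. 12) for rates fr_i(t) and reward r_t
fr = np.array([0.2, 0.9, -0.5])
e_rec = reward_error(fr, reward=0.3, lo=-1.0, hi=1.0)

# output error (Eq. 14) for readout z(t)
e_out = reward_error(0.6, reward=0.5, lo=0.0, hi=1.0)  # -> -0.4
```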
</sec>
</sec>
</sec>
<sec id="s3">
<title>3 Experiments</title>
<sec id="s3-1">
<title>3.1 Different Scenarios</title>
<p>To assess the proficiency of this model, we trained and tested the agent on several scenarios with different time intervals and numbers of circles. We observed that the agent learned to produce a neural trajectory that peaked at the time-to-act intervals with near-perfect accuracy. <xref ref-type="fig" rid="F3">Figure&#x20;3</xref> shows the learned neural trajectories for several of the trained scenarios. The colored bars in <xref ref-type="fig" rid="F3">Figure&#x20;3</xref> indicate the correct time-to-act interval.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Trained activity of four different scenarios. Each scenario contains different times to act. Each colored bar represents the time-to-act interval. The orange line in each figure represents the threshold (&#x3d;0.5).</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g003.tif"/>
</fig>
<p>The proposed RNN training method exhibited some notable behavioral features, such as the following: 1) the agent learned to subdue its activity as soon as it observed a new state, analogous to restarting a clock, and 2) depending on the observed state, the agent learned to ramp its activity to peak at the time-to-act. We also observed that the agent could learn to do the same without training the recurrent weights (i.e.,&#x20;by only training the output weights <inline-formula id="inf94">
<mml:math id="m108">
<mml:mrow>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>). However, when a percentage of the recurrent neurons was also trained, the agent learned to produce the desired activity in fewer training episodes.</p>
</sec>
<sec id="s3-2">
<title>3.2 Temporal Scaling</title>
<p>Humans can execute actions such as speaking, writing, or playing music at different speeds. Temporal scaling is another feature we observed with our proposed method. A few studies have explored temporal scaling in humans <xref ref-type="bibr" rid="B9">Diedrichsen et&#x20;al. (2007)</xref>; <xref ref-type="bibr" rid="B7">Collier and Wright (1995)</xref>; in particular, <xref ref-type="bibr" rid="B12">Hardy et&#x20;al. (2018)</xref> modeled temporal scaling using an RNN and a supervised learning method. Their approach involved training the recurrent neurons using a second RNN that generates a target output for each recurrent neuron in the population; unfortunately, this is not feasible with an online learning algorithm such as reinforcement learning. To explore the possibility of temporal scaling with our method, we therefore trained the model using an additional speed input (shown in <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>), following the approach outlined in <xref ref-type="disp-formula" rid="e11">Eqs. 11</xref>, <xref ref-type="disp-formula" rid="e12">12</xref>, <xref ref-type="disp-formula" rid="e14">14</xref>. In this set-up, the RNN receives both a state input and a speed input. The speed input is a constant value supplied only while a state input is present; at all other times, it is zero. We trained the model with only one speed (<inline-formula id="inf95">
<mml:math id="m109">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and tested it at three different speeds: <inline-formula id="inf96">
<mml:math id="m110">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf97">
<mml:math id="m111">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.01</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf98">
<mml:math id="m112">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.8</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. <xref ref-type="fig" rid="F5">Figure&#x20;5</xref> shows the results. We observed that the shift in click time with respect to <inline-formula id="inf99">
<mml:math id="m113">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> could be defined using <xref ref-type="disp-formula" rid="e15">Eq 15</xref>. We used a similar procedure to that described in <xref ref-type="sec" rid="s2-3-2">Section 2.3.2</xref> to train for temporal scaling.<disp-formula id="e15">
<mml:math id="m114">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>/</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>t</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>200</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>
</p>
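Read literally, Eq. 15 shifts the click time by the speed ratio plus a constant offset of 200. The sketch below is a direct transcription of the equation exactly as stated; units of ms for the offset are an assumption:

```python
def scaled_click_time(click_time, speed, default_speed=1.0):
    """Shifted click time per Eq. 15; the constant 200 is taken
    verbatim from the equation (assumed to be in ms)."""
    return click_time + (speed / default_speed + 200)

# a click trained at 1,000 ms, replayed at test speed 0.8
t = scaled_click_time(1000, speed=0.8)  # 1000 + (0.8 + 200)
```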
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Speed and state inputs <bold>(A)</bold> to the RNN <bold>(B)</bold>.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g004.tif"/>
</fig>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>RNN activity with a training speed of 1 and test speeds of 0.01, 0.8, and 1.3. The colored bars indicate the expected time-to-act intervals.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g005.tif"/>
</fig>
</sec>
<sec id="s3-3">
<title>3.3 Learning to Plan Multiple Future Times-to-Act</title>
<p>One of the inherent properties of an RNN is that it can produce multiple peaks at different time points, even with only one input at the start of the trial. Results of the study by <xref ref-type="bibr" rid="B12">Hardy et&#x20;al. (2018)</xref> showed that the output of the RNN (trained using supervised learning) peaked at multiple time points given a single input of 250&#xa0;ms at the start of the trial. To understand whether an agent could learn to plan such multiple future times-to-act given one state using the proposed training, we trained an agent on a slightly modified task-switching scenario. Here, the agent needed to click on the first circle at three different time intervals, 400&#x2013;500&#xa0;ms, 1,000&#x2013;1,100&#xa0;ms, and 1,700&#x2013;1,800&#xa0;ms, and on the second circle at 2,300&#x2013;2,400&#xa0;ms. The first circle was set to deactivate at 1,801&#xa0;ms. At the first state <inline-formula id="inf100">
<mml:math id="m115">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the agent learned to produce a neural trajectory that peaked at the three intervals; after state <inline-formula id="inf101">
<mml:math id="m116">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the trajectory peaked once more at 2,300&#x2013;2,400&#xa0;ms, as shown in <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Multiple times to act. The state input <bold>(A)</bold> and output activity <bold>(B)</bold>, which peaks at three different intervals after state <inline-formula id="inf102">
<mml:math id="m117">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and at one interval after state <inline-formula id="inf103">
<mml:math id="m118">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. The colored bars indicate the correct time-to-act.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g006.tif"/>
</fig>
</sec>
<sec id="s3-4">
<title>3.4 Skip State Test</title>
<p>As seen in Experiment 3, the multiple peaks (multiple times-to-act) that the agent produced could stem from this inherent property of the RNN. In reinforcement learning, however, the peak at the time-to-act should be truly dependent on each input state while also leveraging the temporal properties of the RNN. Hence, to evaluate whether the learned network was truly state dependent, we tested it by skipping one of the input states. As <xref ref-type="fig" rid="F7">Figure&#x20;7</xref> shows, when the agent did not receive a state at 2,400&#xa0;ms, it did not choose to act during the 3,200&#x2013;3,300&#xa0;ms interval, showing that the learned time-to-act is truly state dependent.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Results of the skip state test. The top figures show the state input <bold>(left)</bold> and the corresponding RNN output <bold>(right)</bold>, where all states are present in the input. The bottom figures show the state input with the fourth skipped state <bold>(left)</bold>, which results in subdued output activity from 3,200 to 3,300&#xa0;ms <bold>(right)</bold>.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g007.tif"/>
</fig>
</sec>
<sec id="s3-5">
<title>3.5 Task Switching With 20 Tasks</title>
<p>To investigate the scalability of the proposed method to a relatively large state space, we trained and tested the model in a scenario consisting of 20 circles with 20 different times-to-act. <xref ref-type="fig" rid="F8">Figure&#x20;8</xref> demonstrates that the agent could indeed still learn the time-to-act with near-perfect accuracy.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>RNN output when trained on a scenario with 20 circles. The colored bars indicate the expected time-to-act.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g008.tif"/>
</fig>
</sec>
<sec id="s3-6">
<title>3.6 Memory Task</title>
<p>The above experiments show that the agent could learn and employ its time representation in multiple ways. We were also interested in how long the agent could remember a given input. To investigate this, we delayed the time-to-act until 2,000&#xa0;ms after the offset of the input and trained the agent. The trained agent remembered a state seen at 0&#x2013;20&#xa0;ms until 2,000&#xa0;ms (see <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>), as indicated by the peak in the output activity. We also trained the agent to remember a state for 3,000&#xa0;ms. With the current number of recurrent neurons (i.e.,&#x20;300), the agent was not able to remember for 3,000&#xa0;ms from the offset of an&#x20;input.</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>RNN output when trained on a scenario with two circles, where the first circle must be clicked after 2,000&#xa0;ms. The colored bars indicate the expected time-to-act.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g009.tif"/>
</fig>
</sec>
<sec id="s3-7">
<title>3.7 Shooting a Moving Target</title>
<p>Similar to the task-switching experiment, we trained the RL agent to learn &#x201c;when to act&#x201d; in a different scenario. In this scenario, the agent is rewarded for shooting a moving target: the bob of a moving damped pendulum. The length of the pendulum is 1&#xa0;m, and the mass of the bob is 1&#xa0;kg. We trained the DQN to select the direction of shooting and the RNN to learn the exact time to release the trigger. The agent was rewarded positively for hitting the bob within an error of 0.1&#xa0;m and negatively if it missed the target. The learned activity is shown in <xref ref-type="fig" rid="F10">Figure&#x20;10</xref>; the left shows the motion of the pendulum, and the right shows the learned RNN activity. The threshold in this experiment was 0.05, and the agent was able to hit the bob five times in 3,000&#xa0;ms. Although it remains unclear why the agent&#x2019;s activity did not peak from 0 to 1,500&#xa0;ms, the agent showed better performance after 1,500&#xa0;ms.</p>
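The target's motion can be reproduced with a simple damped-pendulum integrator. The sketch below uses semi-implicit Euler steps at the RNN's 1 ms resolution; the damping coefficient and initial angle are assumptions, since the article specifies only the 1 m length and 1 kg mass:

```python
import numpy as np

# Damped pendulum: theta'' = -(g/L) sin(theta) - (b/m) theta'
g, L, m, b = 9.81, 1.0, 1.0, 0.1   # b (damping) is an assumption
dt = 0.001                          # 1 ms steps
theta, omega = np.pi / 4, 0.0       # assumed initial state
trajectory = []
for _ in range(3000):               # one 3,000 ms episode
    alpha = -(g / L) * np.sin(theta) - (b / m) * omega
    omega += alpha * dt             # semi-implicit Euler
    theta += omega * dt
    trajectory.append(theta)
```

The bob's position for the shooting task then follows as (L sin θ, −L cos θ).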
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Left shows the pendulum scenario. The pendulum rod (the black line) is 1&#xa0;m long, and the bob (blue dot) weighs 1&#xa0;kg. Right shows the trained RNN activity.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g010.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Comparison With Long Short-Term Memory (LSTM) Network</title>
<p>A recent study by <xref ref-type="bibr" rid="B8">Deverett et&#x20;al. (2019)</xref> investigated the interval timing abilities of a reinforcement learning agent. In that study, an RL agent was trained to reproduce a given temporal interval; however, the time representation took the form of movement (or velocity) control. In other words, the agent had to move from one point to a goal point within the same interval as presented at the start of the experiment. Their LSTM-based agent performed the task with near-perfect accuracy, indicating that temporal properties can be learned using LSTM networks. Following these findings, our study investigates whether an agent can learn a direct representation of time (rather than an indirect one, such as velocity or acceleration) using an&#x20;LSTM.</p>
<p>To investigate this, we trained an RL agent with a single LSTM network as its DQN (no RNN was used in this test) on the same task-switching scenario. The input sequence for an RNN works in terms of <inline-formula id="inf104">
<mml:math id="m119">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (as shown in <xref ref-type="disp-formula" rid="e6">Eq 6</xref>), whereas the input for an LSTM works in terms of sequence length, as shown in <xref ref-type="fig" rid="F11">Figure&#x20;11</xref>. For example, a 3,000&#xa0;ms input signal can be fed to an RNN 1&#xa0;ms at a time, whereas for an LSTM, the same input must be divided into fixed-length sequences to effectively capture its temporal properties. Initially, we used an LSTM with 100 input nodes, giving the network 100&#xa0;ms of the signal at a time; the sequence length can, of course, be smaller than 100&#xa0;ms. In our experiments, we trained the agent with different sequence lengths (50, 100, 200, and 300&#xa0;ms), and the agent performed best with a sequence length of 300&#xa0;ms (results for 50, 100, and 200&#xa0;ms are given in the Appendix). The final architecture contained one LSTM layer with 256 hidden units, 300 input nodes, and two linear layers with 100 nodes each. The output size of the network was 300, which resulted in an activity of <italic>n</italic> points for a given input signal of <inline-formula id="inf105">
<mml:math id="m120">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>m</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The hidden states of the LSTM network were carried over throughout the episode.</p>
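The chunking described above, turning a 1 ms-resolution signal into fixed-length LSTM sequences, can be sketched as follows; zero-padding of a trailing remainder is an assumption, as the article does not specify how partial chunks are handled:

```python
import numpy as np

def chunk_signal(signal, seq_len=300):
    """Split a 1 ms-resolution signal into fixed-length sequences fed
    to the LSTM one after another; the remainder is zero-padded."""
    pad = (-len(signal)) % seq_len
    padded = np.concatenate([signal, np.zeros(pad)])
    return padded.reshape(-1, seq_len)

# a 3,000 ms signal becomes ten 300 ms sequences
chunks = chunk_signal(np.ones(3000), seq_len=300)  # shape (10, 300)
```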
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption>
<p>Input difference between the RNN and the LSTM network.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g011.tif"/>
</fig>
<p>The trained activity of the LSTM network is shown in <xref ref-type="fig" rid="F12">Figure&#x20;12</xref> (bottom), where the light blue region shows the output activity of the network and the colored bars indicate the correct time-to-act intervals for clicking each circle. The LSTM network did learn to exceed the threshold, indicating when to act, at a few time-to-act intervals. However, the network learned a periodic activity: it produced similar activity every 300&#xa0;ms.</p>
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption>
<p>Output activity of the trained LSTM network for a task-switching scenario containing four circles, with time-to-act intervals shown in colored&#x20;bars.</p>
</caption>
<graphic xlink:href="fcteg-02-722092-g012.tif"/>
</fig>
</sec>
<sec sec-type="discussion" id="s5">
<title>5 Discussion</title>
<p>In this study, we trained a reinforcement learning agent to learn &#x201c;when to act&#x201d; using an RNN and &#x201c;what to act&#x201d; using a DQN. We introduced a reward-based recursive least squares algorithm to train the RNN. By disentangling the learning of the temporal and spatial aspects of action into independent tasks, we aimed to understand explicit time representation in an RL agent. Through this strategy, the agent learned to create its own representation of time. Our experiments, which employed a peak-interval-style procedure, show that the agent could learn to produce a neural trajectory that peaked at the time-to-act with near-perfect accuracy. We also observed several other intriguing behaviors.<list list-type="simple">
<list-item>
<p>&#x2022; The agent learned to subdue its activity immediately after observing a new state. We interpreted this as the agent restarting its&#x20;clock.</p>
</list-item>
<list-item>
<p>&#x2022; The agent was able to temporally scale its actions under the proposed learning method. Even though we trained the agent with a single speed value (<inline-formula id="inf106">
<mml:math id="m121">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), it learned to temporally scale its action to speeds that were both lower (<inline-formula id="inf107">
<mml:math id="m122">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.01</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and higher (<inline-formula id="inf108">
<mml:math id="m123">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) than the trained speed. Notably, the agent was not able to scale its actions beyond <inline-formula id="inf109">
<mml:math id="m124">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</list-item>
<list-item>
<p>&#x2022; We observed that neural networks such as the LSTM might not be able to learn an explicit representation of time when compared with population clock models. <xref ref-type="bibr" rid="B8">Deverett et&#x20;al. (2019)</xref> showed that an RL agent can scale its actions (increase or decrease the velocity) using an LSTM network. However, when we trained the LSTM network to learn a direct representation of time, it learned only periodic activity.</p>
</list-item>
<list-item>
<p>&#x2022; In this study, we trained an RL agent in an environment similar to task switching: shooting a moving target. The target was the bob of a damped pendulum with a length of 1&#xa0;m and a mass of 1&#xa0;kg. The agent was able to hit the fast-moving bob by learning to shoot at a few near-accurate time points.</p>
</list-item>
</list>
</p>
</sec>
</body>
<back>
<sec id="s6">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ack>
<p>This work was supported in part by the Australian Research Council (ARC) under discovery grant DP180100656 and DP210101093. Research was also sponsored in part by the Australia Defence Innovation Hub under Contract No. P18-650825, US Office of Naval Research Global under Cooperative Agreement Number ONRG &#x2010; NICOP &#x2010; N62909&#x2010;19&#x2010;1&#x2010;2058, and AFOSR &#x2012; DST Australian Autonomy Initiative agreement ID10134. We also thank the NSW Defence Innovation Network and NSW State Government of Australia for financial support in part of this research through grant DINPP2019 S1&#x2010;03/09 and PP21&#x2010;22.03.02.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>&#xc5;str&#xf6;m</surname>
<given-names>K. J.</given-names>
</name>
<name>
<surname>Wittenmark</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Computer-controlled Systems: Theory and Design</source>. <publisher-loc>Englewood Cliffs, NJ</publisher-loc>: <publisher-name>Courier Corporation</publisher-name>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bakker</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2002</year>). &#x201c;<article-title>Reinforcement Learning with Long Short-Term Memory</article-title>,&#x201d; in <conf-name>Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic</conf-name>, <conf-loc>Vancouver, Canada</conf-loc>, <fpage>1475</fpage>&#x2013;<lpage>1482</lpage>. </citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Buonomano</surname>
<given-names>D. V.</given-names>
</name>
<name>
<surname>Laje</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2011</year>). &#x201c;<article-title>Population Clocks</article-title>,&#x201d; in <source>Space, Time And Number In the Brain</source> (<publisher-name>Elsevier</publisher-name>), <fpage>71</fpage>&#x2013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1016/b978-0-12-385948-8.00006-2</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buonomano</surname>
<given-names>D. V.</given-names>
</name>
<name>
<surname>Maass</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>State-dependent Computations: Spatiotemporal Processing in Cortical Networks</article-title>. <source>Nat. Rev. Neurosci.</source> <volume>10</volume>, <fpage>113</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1038/nrn2558</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Carrara</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Leurent</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Laroche</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Urvoy</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Maillard</surname>
<given-names>O. A.</given-names>
</name>
<name>
<surname>Pietquin</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Budgeted Reinforcement Learning in Continuous State Space</article-title>,&#x201d; in <conf-name>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</conf-name>, <conf-loc>Vancouver, BC</conf-loc>, <conf-date>December 8&#x2013;14, 2019</conf-date>, <fpage>9295</fpage>&#x2013;<lpage>9305</lpage>. </citation>
</ref>
<ref id="B6">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gulcehre</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling</source> in <conf-name>NIPS 2014 Workshop on Deep Learning</conf-name>, <conf-loc>Quebec, Canada</conf-loc>, <conf-date>December, 2014</conf-date>. <comment>preprint arXiv:1412.3555</comment>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Collier</surname>
<given-names>G. L.</given-names>
</name>
<name>
<surname>Wright</surname>
<given-names>C. E.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Temporal Rescaling of Simple and Complex Ratios in Rhythmic Tapping</article-title>. <source>J.&#x20;Exp. Psychol. Hum. Perception Perform.</source> <volume>21</volume>, <fpage>602</fpage>&#x2013;<lpage>627</lpage>. <pub-id pub-id-type="doi">10.1037/0096-1523.21.3.602</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Deverett</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Faulkner</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Fortunato</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wayne</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Leibo</surname>
<given-names>J.&#x20;Z.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Interval Timing in Deep Reinforcement Learning Agents</article-title>,&#x201d; in <conf-name>33rd Conference on Neural Information Processing Systems (NeurIPS 2019)</conf-name>, <conf-loc>Vancouver, Canada</conf-loc>, <fpage>6689</fpage>&#x2013;<lpage>6698</lpage>. </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Diedrichsen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Criscimagna-Hemminger</surname>
<given-names>S. E.</given-names>
</name>
<name>
<surname>Shadmehr</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Dissociating Timing and Coordination as Functions of the Cerebellum</article-title>. <source>J.&#x20;Neurosci.</source> <volume>27</volume>, <fpage>6291</fpage>&#x2013;<lpage>6301</lpage>. <pub-id pub-id-type="doi">10.1523/jneurosci.0061-07.2007</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doya</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Reinforcement Learning in Continuous Time and Space</article-title>. <source>Neural Comput.</source> <volume>12</volume>, <fpage>219</fpage>&#x2013;<lpage>245</lpage>. <pub-id pub-id-type="doi">10.1162/089976600300015961</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Durstewitz</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Self-organizing Neural Integrator Predicts Interval Times through Climbing Activity</article-title>. <source>J.&#x20;Neurosci.</source> <volume>23</volume>, <fpage>5342</fpage>&#x2013;<lpage>5353</lpage>. <pub-id pub-id-type="doi">10.1523/jneurosci.23-12-05342.2003</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hardy</surname>
<given-names>N. F.</given-names>
</name>
<name>
<surname>Goudar</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Romero-Sosa</surname>
<given-names>J.&#x20;L.</given-names>
</name>
<name>
<surname>Buonomano</surname>
<given-names>D. V.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>A Model of Temporal Scaling Correctly Predicts that Motor Timing Improves with Speed</article-title>. <source>Nat. Commun.</source> <volume>9</volume>, <fpage>4732</fpage>&#x2013;<lpage>4814</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-07161-6</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hochreiter</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Long Short-Term Memory</article-title>. <source>Neural Comput.</source> <volume>9</volume>, <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klapproth</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Time and Decision Making in Humans</article-title>. <source>Cogn. Affect. Behav. Neurosci.</source> <volume>8</volume>, <fpage>509</fpage>&#x2013;<lpage>524</lpage>. <pub-id pub-id-type="doi">10.3758/cabn.8.4.509</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Laje</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Buonomano</surname>
<given-names>D. V.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Robust Timing and Motor Patterns by Taming Chaos in Recurrent Neural Networks</article-title>. <source>Nat. Neurosci.</source> <volume>16</volume>, <fpage>925</fpage>&#x2013;<lpage>933</lpage>. <pub-id pub-id-type="doi">10.1038/nn.3405</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ge</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Multilayer Formation Control of Multi-Agent Systems</article-title>. <source>Automatica</source> <volume>109</volume>, <fpage>108558</fpage>. <pub-id pub-id-type="doi">10.1016/j.automatica.2019.108558</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lillicrap</surname>
<given-names>T. P.</given-names>
</name>
<name>
<surname>Hunt</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Pritzel</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Heess</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Erez</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tassa</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Continuous Control with Deep Reinforcement Learning</article-title>. <conf-name>4th International Conference on Learning Representations (ICLR)</conf-name>, <conf-loc>San Juan, Puerto Rico</conf-loc>, <conf-date>May 2&#x2013;4, 2016</conf-date>. <comment>preprint arXiv:1509.02971</comment>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matell</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>Meck</surname>
<given-names>W. H.</given-names>
</name>
<name>
<surname>Nicolelis</surname>
<given-names>M. A. L.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Interval Timing and the Encoding of Signal Duration by Ensembles of Cortical and Striatal Neurons</article-title>. <source>Behav. Neurosci.</source> <volume>117</volume>, <fpage>760</fpage>&#x2013;<lpage>773</lpage>. <pub-id pub-id-type="doi">10.1037/0735-7044.117.4.760</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miall</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>1989</year>). <article-title>The Storage of Time Intervals Using Oscillating Neurons</article-title>. <source>Neural Comput.</source> <volume>1</volume>, <fpage>359</fpage>&#x2013;<lpage>371</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1989.1.3.359</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mnih</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Graves</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Antonoglou</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Wierstra</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). <source>Playing Atari with Deep Reinforcement Learning</source>. <publisher-name>arXiv</publisher-name>. <comment>preprint arXiv:1312.5602</comment>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mnih</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Rusu</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Veness</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bellemare</surname>
<given-names>M. G.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Human-level Control through Deep Reinforcement Learning</article-title>. <source>Nature</source> <volume>518</volume>, <fpage>529</fpage>&#x2013;<lpage>533</lpage>. <pub-id pub-id-type="doi">10.1038/nature14236</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oh</surname>
<given-names>K.-K.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>M.-C.</given-names>
</name>
<name>
<surname>Ahn</surname>
<given-names>H.-S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A Survey of Multi-Agent Formation Control</article-title>. <source>Automatica</source> <volume>53</volume>, <fpage>424</fpage>&#x2013;<lpage>440</lpage>. <pub-id pub-id-type="doi">10.1016/j.automatica.2014.10.022</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Petter</surname>
<given-names>E. A.</given-names>
</name>
<name>
<surname>Gershman</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Meck</surname>
<given-names>W. H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Integrating Models of Interval Timing and Reinforcement Learning</article-title>. <source>Trends Cogn. Sci.</source> <volume>22</volume>, <fpage>911</fpage>&#x2013;<lpage>922</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2018.08.004</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Schrittwieser</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Antonoglou</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Guez</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Mastering the Game of Go without Human Knowledge</article-title>. <source>Nature</source> <volume>550</volume>, <fpage>354</fpage>&#x2013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1038/nature24270</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simen</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Balci</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>deSouza</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>J.&#x20;D.</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>A Model of Interval Timing by Neural Integration</article-title>. <source>J.&#x20;Neurosci.</source> <volume>31</volume>, <fpage>9238</fpage>&#x2013;<lpage>9253</lpage>. <pub-id pub-id-type="doi">10.1523/jneurosci.3121-10.2011</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sompolinsky</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Crisanti</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sommers</surname>
<given-names>H.-J.</given-names>
</name>
</person-group> (<year>1988</year>). <article-title>Chaos in Random Neural Networks</article-title>. <source>Phys. Rev. Lett.</source> <volume>61</volume>, <fpage>259</fpage>. <pub-id pub-id-type="doi">10.1103/physrevlett.61.259</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tallec</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Blier</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ollivier</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Making Deep Q-Learning Methods Robust to Time Discretization</source>. <conf-name>International Conference on Machine Learning (ICML)</conf-name>, <conf-loc>Long Beach</conf-loc>. <publisher-name>arXiv</publisher-name>. <comment>preprint arXiv:1901.09732</comment>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinyals</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Babuschkin</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Czarnecki</surname>
<given-names>W. M.</given-names>
</name>
<name>
<surname>Mathieu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dudzik</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Junyoung</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning</article-title>. <source>Nature</source> <volume>575</volume>, <fpage>350</fpage>&#x2013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-019-1724-z</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Watkins</surname>
<given-names>C. J.</given-names>
</name>
<name>
<surname>Dayan</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>1992</year>). <article-title>Q-learning</article-title>. <source>Mach. Learn.</source> <volume>8</volume>, <fpage>279</fpage>&#x2013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1023/a:1022676722315</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xue</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Formation Control of Multi-Agent Systems with Stochastic Switching Topology and Time-Varying Communication Delays</article-title>. <source>IET Control. Theor. Appl.</source> <volume>7</volume>, <fpage>1689</fpage>&#x2013;<lpage>1698</lpage>. <pub-id pub-id-type="doi">10.1049/iet-cta.2011.0325</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>