Multiple-Timescale Neural Networks: Generation of History-Dependent Sequences and Inference Through Autonomous Bifurcations

Sequential transitions between metastable states are ubiquitously observed in the neural system and underlying various cognitive functions such as perception and decision making. Although a number of studies with asymmetric Hebbian connectivity have investigated how such sequences are generated, the focused sequences are simple Markov ones. On the other hand, fine recurrent neural networks trained with supervised machine learning methods can generate complex non-Markov sequences, but these sequences are vulnerable against perturbations and such learning methods are biologically implausible. How stable and complex sequences are generated in the neural system still remains unclear. We have developed a neural network with fast and slow dynamics, which are inspired by the hierarchy of timescales on neural activities in the cortex. The slow dynamics store the history of inputs and outputs and affect the fast dynamics depending on the stored history. We show that the learning rule that requires only local information can form the network generating the complex and robust sequences in the fast dynamics. The slow dynamics work as bifurcation parameters for the fast one, wherein they stabilize the next pattern of the sequence before the current pattern is destabilized depending on the previous patterns. This co-existence period leads to the stable transition between the current and the next pattern in the non-Markov sequence. We further find that timescale balance is critical to the co-existence period. Our study provides a novel mechanism generating robust complex sequences with multiple timescales. Considering the multiple timescales are widely observed, the mechanism advances our understanding of temporal processing in the neural system.

Despite the great success of these studies, however, some fundamental questions remain unanswered. In models that generate sequential metastable states, a transition between these states is embedded rigidly into the connectivity (i.e., the correlation between the current to the next pattern), resulting in successive patterns being determined by the immediately preceding pattern. Hence, the generation of sequences depending on the long history of the previous patterns is not possible. On the other hand, RNNs trained with machine learning methods allow for generating complex sequences dependent on history. The training methods require non-local information and have to retain the information until the sequence finishes, which is not biologically plausible. In addition, the formed sequences are vulnerable to noise or perturbation to the initial state (Laje and Buonomano, 2013).
To resolve these unanswered questions, we introduce a neural network model with slow and fast neurons that can learn the history-dependent sequences and connect the sequences. The fast neural dynamics generate patterns in response to an external input with the feedback from the slow dynamics. The slow dynamics store the history of the inputs via the fast dynamics, and feed the stored information back to the fast, as shown in Figure 1A. By this model, we provide a novel framework in temporal processing in the neural system in which the slow dynamics control successive bifurcations of fixed points of fast dynamics, based on the stored history of previous patterns and inputs. By adopting a biologically plausible learning rule based solely on the correlation between the pre-and post-synaptic neural activities as introduced previously Kaneko, 2013, 2016;Kurikawa et al., 2020), we demonstrate that our model with the fast and slow neural dynamics memorizes the history-dependent sequences and enables inference based on them.
Multiple-timescale neural dynamics are observed across cortical areas (Honey et al., 2012;Murray et al., 2014;Chaudhuri et al., 2015;Hasson et al., 2015). Neural activities in lower sensory cortices change in a faster timescale and respond instantaneously to stimuli, whereas those in higher association cortices change in a slower timescale and integrate information over longer periods. Cooperations between the higher and lower cortices are necessary to process the temporal information.
Some model studies focused on the multiple-timescale dynamics (Kiebel et al., 2008;Yamashita and Tani, 2008) and showed that their models generate history-dependent sequences. These models, however, adopted machine learning methods, and thus, biological plausibility is hard to be assured. In contrast, our model proposes a biologically plausible mechanism in which the higher cortices regulate the lower ones to generate complex sequences.
In the following, we focus on two basic aspects of neural sequences in temporal information processing; context (history)dependent sequences and inference. In the context-dependent working memory task (Mante et al., 2013;Stokes et al., 2013), distinct sequences of neural patterns are evoked by identical stimuli depending on the preceding context signals. Second, in this study, inference is defined as the ability to make appropriate responses against a new environment by using previously learned examples. For instance (Jones et al., 2012;Wikenheiser and Schoenbaum, 2016), consider a rat learning successive stimuli, A followed by B, and then reward C. After changing the environment, the rat is required to learn a new combination of stimuli, A' followed by B. In this situation, the rat is able to infer that stimuli A' causes the reward C via B. Neural activities reflecting this cognitive function should show sequential patterns A'BC even after learning only A'B. After showing basic behaviors in our model, we demonstrate that how such a context-dependent sequence is generated and how inference is executed.

Neural Model
We consider learning of K sequences, each of which contains M patterns, with K input patterns. We denote the µ-th targeted pattern in the α-th sequence as ξ α µ , and the corresponding input as η α for µ = 1, 2, · · · , M over the inputs α = 1, · · · , K. Figure 1A illustrates the case with K = 2 and M = 3: In this case, a given sequence ξ α 1 ,ξ α 2 ,ξ α 3 (α = 1, 2) should be generated upon a given corresponding input η α . Generally, a pattern to be generated next is determined not only by the current pattern but also by earlier patterns. Thus, a network has to retain the history of previous patterns to generate a sequence correctly.
To achieve this, we built a two-population model with different timescales, one with N fast neurons and one with N slow neurons, denoted as X and Y, respectively. X receives an external FIGURE 1 | (A) Schematic diagram of the proposed model for two sequences (K = 2) and three patterns (M = 3). (B) Neural dynamics during the learning process of three targets. Top: the time series of one of the fast variables x 0 (solid line) and the corresponding slow variable y 0 (broken line) during the learning process. Bottom: m x 1,2,3 , overlaps of x with ξ 1 1 (blue), ξ 1 2 (orange), and ξ 1 3 (green). The black line represents the overlap between x and y denoted as m xy . The bars above the panels indicate the targeted patterns given to the network in corresponding periods. (C) The fraction of successful recalls is plotted as a function of M for K = 1, 2. It is averaged over 50 realizations (10 networks and five pairs of the target and input patterns for each network). Here, a successful recall is defined as the case in which all K × M targets are sequentially generated in the correct order in the presence of the corresponding inputs.
input, and Y receives the output from X and provides input to X, as shown in Figure 1A. The neural activities x i in X and y i in Y evolve according to the following equation: where u i = N j =i J X ij x j ; r i = N j J XY ij tanh(y j ). J X ij is a recurrent connection from the j-th to the i-th neuron in X, and J XY ij is a connection from the j-th neuron in Y to the i-th neuron in X. J X is a fully connected network without self connections. It is modified during the learning process as described in the following subsection "Learning model" and initialized with the binary values P[J X ij = ±(N − 1) −1/2 ] = 1/2. The diagonal entries of J X are kept at zero during the entire learning process. J XY , in contrast, is a non-plastic sparse network; P(J XY ij = ±cN −1/2 ) = ρ and P(J XY ij = 0) = 1 − 2ρ. X is required to generate the pattern ξ α µ in the presence of η α , i.e., an attractor that matches ξ α µ is generated under η α . The i-th element of a targeted pattern denoted as (ξ α µ ) i , is assigned to the i-th neuron in X, and randomly sampled according to the probability P[(ξ α µ ) i = ±1] = 1/2. The input (η α ) i is injected to the i-th neuron in X, randomly sampled according to P[(η α ) i = ±1] = 1/2. ξ and η are the same dimensional vectors as the fast dynamics, i.e., N-dimensional vectors. We set N = 100, β x = 2, β y = 20, τ x = 1, τ y = 100, ρ = 0.05, and c = 7. The dependence of the performance on these parameters are shown in the Supplementary Materials, while the details for the parameter setting are described.
In Equation (3), we implement the non-linear activation at the apical dendrites (Larkum et al., 2009). We assumed that the input from Y to X innervated on the apical dendrites of neurons in X, which is consistent with observations that the feedback inputs from the higher cortical areas innervate on the apical dendrites of layer 5 pyramidal neurons (Larkum, 2013), whereas the recurrent input from X to X was assumed to innervate the proximal dendrites. The synaptic inputs to the apical dendrites are integrated and evoke the calcium spike when the integrated input exceeds the threshold of spikes (refer to the detail of this type of spike in Larkum et al., 2009). To reproduce this information processing at the apical dendrites, we used two nonlinear filters by the hyperbolic tangent function for the input from Y to X. First, by adopting tanh(y j ), the activity of the j-th neuron in Y is amplified in a nonlinear way at a synapse onto the neuron x i . Second, tanh(r i ) represents calcium spike at the branching point of the tuft dendrite. Even if these hyperbolic tangent functions are not leveraged in Equation (3), the behavior of the model is not changed qualitatively, although the performance of the model is reduced.

Learning Model
Only J X changes to generate the target according to the following equation: where τ syn is the learning speed (set to 100). This learning rule comprises a combination of a Hebbian term between the target and the presynaptic neuron, and an anti-Hebbian term between the pre-and post-synaptic neurons with a decay term u i J X ij for normalization 1 . This form satisfies locality across connections and is biologically plausible (Kurikawa et al., 2020). We previously applied this learning rule to a single network of X and demonstrated that the network learns K maps between 1 According to Eq. (4), d( j =i (J X ij ) 2 )/dt ∝ (1 − j =i (J X ij ) 2 ).
inputs and targets, i.e., M = 1 Kaneko, 2013, 2016;Kurikawa et al., 2020). However, in that case, generating a sequence (M ≥ 2) was not possible. In the present study, there are two inputs for X, one from input η and one from Y that stores previous information. Thus, the network can generate a pattern depending not only on the present input pattern, but also on the previous patterns.

Learning Procedure
In our model, the patterns in the sequence are learned sequentially. A learning step of a single pattern is accomplished when the neural dynamics satisfy the following two criteria: x sufficiently approaches the target pattern, i.e., m x µ ≡ i x i (ξ 1 µ ) i /N > 0.85, and y is sufficiently close to x, i.e., i x i y i /N > 0.5. After the completion of one learning step, a new pattern ξ 1 2 is presented instead of ξ 1 1 with a perturbation of fast variables x i , by multiplying a random number uniformly sampled from zero to one. We execute these steps sequentially from µ = 1 to M to learn a sequence once, denoted as one epoch of the learning. Before finishing the learning process, this procedure is repeated 20 times (i.e., 20 epochs). The second criterion for terminating the learning step is introduced for memorizing the sequences, especially the history-dependent sequences. Further, the value 0.5 of this criterion must take an intermediate value. If this criterion is not adopted or this criterion value is small, the target pattern is switched as soon as x is close to the target during the learning process. At this time, y is far from x because y is much slower than x. In this case, y cannot store any information about x. On the other side, when the value is close to unity, y matches x and y can store only the present x. In both cases, y cannot store the history of x.

Inference Task
In the inference task, we present sequentially different inputs in a sequence, whereas a single input is applied for a sequence in other tasks. We include the super-and sub-scripts in the notation of η as η α µ that represents the µ-th input pattern in the α-th sequence. In this task, a network learns three sequences:(S, A, B, C), (S, A ′ , B, C), and (D). The former two sequences are used for inference, whereas the last one is a distractor to prevent the over-stability of the other two sequences. First, in the learning process, the network learns (S, A, B, C) and (D). Second, it learns (S, A ′ , B) after training is completed. Then, we examine if the network generates (S, A ′ , B, C), implying inference from B to C.
For the first sequence (S, A, B, C), we apply η 1 1 = s for the target S, η 1 2 = a for A, and η 1 3 = η 1 4 = b for B and C. For the second sequence (A ′ , B, C), we apply η 1 1 = s for the target S, η 2 2 = a ′ for A ′ , and η 2 3 = b for B. All the targets and inputs are randomly sampled according to the probability P[(ξ α µ ) i = ±1] = P[(η α µ ) i = ±1] = 1/2. Because, in this task, the input pattern is changed in a single sequence, we apply the input and target over 100 unit times and change them to the next in the sequence through the learning process. In the recall process, we regularly change the input pattern every 100 unit times, independently of the value of the neural activities.

Principal Component Analysis (PCA)
We analyzed neural trajectories by using mainly PCA in Figures 2, 5-7 and Supplementary Figure S5. The N × T dimensional neural data of x is used for the PCA. In this study, T is the duration time to analyze neural dynamics multiplied by the sampling number of x per unit time (In this case, 20). For analysis of y, we also used PCs obtained by x.

Calculation of Success Rate in the Inference Task
To compute the success rate of the generation of the subsequence (B, C) under input b, we first identify the sequence of the patterns from the continuous neural activity by setting a threshold for the overlap value at 0.7. In the recall process, since the overlap with either (or a few) of the targets S, A, B, C, D, A ′ is selectively high most of the time, the sequences composed of some of the patterns S, A, B, C, D, A ′ are obtained for each and (s, v, b, b). In this study, v is a random pattern that is not used for learning. We generate these sequential patterns starting from 20 initial states for each of 100 network realizations and use the subpart of them in the presence of b to calculate the success rate of the sub-sequence (B, C). In this study, 100 network realizations are obtained by generating J X , J XY , ξ , and η 100 times according to independent and identically probability distributions as described in the "Neural model" in this section.
We test significant differences in the success rates of the generation of the sub-sequence (B, C) for different input sequences by Wilcoxon's signed-rank test. The success rates for (s, a ′ , b) and (s, v, b) are calculated from 20 sequences for each network realization. Hundred samples of two related paired success rates are obtained and used in Wilcoxon's signed-rank test.

RESULTS
Before exploring the history-dependent sequence, we analyzed if our learning rule generates simple sequences, namely, sequences in which the successive pattern is determined solely by the current pattern. Figure 1B shows a sample learning process for K = 1. We applied η 1 to a network and presented ξ 1 1 as the first pattern of a target sequence. After the transient time, x converges to ξ 1 1 due to synaptic change. y follows x according to Equation 2 and, consequently, moves to the target.
We present an example of a recall process after the learning process for (K, M) = (1, 3) in Figure 2A. In recall, the connectivity is not changed. The initial states of the fast variables are set at random values sampled from a uniform distribution of -1 to 1. The slow variables are set at values of their final states in the learning process in order for the network to generate the sequence starting from ξ 1 1 . (If the slow variables are set randomly in a similar manner as the fast variables, the network can still generate the sequence, but the first pattern of the sequence is not ξ 1 1 ). The targets appear sequentially in X in order. Note that in the recall process, the transition occurs spontaneously without any external operation. We explored the success rate of the learning and found that increasing M and K generally leads to a decrease in the success rate of recalls. For N = 100 and K = 1, the success rate is over 80% for M = 1 up to 11, and decreases beyond M = 12. For K = 2, the success rate is approximately 80% for M = 3 and decreases gradually as M increases ( Figure 1C, refer to the Supplementary Material for detailed results). Furthermore, we investigated how the balance between the timescale of the slow variables τ y and that of learning τ syn affect the success rate.
Next, the spontaneous activities without the input are analyzed. Figure 3A exemplifies the characteristic behavior of the complex spontaneous dynamics: fast oscillating activities and slowly varying ones appear alternatively. In the period of the oscillating activities, the memorized patterns are activated sequentially, which are not all memorized patterns, but their subsets, as shown in Figures 3B-D. Different subsets of patterns appear intermittently. For instance, (ξ 1 , ξ 2 , ξ 5 ), (ξ 3 , ξ 4 , ξ 5 ), and (ξ 1 , ξ 4 , ξ 5 ) are observed in Figures 3B-D, respectively. In contrast, in the period of the slowly varying activities, one or few patterns are stable for a while and then are collapsed.

Bifurcations of Fast Neural Dynamics
To elucidate how such a recall is possible, we analyzed the phase space of x with y quenched. In other words, y is regarded as bifurcation parameters for the fast dynamics. Specifically, we focused on the neural dynamics for 200 ≤ t ≤ 500, as shown in Figure 2A. In this period, the fast dynamics show transitions from ξ 1 1 to ξ 1 2 at t = 290, from ξ 1 2 to ξ 1 3 at t = 375, and from ξ 1 3 to ξ 1 1 at t = 220, 460. We sampled the slow variables every five units of time from t = 200 to 500, y t=200 , y t=205 , · · · , y t=500 , along the trajectory, and analyzed the dynamics of x with the slow variables quenched at each sampled y t=200,205,··· ,500 . Figure 2B shows the bifurcation diagram of x against the change in y, and Figure 2C shows the trajectories of x for specific y.
We now consider the neural dynamics for y t=225 , just after the transition from ξ 1 3 to ξ 1 1 [ Figure 2C (i)]. For this y, a single fixed point corresponding to the present pattern (ξ 1 1 ) exists, leading to its stability against noise. As y is changed, the basin of ξ 1 1 shrinks, while a fixed point corresponding to the next target ξ 1 2 appears, and its basin expands 2 , as shown in Figure 2C (ii). At y t=290 , the fixed point ξ 1 1 becomes unstable. Thus, the neural state x at ξ 1 1 goes out of there, and falls on ξ 1 2 , i.e., a transition occurs. With a further shift of y, y t=295,300,··· , a regime of coexistence of ξ 1 2 and ξ 1 3 with large basins appears [ Figure 2C (iii)]. The basin of the attractor ξ 1 2 shrinks and vanishes [ Figure 2C (iv)], and the transition from ξ 1 2 to ξ 1 3 occurs at t = 375. The next transition from ξ 1 3 to ξ 1 1 occurs in the same manner at t = 460. These processes provide the mechanism for robust sequential recall: fixed points x of the current and successive targets coexist, and then, the current target becomes unstable when the slow variables change. To examine the robustness of the recall, we explored trajectories from different initial conditions with Gaussian white noise with strength s (refer to Supplementary Material for details). All of these trajectories converge correctly to a target sequence after some transient period for weak noise. By increasing the noise strength, the recall performance of noisy dynamics is made equal to that of the noiseless dynamics up to noise strength s = 0.3. For stronger noise, the duration of residence at the target is decreased, because the neural state of x is kicked out of the target earlier than in the noiseless case. Even upon applying a strong and instantaneous perturbation to both x and y, the trajectory recovers the correct sequence. The sequence is represented as a limit cycle containing x and y and, thus, is recalled robustly.

Inference by Concatenation
Next, we test if our model flexibly infers new sequences based on the previously learned sequence. To this end, we consider the following task (Refer to Materials and methods for details). First, a network learns a sequence (ξ 1 2 , ξ 1 3 , ξ 1 4 ) = (A, B, C) in response to the associated input sequences (η 1 2 , η 1 3 , η 1 4 ) = (a, b, b). In addition, we provide η 1 1 = s associated with ξ 1 1 = S preceding the sequence as a fixation cue and the response to it (e. g., the subject's gaze to the fixation point), respectively. After the learning is completed, the network should generate the sequence (S, A, B, C) in response to the input sequence (s, a, b, b). Then, the network learns a new sequence (ξ 2 1 , ξ 2 2 , ξ 2 3 ) = (S, A ′ , B), which is associated with (η 2 1 , η 2 2 , η 2 3 ) = (s, a ′ , b). In this study, we intend to examine if the inference of C is achieved selectively from (S, A ′ , B) in the presence of b. For it, we need to prevent the tight and trivial association between input b and the sub-sequence (B, C). For this purpose, the network is also postulated to learn the association between η 3 1 = b and ξ 3 1 = D as a distractor. We explore if the network generates the sequence (S, A ′ , B, C) in response to the input sequence (s, a ′ , b, b) after learning the association between (S, A ′ , B) and (s, a ′ , b).
During the learning of the first sequence, the overlaps with all of the targets reach more than 0.9 after 20 epochs of learning, as shown in Supplementary Figure S4A. Actually, Figure 4A shows that the first sequence is successfully generated. Next, the network learns the new sequential patterns (S, A ′ , B). If the network infers (S, A ′ , B, C) by using the already learned sub-sequence (B, C), it generates the sequence (S, A ′ , B, C) after learning only (S, A ′ , B) without (S, A ′ , B, C). As expected, the average overlaps with all of the targets in the second sequence (not only A ′ ,B,but also C) are increased through learning (Figure 4B). After the fifth epoch of learning, the overlap with C declines, whereas those with A ′ and B continuously increase. As an example, we plot the recall dynamics after learning A ′ and B in Figure 4C. A ′ evokes B and C, although the overlap with the first target A ′ is not large. Simultaneously, the network generates the first sequence (Supplementary Figure S4C). Note that the sub-sequence (B, C) is generated selectively by the input either (s, a, b, b) or (s, a ′ , b, b). In contrast, when a random input v is given instead of a or a ′ in an input sequence, the sub-sequence (B, C)is not evoked, as shown in Supplementary Figure S4B. To examine the difference among the recall dynamics in response to (s, a, b), (s, a ′ , b), and (s, v, b), the success rate of generating the sub-sequence (refer to its definition in Materials and methods) is analyzed statistically in Figure 4D. We found that the input sequences (s, a, b), (s, a ′ , b) evoke the sub-sequence (B, C) with a significantly higher rate than the input sequence (s, v, b). Thus, our model is able to infer a new sequence based on the previously learned sequence.

Learning of History-Dependent Sequences
We examined if the proposed model learns the historydependent sequence (M = 6), in which the same patterns exist in a sequence such as (ξ 1 1 , ξ 1 2 , · · · , ξ 1 6 ) = (A, B, C, D, B, E). The patterns succeeding B are C or E, depending on whether the previous pattern is A or D. Then, the neural dynamics have to retain the information of the target A or D, to recall the target C or E correctly. Our model succeeded in recalling this sequence, as shown in Figure 5A. Just before the target C and E are recalled, there is no clear difference in the values of fast variables x, as indicated by the circles in Figure 5B. However, the values of slow variables y are different, depending on the previous targets shown in Figure 5C, which stabilize different patterns of x. Furthermore, we demonstrate that our model succeeded in recalling more complex sequences (M = 8) such as (ξ 1 1 , ξ 1 2 , · · · , ξ 1 8 ) = (A, B, C, D, E, B, C, F), as shown in Figures 5D-F. In this case, the neural dynamics have to keep three previous targets in memory to recall the target D or F after B and C. As expected, generating the sequence with M = 8 is a harder task than that with M = 6. However, some networks still can generate the sequence.
To understand the performance comprehensively, we measured the success rate in generating these sequences for different lengths. For the sequence with the length of M, the network is required to retain the information about M/2 − 1 preceding patterns. We examined the success rate for M = 6, 8, and 10 over 50 networks realizations in the same manner as that in the inference task. The success rate is reduced for the longer M and nearly zero for M = 10. This result indicates that our model can store three preceding patterns at a maximum, but is difficult to memorize four preceding patterns.
We, next, explored whether the model can memorize another type of the history-dependent sequence such as (ξ 1 1 , ξ 1 2 , · · · , ξ 1 6 ) = (A, B, A, C, A, D), as shown in Figure 6. The network is required to discriminate three neural states in the slow dynamics just before x approaches B, C, and D, as shown by circles in Figure 6C. When the network discriminates these states successfully, it generates the sequence adequately, as shown in Figures 6A-C. We measured the success rate in generating these sequences for different numbers of the states to be discriminated (namely, M/2 states for an M-pattern sequence) in Figure 6D. For the shortest case, the success rate takes less than 0.4.
As a final example of the complex sequences, we explored learning two history-dependent sequences (Figure 7), namely, (ξ 1 1 , ξ 1 2 , ξ 1 3 )=(A, B, C) upon η 1 , and (ξ 2 1 , ξ 2 2 , ξ 2 3 )=(C, B, A) upon η 2 . In these sequences, the flow A → B → C on the state space under η 1 should be reversed under η 2 . The learned network succeeds in generating these sequences. Although orbits of x under inputs almost overlap in the 2-dimensional space, those of y does not. This difference in y, in addition to inputs, allows the orbits of x in the reverse order of patterns. Generally, y is different depending on the history of the previous patterns and inputs even when x is the same. Different y stabilizes different fixed point of x, to generate the history-dependent sequence.
Generating these sequences is rather hard. Actually, the success rate is 0.14 for three-pattern sequences [(A, B, C) under The success rate in generating these sequences is shown for different lengths. For generating the sequence with M, the network is required to store the information about M/2 − 1 preceding patterns. We measured the success rate over 50 network realizations and plotted it as a function of the number of the preceding patterns to be stored. The success rate in generating these sequences is shown for different lengths. For generating the sequence with M, the network is required to discriminate M/2 neural states just before targets B, C, D, . . . are recalled. We measured the success rate over 50 network realizations and plotted it as a function of the number of the states to be discriminated. one input, (C, B, A) under the other input] and 0.08 for fourpattern sequences. The success rate is low even for the shortest sequences. The difficulty of this task could be attributed to the request that networks have to memorize the bidirectional transitions between the target patterns (namely, the transition from A to B and its reverse transition) dependent on the external input.

Timescale Dependence
Recall performance is highly dependent on the relation between τ x , τ y , and τ syn . To investigate the dependence of the performance on the timescales, we trained fifty realizations of networks for various values of timescales and calculated the success rate of training as a function of τ syn for different τ y by fixing τ x at 1, as are plotted after rescaling τ syn by τ y in Figure 8A. The ratios yield a common curve that shows an optimal value ∼ 1 at τ syn , approximately equal to τ y . As an exceptional case, the success rate for τ y = 10 yields a lower value for the optimal τ syn because τ y is too close to τ x to store the information about x. The balance between τ syn and τ y is important to regulate the success rate when they are sufficiently smaller than τ x .
To unveil the significance of the timescale balance, we, first, present how the recall is failed for τ y >> τ syn , (τ y = 100, τ syn = 10 in Figure 8B). Some of the targets are recalled sequentially in the wrong order, whereas other targets do not appear in the recall process. To uncover the underlying mechanism of the failed The success rate of recalls as functions of τ syn for given τ y . The curves of the success rates are rescaled by τ y . Different colors represent different τ y indicated by bars below panels. The success rate is calculated across fifty realizations for K = 1, M = 7. (B and C) The neural activities of x (upper) and y (lower) in the recall process are shown by using the overlaps with the targets in the same color as shown in Figure 2A. The neural dynamics for τ y = 100, τ syn = 10 are shown in B, while those for τ y = 100, τ syn = 1, 000 are shown in (C). recall, we analyze the neural dynamics of fast variables with slow variables quenched in a manner similar to that shown in Figure 2 (refer to Supplementary Figure S5A). Here, all the targets are stable for certain y, although ξ 1 2 does not appear in the recall process. We also found that fixed points corresponding to ξ 1 1 and ξ 1 2 do not coexist for any y: the fixed point corresponding to ξ 1 3 has a large basin across all y. This leads to a transition from ξ 1 1 to ξ 1 3 by skipping ξ 1 2 , and thus, the recall is failed. Interestingly, how a recall is failed for τ y << τ syn are distinct from that for τ y >> τ syn . For τ y = 100, τ syn = 1, 000, only the most recently learned target is stable for almost all y, and thus, only this target is recalled, as shown in Figure 8C. We sampled the slow variables from the last learning step of the sequence (Supplementary Figure S5B), and analyzed the bifurcation of the fast variables against change in slow variables, in the same way as above. In this study, only the latest target (here, ξ 1 3 ) is a fixed point, whereas the other targets are not. Thus, transitions between targets are missed, except the transition to the latest target.
Why does the network fail in generating the sequences when τ y and τ syn are not matched? The reason is as follows: the learning time for the non-optimal timescales takes a longer time than that for the optimal ones. In this study, we consider the time that is normalized by τ syn because it characterizes the timescale regulating the synaptic plasticity and, consequently, the neural dynamics. For τ y >> τ syn (Supplementary Figure S5C), the trajectories of neural activities are similar to those for τ y ∼ τ syn , but it takes a longer time for y to approach x after x converges to the target. As the approach to x is so slow, the learning process stabilizes the present target during the approach by modifying the connectivity, so that the target is too stable over a wide range of y. On the other hand, for τ y << τ syn (Supplementary Figure  S5E), the neural activities of x and y wander before x converges to the target, resulting in disruption of information about the previous targets stored in y. Thus, the networks for both cases of τ y >> τ syn and τ y << τ syn fail to generate the sequence. These results indicate that the relative timescale τ y to τ syn changes the bifurcation of the fast dynamics and the memory capacity.

DISCUSSION
Sequential transitions between metastable patterns are ubiquitously observed in the neural system (Miller, 2016) during various tasks, such as perception (Jones et al., 2007;Miller and Katz, 2010), decision making (Ponce-Alvarez et al., 2012), working memory (Stokes et al., 2013;Taghia et al., 2018),and recall of long-term memory (Wimmer et al., 2020). We have developed a novel neural network model with fast and slow dynamics to generate sequences with non-Markov property and concatenate sequences, which are based on these cognitive functions.
In a standard model for generating sequential patterns (Kleinfeld, 1986;Sompolinsky and Kanter, 1986;Nishimori et al., 1990;Russo and Treves, 2012;Recanatesi et al., 2015;Haga and Fukai, 2019), asymmetric Hebbian learning between a pattern µ and the next µ + 1, i.e., ξ µ+1 (ξ µ ) t , is used to create the transition from ξ µ to ξ µ+1 (Kleinfeld, 1986;Sompolinsky and Kanter, 1986;Russo and Treves, 2012;Recanatesi et al., 2015;Haga and Fukai, 2019). In these studies, however, only the connections between the current and immediately preceding patterns are embedded in the connectivity, resulting in that the prolonged history of the patterns cannot be embedded. In other studies, Rabinovich's group (Seliger et al., 2003;Rabinovich and Varona, 2018) proposed the model generating sequential activities by heteroclinic orbits between patterns. As the above standard model, asymmetric Hebbian learning forms the connectivity for generating the sequences (Seliger et al., 2003). The information about the history of patterns is not stored in the model. Thus, non-Markov sequences are not generated in contrast to our model 3 .
In some models, a term that changes slower than the neural dynamics (e.g., an adaptation term) is introduced to lead to the transition. In Gros (2007), Russo and Treves (2012), and Recanatesi et al. (2015), the slow term is introduced to destabilize the current pattern. These methods imply non-Markov dynamics because the slow term needs prolonged times to recover, leading to change in the transition probabilities among the patterns. However, this term does not determine the next pattern and, thus, some additional mechanism is necessary for the transition to the desired pattern. The feedback from the slow population in our model, in contrast, not only destabilizes the current pattern but also simultaneously stabilizes the next targeted pattern. As the current and next patterns coexist for some time, the robust transition between them is achieved.
Alternatively, supervised learning methods used in machine learning fields, such as Back-Propagation Through Time (BPTT) (Werbos, 1990), are investigated to reproduce sequential neural activities observed experimentally (Mante et al., 2013;Carnevale et al., 2015;Chaisangmongkon et al., 2017), including non-Markov trajectories (Sussillo and Abbott, 2009;Laje and Buonomano, 2013). The BPTT, however, requires non-local information and the network has to retain a large amount of information until the trajectory terminates, which is biologically implausible. Furthermore, the trajectories shaped by this method are vulnerable to noise (Laje and Buonomano, 2013). Our model is free from these deficiencies.
In our model, the recurrent connections in the fast population (i.e., connections within a cortical area) are modified to shape the transitions between memorized states whereas the connections between the fast and slow populations and those from the input to the fast population are fixed (i.e., connections across cortical areas). In another approach, Gros and Kaczor (2010) demonstrated that the plasticity in the afferent connections with the fixed recurrent connections is useful for semantic learning, by connecting appropriately external stimuli with already established neural patterns in the recurrent network. In the neural system, generally, both connections across and within cortical areas are plastic. The existence of both dual plasticity possibly leads to interference between these connections, potentially resulting in reducing the learning performance. Future studies are needed to clarify how such dual plasticity cooperatively builds neural activities to perform cognitive functions.
Timescales in the neural activities are hierarchically distributed across several cortical areas (Honey et al., 2012;Murray et al., 2014;Hasson et al., 2015;Runyan et al., 2017). For instance, consider the hippocampus (HPC) and the prefrontal cortex (PFC), which are coupled by mono-synaptic and disynaptic connections (Ito et al., 2015). HPC neurons respond to the location of animals (Kumaran et al., 2016) with faster timescales than those in PFC, which has the slowest timescale among cortical areas (Murray et al., 2014). Experimental studies (Ito et al., 2015;Guise and Shapiro, 2017) revealed that PFC neurons are necessary to differentiate HPC dynamics depending on the context and previous experience. Similarly, neurons in the orbitofrontal cortex (OFC), whose timescales are considered to be slower than those in HPC, are necessary for concatenating the sequences in the stimulus-reward response (Jones et al., 2012;Wikenheiser and Schoenbaum, 2016). Accordingly, it is suggested that the area with the slow dynamics is necessary to generate and concatenate the sequences.
Neural networks with multiple timescales are investigated theoretically in several studies. In some studies (Yamashita and Tani, 2008;Perdikis et al., 2011), the slow dynamics are introduced to concatenate primitive movements and produce a complex movement, while hidden states of the hierarchical external stimuli are inferred by the multiple timescales in the neural dynamics in another study (Kiebel et al., 2009). In Kiebel et al. (2009) and Perdikis et al. (2011), the relationship between the slow and fast dynamics are fixed a priori to perform their tasks, whereas, in our model, such a relationship is shaped through the learning process. In Yamashita and Tani (2008), the BPTT method is adopted for training the network; thus, it faced the same drawbacks as already mentioned. In the studies of the multiple timescales system, analytical methods such that singular perturbation methods are adopted, which are commonly used to elucidate the transition between the states on the different slow manifolds (Ermentrout, 1998;Rubin et al., 2013;Bertram and Rubin, 2017;Wernecke et al., 2018) and the stability of fixed points (Meyer-Bäse et al., 1996;Hongtao and Amari, 2006). Our model provides how these transitions are formed through learning and it generates and concatenates the history-dependent sequences, while the application of these methods will be useful in the future.
As for the timescales, we need further studies to fill a gap between our model and experimental observations. The ratio of the timescale in the slow dynamics to that in the fast dynamics is less than 10 times across cortical areas (Wang and Kennedy, 2016), which are smaller than the optimal ratio in our model. Further, the difference between the timescales in the slow dynamics (on the order of a second) and in the synaptic plasticity (on the order of a minute Bliss and Lomo, 1973;Bayazitov et al., 2007) is larger than that adopted in our model. Diversity in the timescales of individual neurons and the calcium dynamics possibly resolve this discrepancy. The timescale of individual neurons in the same area is distributed over two digits (Bernacchia et al., 2011;Wasmuht et al., 2018). The calcium dynamics in the synapses can modify the synaptic efficacy on the order of a second (Shouval et al., 2010;Graupner and Brunel, 2012). By taking these effects into account, our model may be consistent with the experimental observations, although further studies will be important, including those with spiking neurons (Kurikawa and Kaneko, 2015) and spike-timingdependent potentiation.
Finally, we discuss the biological plausibility of the learning rule in our model. The fast network receives two inputs; an external input (η) and the input from the slow network. In the neural system, the external input is conveyed through afferent connections from a lower cortical area (or sensory input) and the feedback input comes from a higher cortical area. In addition to these inputs, another input for the learning is introduced in our model, which provides information to generate sequential patterns to be learned. Thus, our network is trained to map between sensory cues and sequential patterns in the output area by using a Hebbian rule (correlation between ξ j and x i ) and an anti-Hebbian rule (correlation between x i and x j ). After training, the network evokes the sequential patterns under the sensory input.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.