Reinforcement Learning With Low-Complexity Liquid State Machines

We propose reinforcement learning on simple networks consisting of random connections of spiking neurons (both recurrent and feed-forward) that can learn complex tasks with very few trainable parameters. Such sparse and randomly interconnected recurrent spiking networks exhibit highly non-linear dynamics that transform the inputs into rich high-dimensional representations based on the current and past context. These random input representations can be efficiently interpreted by an output (or readout) layer with trainable parameters. Systematic initialization of the random connections and training of the readout layer using the Q-learning algorithm enable such small random spiking networks to learn optimally and achieve the same learning efficiency as humans on complex reinforcement learning (RL) tasks like Atari games. Moreover, the sparse recurrent connections cause these networks to retain fading memory of past inputs, enabling them to perform temporal integration across successive RL time-steps and learn with partial state inputs. This spike-based approach using small random recurrent networks provides a computationally efficient alternative to state-of-the-art deep reinforcement learning networks with several layers of trainable parameters.


Introduction
High degree of recurrent connectivity among neuronal populations is a key attribute of neural microcircuits in the cerebral cortex and many other brain regions [1,2,3]. Such common structure suggests the existence of a general principle for information processing. However, the principle underlying information processing in such recurrent populations of spiking neurons remains largely elusive due to the complexity of training large recurrent Spiking Neural Networks (SNNs). In this regard, reservoir computing architectures [4,5,6] were proposed to minimize the training complexity of large recurrent neuronal populations. The Liquid State Machine (LSM) [4,5] is a recurrent SNN consisting of an input layer sparsely connected to a randomly interlinked reservoir (or liquid) of excitatory and inhibitory spiking neurons, whose activations are passed on to a readout (or output) layer, trained using supervised algorithms, for inference. The key attribute of an LSM is that the input-to-liquid and the recurrent excitatory↔inhibitory synaptic connectivity matrices and weights are fixed a priori. The LSM effectively utilizes the rich nonlinear dynamics of Leaky-Integrate-and-Fire spiking neurons [7] and the sparse random input-to-liquid and recurrent-liquid synaptic connectivity for processing spatio-temporal inputs. At any time instant, the spatio-temporal inputs are transformed into a high-dimensional representation, referred to as the liquid state (or spike pattern), which evolves dynamically based on decaying memory of the past inputs. The memory capacity of the liquid is dictated by its size and degree of recurrent connectivity. Although the LSM, by construction, does not have stable instantaneous internal states like Turing machines [8] or attractor neural networks [9], prior studies have successfully trained the readout layer using liquid activations, estimated by integrating the liquid states (spikes) over time, for speech recognition [4,10,11,12], image recognition [13], gesture recognition [14,15], and sequence generation tasks [16,17,18].
In this work, we propose such sparse, randomly interlinked, low-complexity LSMs for solving complex Reinforcement Learning (RL) tasks, which involve an autonomous agent (modeled using the LSM) trained to select actions in a manner that maximizes the expected future rewards received from the environment. For instance, a robot (agent) learning to navigate a maze (environment) based on the reward and punishment received from the environment is an example RL task. At any given time, the environment state (converted to spike trains) is fed to the liquid, which produces a high-dimensional liquid state (spike pattern) based on decaying memory of the past environment states. We present an optimal initialization strategy for the fixed input-to-liquid and recurrent-liquid synaptic connectivity matrices and weights to enable the liquid to produce high-dimensional representations that lead to efficient training of the liquid-to-readout weights. Artificial rate-based neurons in the readout layer take the liquid activations and produce action-values to guide action selection for a given environment state. The liquid-to-readout weights are trained using the Q-learning RL algorithm proposed for deep learning networks [19]. In RL theory [20], the Q-value, also known as the action-value, estimates the expected future rewards for a state-action pair, specifying how good the action is for the current environment state. The readout layer of the LSM contains as many neurons as the number of possible actions for a particular RL task. At any given time, the readout neurons predict the Q-values for all possible actions based on the high-dimensional state representation provided by the liquid. The liquid-to-readout weights are then trained using backpropagation [21] to minimize the error between the Q-values predicted by the LSM and the target Q-values estimated from RL theory [22], as described in subsection 3.2. We adopt the ε-greedy policy (explained in subsection 3.2) to select the appropriate action based on the predicted Q-values during training and evaluation. Under the ε-greedy policy, many random actions are picked in the beginning of the training phase to better explore the environment. Towards the end of training and during inference, the action corresponding to the maximum Q-value is selected with higher probability to exploit the learnt experiences. We first demonstrate results for training the readout weights, based on the high-dimensional representations provided by the liquid as a result of the sparse recurrent-liquid connectivity, on the simple cartpole-balancing RL task [20]. We then comprehensively validate the capability of the LSM and the presented training methodology on complex RL tasks like Pacman [23] and Atari games [24]. We note that the LSM has been previously trained using Q-learning for RL tasks pertaining to robotic motion control [25,26,27]. We demonstrate and benchmark the efficacy of appropriately initialized LSMs for solving RL tasks commonly used to evaluate deep reinforcement learning networks. In essence, this work provides a promising step towards incorporating bio-plausible low-complexity recurrent SNNs like LSMs for complex RL tasks, which can potentially lead to much improved energy efficiency in event-driven asynchronous neuromorphic hardware implementations [28,29].

Liquid State Machine: Architecture and Initialization
The Liquid State Machine (LSM) consists of an input layer sparsely connected via fixed synaptic weights to a randomly interlinked liquid of excitatory and inhibitory spiking neurons, followed by a readout layer, as depicted in Figure 1. The input layer (denoted by P) is modeled as a group of excitatory neurons that spike based on the input environment state following a Poisson process. The sparse input-to-liquid connections are initialized such that each excitatory neuron in the liquid receives synaptic connections from approximately K random input neurons. This guarantees uniform excitation of the liquid-excitatory neurons by the external input spikes. The fixed input-to-liquid synaptic weights are chosen from a uniform distribution between 0 and α as shown in Table 3, where α is the maximum bound imposed on the weights. The liquid consists of excitatory neurons (denoted by E) and inhibitory neurons (denoted by I) recurrently connected in a sparse random manner, as illustrated in Figure 1. The number of excitatory neurons is chosen to be 4× the number of inhibitory neurons, as observed in cortical circuits [30]. We use the Leaky-Integrate-and-Fire (LIF) model [7] to mimic the dynamics of both excitatory and inhibitory spiking neurons, as described by the following differential equations:

τ · dV_i(t)/dt = (V_rest − V_i(t)) + I_i(t)    (1)

I_i(t) = Σ_{l=1..N_P} W_li δ(t − t_l) + Σ_{j=1..N_E} W_ji δ(t − t_j) − Σ_{k=1..N_I} W_ki δ(t − t_k)    (2)
Figure 1: Illustration of the LSM architecture consisting of an input layer sparsely connected via fixed synaptic weights to a randomly and recurrently connected reservoir (or liquid) of excitatory and inhibitory spiking neurons, followed by a readout layer composed of artificial rate-based neurons.
where V_i is the membrane potential of the i-th neuron in the liquid, V_rest is the resting potential to which V_i decays, with time constant τ, in the absence of input current, I_i(t) is the instantaneous current projecting into the i-th neuron, and N_P, N_E, and N_I are the number of input, excitatory, and inhibitory neurons, respectively. The instantaneous current is a sum of three terms: current from input neurons, current from excitatory neurons, and current from inhibitory neurons. The first term integrates the sum of pre-synaptic spikes, denoted by δ(t − t_l) where t_l is the time instant of pre-spikes, with the corresponding synaptic weights (W_li in Equation 2). Likewise, the second (third) term integrates the sum of pre-synaptic spikes from the excitatory (inhibitory) neurons, denoted by δ(t − t_j) (δ(t − t_k)), with the respective weights W_ji (W_ki) in Equation 2. The neuronal membrane potential is updated with the sum of the input, excitatory, and negative inhibitory currents, as shown in Equation 1. When the membrane potential reaches a certain threshold V_thres, the neuron fires an output spike. The membrane potential is thereafter reset to V_reset and the neuron is restrained from spiking for an ensuing refractory period by holding its membrane potential constant. The LIF model parameters for the excitatory and inhibitory neurons are listed in Table 4.
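For concreteness, the discretized LIF update corresponding to Equations 1 and 2 can be sketched in NumPy as follows; the constants below are illustrative placeholders rather than the values listed in Table 4.

```python
import numpy as np

def lif_step(v, i_syn, refrac, dt=1.0, tau=20.0, v_rest=-65.0,
             v_thres=-52.0, v_reset=-65.0, t_refrac=2.0):
    """One dt-millisecond update of Equation 1 for a vector of LIF neurons.

    v      : membrane potentials
    i_syn  : per-neuron instantaneous synaptic current (Equation 2)
    refrac : remaining refractory time per neuron (ms)
    """
    active = refrac <= 0.0                    # neurons free to integrate
    dv = ((v_rest - v) + i_syn) * (dt / tau)  # leak toward v_rest plus input drive
    v = np.where(active, v + dv, v)           # potential held constant while refractory
    spikes = v >= v_thres                     # threshold crossing -> output spike
    v = np.where(spikes, v_reset, v)          # reset after spiking
    refrac = np.where(spikes, t_refrac, np.maximum(refrac - dt, 0.0))
    return v, spikes, refrac
```

Iterating this step over the LSM-simulation period, with i_syn recomputed from the current spike maps at every step, reproduces the dynamics described above.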
There are four types of recurrent synaptic connections in the liquid, namely, E→E, E→I, I→E, and I→I. We express each connection in the form of a matrix that is initialized to be sparse and random, which causes the spiking dynamics of a particular neuron to be independent of most other neurons and maintains separability in the neuronal spiking activity. However, the degree of sparsity needs to be tuned to achieve rich network dynamics. We find that excessive sparsity (reduced connectivity) leads to weakened interaction between the liquid neurons and renders the liquid memoryless. On the contrary, lower sparsity (increased connectivity) results in chaotic spiking activity, which eliminates the separability in neuronal spiking activity. We initialize the connectivity matrices such that each excitatory neuron receives approximately C synaptic connections from inhibitory neurons, and vice versa. The hyperparameter C is tuned empirically, as discussed in subsection 4.1, to avoid common chaotic spiking activity problems that occur when (1) excitatory neurons connect to each other and form a loop that always leads to positive drift in membrane potential, and when (2) an excitatory neuron connects to itself and repeatedly gets excited by its own activity. Specifically, for the first situation, we allow non-zero elements in the connectivity matrix E→E (denoted by W_EE) only at locations where elements in the product of the connectivity matrices E→I and I→E (denoted by W_EI and W_IE, respectively) are non-zero. This ensures that excitatory synaptic connections are created only for those neurons that also receive inhibitory synaptic connections, which mitigates the possibility of continuous positive drift in the respective membrane potentials. To circumvent the second situation, we force the diagonal elements of W_EE to be zero and eliminate the possibility of repeated self-excitation. Throughout this work, we create a recurrent connectivity matrix for a liquid with m excitatory neurons and n
inhibitory neurons by forming an m × n matrix whose values are randomly drawn from a uniform distribution between 0 and 1. A connection is formed between those pairs of neurons where the corresponding matrix entries are less than the target connection probability (= C/m). For illustration, consider a liquid with m = 1000 excitatory and n = 250 inhibitory neurons. In order to create the E→I connectivity matrix such that each inhibitory neuron receives a synaptic connection from a single excitatory neuron (C = 1), we first form a 1000 × 250 random matrix whose values are drawn from a uniform distribution between 0 and 1. We then create a connection between those pairs of neurons where the matrix entries are less than 0.1% (1/1000). A similar process is repeated for the I→E connections. We then initialize the E→E connections based on the product of W_EI and W_IE. Similarly, the connectivity matrix for I→I (denoted by W_II) is initialized based on the product of W_IE and W_EI. The connection weights are initialized from a uniform distribution between 0 and β, as shown in Table 3, for the different recurrent connectivity matrices. Note that the weights of the synaptic connections from inhibitory neurons are greater than those for synaptic connections from excitatory neurons to account for the lower number of inhibitory neurons relative to excitatory neurons. Stronger inhibitory connection weights help ensure that every neuron receives similar amounts of excitatory and inhibitory input current, which improves the stability of the liquid, as experimentally validated in subsection 4.1.
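The initialization procedure above can be sketched in NumPy as follows; the weight bounds passed as beta are illustrative placeholders rather than the exact values from Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_conn(n_pre, n_post, c, beta, rng):
    """Each post-synaptic neuron receives ~c connections from the pre
    population (connection probability c / n_pre); non-zero weights are
    drawn uniformly from (0, beta)."""
    mask = rng.uniform(size=(n_pre, n_post)) < c / n_pre
    return mask * rng.uniform(0.0, beta, size=(n_pre, n_post))

m, n, C = 1000, 250, 4
W_EI = sparse_conn(m, n, C, beta=1.0, rng=rng)  # E -> I
W_IE = sparse_conn(n, m, C, beta=4.0, rng=rng)  # I -> E (stronger, see text)

# E->E connections are allowed only where the E->I->E product is non-zero,
# so every recurrently excited neuron also receives inhibition; the zeroed
# diagonal rules out self-excitation.
ee_mask = (W_EI @ W_IE) > 0
np.fill_diagonal(ee_mask, False)
W_EE = ee_mask * rng.uniform(0.0, 1.0, size=(m, m))

# I->I connections are restricted analogously via the I->E->I product.
ii_mask = (W_IE @ W_EI) > 0
W_II = ii_mask * rng.uniform(0.0, 4.0, size=(n, n))
```

Note that the two structural constraints (masking W_EE by the E→I→E product and zeroing its diagonal) directly implement the two fixes for chaotic activity discussed above.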
The liquid-excitatory neurons are fully connected to artificial rate-based neurons in the readout layer for inference. The readout layer, which consists of as many output neurons as the number of actions for a given RL task, uses the average firing rate/activation of the excitatory neurons to predict the Q-value for every state-action pair. We translate the liquid spiking activity to an average rate by accumulating the excitatory neuronal spikes over the time period for which the input (current environment state) is presented. We then normalize the spike counts by the maximum possible spike count over the LSM-simulation period, computed as the LSM-simulation period divided by the simulation time-step, to obtain the average firing rates of the excitatory neurons that are fed to the readout layer. Since the number of excitatory neurons is larger than the number of output neurons in the readout layer, we gradually reduce the dimension by introducing an additional fully-connected hidden layer between the liquid and the output layer. We use the ReLU non-linearity [31] after the first hidden layer but none after the final output layer, since the Q-values are unbounded and can assume positive or negative values. We train the synaptic weights constituting the fully-connected readout layer using the Q-learning based training methodology described in the following subsection 3.2.
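A minimal sketch of this readout computation is shown below; the weight matrices, the 500 ms simulation period, and the 1 ms time-step are hypothetical values chosen for illustration.

```python
import numpy as np

def liquid_activations(spike_counts, t_lsm=500, dt=1):
    """Normalize accumulated excitatory spike counts by the maximum
    possible count (simulation period / time-step), giving rates in [0, 1]."""
    return np.asarray(spike_counts, dtype=float) / (t_lsm / dt)

def readout_q_values(rates, w_hid, b_hid, w_out, b_out):
    """Fully-connected readout: ReLU hidden layer, then a linear output
    layer (no output non-linearity, since Q-values are unbounded)."""
    h = np.maximum(rates @ w_hid + b_hid, 0.0)  # hidden layer with ReLU
    return h @ w_out + b_out                    # one Q-value per action
```

The output vector has one entry per action, matching the readout layer described above.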

Q-Learning Based LSM Training Methodology
Reinforcement Learning (RL) tasks fundamentally involve an agent (for instance, a robot) that is trained to navigate a certain environment (for instance, a maze) in a manner that maximizes the total rewards in the future. Formally, at any time instant t, the agent receives the environment state s_t and picks action a_t from the set of all possible actions. After the environment receives the action a_t, it transitions to the next state based on the chosen action and feeds back an immediate reward r_{t+1} and the new environment state s_{t+1}. As mentioned in the beginning, the goal of the agent is to maximize the accumulated reward in the future, which is mathematically expressed as

R_t = Σ_{k=0..∞} γ^k r_{t+k+1}

where γ ∈ [0, 1] is the discount factor that determines the relative significance attributed to immediate and future rewards. If γ is chosen to be 0, the agent maximizes only the immediate reward. However, as γ approaches unity, the agent learns to maximize the accumulated reward in the future. Q-learning [22] is a widely used RL algorithm that enables the agent to achieve this objective by computing the state-action value function (commonly known as the Q-function), which is the expected future reward for a state-action pair, specified by

Q_π(s, a) = E[R_t | s_t = s, a_t = a, π]

where Q_π(s, a) measures the value of choosing an action a when in state s following a policy π. If the agent follows the optimal policy (denoted by π*) such that Q_π*(s, a) = max_π Q_π(s, a), the Q-function can be estimated recursively using the Bellman optimality equation, described by

Q_π*(s_t, a_t) = E[r_{t+1} + γ max_{a_{t+1}} Q_π*(s_{t+1}, a_{t+1})]

where Q_π*(s, a) is the Q-value for choosing action a from state s following the optimal policy π*, r_{t+1} is the immediate reward received from the environment, and Q_π*(s_{t+1}, a_{t+1}) is the Q-value for selecting action a_{t+1} from the next environment state s_{t+1}. Learning the Q-values for all possible state-action pairs is intractable for practical RL applications. Popular approaches approximate the Q-function using deep convolutional neural networks [19,32,33,34].
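To illustrate the Bellman optimality backup in isolation, the following toy example applies it to an invented three-state chain MDP with a tabular Q-function; the paper itself approximates the Q-function with the LSM readout rather than a table.

```python
import numpy as np

# Toy deterministic chain MDP (invented for illustration): states 0..2,
# actions 0 (left) and 1 (right); arriving at state 2 yields reward 1.
n_states, n_actions, gamma = 3, 2, 0.9
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic transition plus immediate reward."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

# Sweep the Bellman optimality backup until the Q-table converges;
# the greedy policy then reads off argmax over actions in each state.
for _ in range(200):
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = step(s, a)
            Q[s, a] = r + gamma * Q[s_next].max()
```

After convergence, moving right (toward the rewarding state) dominates moving left in every state, as the Q-values reflect the discounted future reward.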
In this work, we model the agent using an LSM, wherein the liquid-to-readout weights are trained to approximate the Q-function as described below. At any time instant t, we map the current environment state vector s_t to input neurons firing at rates constrained between 0 and φ Hz over a certain time period (denoted by T_LSM) following a Poisson process. The maximum Poisson firing rate φ is tuned to ensure sufficient input spiking activity for a given RL task. We follow the method outlined in [35] to generate the Poisson spike trains, as explained below. For a particular input neuron in the state vector, we first compute the probability of generating a spike at every LSM-simulation time-step based on the corresponding Poisson firing rate. Note that the time-steps in the RL task are orthogonal to the time-steps used for the numerical simulation of the liquid. Specifically, in between successive time-steps t and t+1 in the RL task, the liquid is simulated for a time period of T_LSM with 1 ms separation between consecutive LSM-simulation time-steps. The probability of producing a spike at any LSM-simulation time-step is obtained by dividing the corresponding firing rate (in Hz) by 1,000. We generate a random number drawn from a uniform distribution between 0 and 1, and produce a spike if the random number is less than the neuronal spiking probability. At every LSM-simulation time-step, we feed the spike map of the current environment state and record the spiking outputs of the liquid-excitatory neurons. We accumulate the excitatory neuronal spikes and normalize the individual neuronal spike counts by the maximum possible spike count over the LSM-simulation period to obtain the high-dimensional representation (activation) of the environment state, as discussed in the previous subsection 3.1. It is important to note that appropriate initialization of the LSM (detailed in subsection 3.1) is necessary to obtain useful high-dimensional representations for efficient training of the liquid-to-readout weights, as experimentally validated in section 4.
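The spike-generation procedure can be sketched as follows, assuming a 1 ms LSM-simulation time-step; the 500 ms period and the example rates are illustrative.

```python
import numpy as np

def poisson_spikes(rates_hz, t_lsm_ms, rng):
    """Poisson spike trains for one RL time-step: a neuron with rate f Hz
    spikes at a given 1 ms simulation step with probability f / 1000."""
    p = np.asarray(rates_hz) / 1000.0                 # per-step spike probability
    return rng.uniform(size=(t_lsm_ms, p.size)) < p   # boolean (time, neuron) spike map

rng = np.random.default_rng(0)
# 100 Hz neuron emits ~50 spikes on average over 500 ms; silent neuron emits none.
spikes = poisson_spikes([100.0, 0.0], t_lsm_ms=500, rng=rng)
```

Each row of the returned array is the spike map fed to the liquid at one LSM-simulation time-step.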
The high-dimensional liquid activations are fed to the readout layer, which is trained using backpropagation to approximate the Q-function by minimizing the mean square error between the Q-values predicted by the readout layer and the target Q-values, following [19], as described by the following equations:

θ_{t+1} = θ_t + η (Y_t − Q(s_t, a_t|θ_t)) ∇_{θ_t} Q(s_t, a_t|θ_t)

Y_t = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}|θ_t)

where θ_{t+1} and θ_t are the updated and previous synaptic weights in the readout layer, respectively, η is the learning rate, Q(s_t, a_t|θ_t) is the vector representing the Q-values predicted by the readout layer for all possible actions given the current environment state s_t using the previous readout weights, ∇_{θ_t} Q(s_t, a_t|θ_t) is the gradient of the Q-values with respect to the readout weights, and Y_t is the vector containing the target Q-values, obtained by feeding the next environment state s_{t+1} to the LSM while using the previous readout weights. To encourage exploration during training, we follow the ε-greedy policy [36] for selecting the actions based on the Q-values predicted by the LSM. Under the ε-greedy policy, we select a random action with probability ε and the optimal action, i.e., the action pertaining to the highest Q-value, with probability (1 − ε) during training. Initially, ε is set to a large value (close to unity), permitting the agent to pick many random actions and effectively explore the environment. As training progresses, ε gradually decays to a small value, allowing the agent to exploit its past experiences. During evaluation, we similarly follow the ε-greedy policy, albeit with much smaller ε, so that there is a strong bias towards exploitation. Employing the ε-greedy policy during evaluation also serves to mitigate the negative impact of over-fitting or under-fitting. In an effort to further improve stability during training and achieve better generalization performance, we use the experience replay technique proposed by [19]. Based on experience replay, we store the experience discovered at each time-step (i.e., s_t, a_t, r_t, and s_{t+1}) in a large table and later train the LSM by sampling mini-batches of experiences in a random manner over multiple training epochs, leading to improved generalization performance. For all the experiments reported in this work, we use the RMSProp algorithm [37] as the optimizer for error backpropagation with a mini-batch size of 32. We adopt the ε-greedy policy, wherein ε gradually decays from 1 to 0.001–0.1 over the first 10% of the training steps. The replay memory stores one million recently played frames, which are then used for mini-batch weight updates that are carried out after the initial 100 training steps. The simulation parameters for Q-learning are summarized in Table 5.
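A minimal sketch of ε-greedy action selection with linear decay and a replay buffer is given below; the capacity and decay schedule here are small illustrative values, not the paper's settings (one million frames, decay over the first 10% of training steps).

```python
import random
from collections import deque

import numpy as np

class EpsilonGreedyAgent:
    """Epsilon-greedy action selection with linear decay plus a replay buffer."""

    def __init__(self, n_actions, eps_start=1.0, eps_end=0.05,
                 decay_steps=10_000, capacity=100_000):
        self.n_actions = n_actions
        self.eps_start, self.eps_end = eps_start, eps_end
        self.decay_steps = decay_steps
        self.replay = deque(maxlen=capacity)  # oldest experiences are evicted
        self.step = 0

    def epsilon(self):
        # Linear decay from eps_start to eps_end over decay_steps actions.
        frac = min(self.step / self.decay_steps, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def act(self, q_values):
        self.step += 1
        if random.random() < self.epsilon():
            return random.randrange(self.n_actions)  # explore
        return int(np.argmax(q_values))              # exploit

    def remember(self, s, a, r, s_next):
        self.replay.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of stored experiences for the update.
        return random.sample(list(self.replay), batch_size)
```

Each sampled mini-batch supplies the (s_t, a_t, r_t, s_{t+1}) tuples needed to form the targets Y_t for the readout-weight update.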

Experimental Results
We first present results motivating the importance of careful LSM initialization for obtaining rich high-dimensional state representations, which are necessary for efficient training of the liquid-to-readout weights. We then demonstrate the utility of the recurrent-liquid synaptic connections and of careful LSM initialization using the classic cartpole-balancing RL task [20]. Finally, we validate the capability of an appropriately initialized LSM, trained using the presented methodology, for solving complex RL tasks like Pacman [23] and Atari games [24].

LSM Hyperparameter Tuning
Initializing the LSM with appropriate parameters is an important step in constructing a model that produces useful high-dimensional representations. Since the input-to-liquid and recurrent-liquid connectivity matrices of the LSM are fixed a priori during training, how these connections are initialized dictates the liquid dynamics. We choose the parameters K (governing the input-to-liquid connectivity matrix) and C (governing the recurrent-liquid connectivity matrices) empirically based on three observations: (1) stable spiking activity of the liquid, (2) eigenvalue analysis of the recurrent connectivity matrices, and (3) development of the liquid-excitatory neuron membrane potential. The spiking activity of the liquid is said to be stable if every finite stream of inputs results in a finite period of response. Sustained activity indicates that small input noise can perturb the liquid state and lead to chaotic activity that is no longer dependent on the input stimuli. It is impractical to analyze the stability of the liquid for all possible input streams within a finite time. We therefore investigate the liquid stability by feeding in random input stimuli and sampling the excitatory neuronal spike counts at regular time intervals over the LSM-simulation period for different values of K and C. We separately adjust these parameters for each learning task using random representations of the environment from the games. The values of K and C are experimentally determined to be 3 and 4 for the cartpole and Pacman experiments, respectively, which ensures stable liquid spiking activity while enabling the liquid to exhibit fading memory of the past inputs. Fading memory indicates that the liquid retains input information for a short period of time after the input stimuli are cut off.
Analyzing the eigenvalue spectrum of the recurrent connectivity matrix is another tool to assess the stability of the liquid. Each eigenvalue in the spectrum represents an individual mode of the liquid. The real part indicates the decay rate of the mode, while the imaginary part corresponds to the frequency of the mode [38]. The liquid spiking activity remains stable as long as all eigenvalues remain within the unit circle. However, this condition is not easily met for realistic recurrent-liquid connections with random synaptic weight initialization [39]. We constrain the recurrent weights (hyperparameter β) such that each neuron receives balanced excitatory and inhibitory synaptic currents, as previously discussed in subsection 3.1. This results in eigenvalues that lie within the unit circle, as illustrated in Figure 2(A). In order to emphasize the importance of LSM initialization, we also show the eigenvalue spectrum of the recurrent-liquid connectivity matrix when the weights are not properly initialized in Figure 2(B), where many eigenvalues are outside the unit circle. Finally, we also use the development of the excitatory neuronal membrane potential to guide hyperparameter tuning. The hyperparameters C and β are chosen to ensure that the membrane potential exhibits balanced fluctuation, as illustrated in Figure 2(C), which plots the membrane potential of 10 randomly picked neurons in the liquid for hyperparameter C=4, with a random representation from the cartpole-balancing problem used as the input.
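One way to carry out this eigenvalue check numerically is sketched below. Assembling the four connectivity matrices into a single signed matrix, with the inhibitory rows negated, is an assumption about how the spectrum is formed, and the matrix sizes and weight bounds in the test usage are hypothetical.

```python
import numpy as np

def spectral_radius(w_ee, w_ei, w_ie, w_ii):
    """Largest eigenvalue magnitude of the combined recurrent matrix;
    values below 1 indicate that every mode of the liquid decays."""
    w = np.block([[w_ee, w_ei],       # excitatory rows: positive weights
                  [-w_ie, -w_ii]])    # inhibitory rows enter with a negative sign
    return np.abs(np.linalg.eigvals(w)).max()
```

Because eigenvalues scale linearly with the weights, reducing the bound β directly shrinks the spectral radius, which is consistent with constraining β to keep the spectrum inside the unit circle.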

Learning to Balance a Cartpole
Cartpole-balancing is a classic control problem wherein the agent has to balance a pole attached to a wheeled cart that can move freely on a rail of certain length, as shown in Figure 3(A). The agent can exert a unit force on the cart either to the left or right side to balance the pole and keep the cart within the rail. The environment state is characterized by the cart position, cart velocity, pole angle, and pole angular velocity, designated by the tuple (χ, χ̇, ϕ, ϕ̇). The environment returns a unit reward every time-step and concludes after 200 time-steps if the pole does not fall and the cart does not go off the rail. Because the game is played for a finite time period, we constrain (χ, χ̇, ϕ, ϕ̇) to be within the ranges specified by (±2.5, ±0.5, ±0.28, ±0.88) for efficiently mapping the real-valued state inputs to spike trains feeding into the LSM. Each real-valued state input is mapped to 10 input neurons whose firing rates are proportional to a one-hot encoding of the input value representing 10 distinct levels within the corresponding range. We model the agent using an LSM containing 150 liquid neurons, 32 hidden neurons in the fully-connected layer between the liquid and output layer, and 2 output neurons. The maximum firing rate for the input neurons representing the environment state is set to 100 Hz. The LSM is trained for 10^5 time-steps, which are equally divided into 100 training epochs containing 1,000 time-steps per epoch. After each epoch, the LSM is evaluated for 1,000 time-steps with the probability of choosing a random action set to 0.05. Note that the LSM is evaluated for 1,000 time-steps (multiple gameplays) even though a single gameplay lasts a maximum of only 200 time-steps, as mentioned in the previous paragraph. We use the accumulated reward averaged over multiple gameplays as the true indicator of the LSM (agent) performance to account for the randomness in action-selection as a result of the ε-greedy policy. We train the LSM initialized with 10 different random seeds and report the median accumulated reward, as shown in Figure 3(B). Note that the maximum possible accumulated reward per gameplay is 200 since each gameplay lasts at most 200 time-steps. The increase in median accumulated reward over epochs indicates that the LSM learnt to balance the cartpole using the dynamically evolving high-dimensional liquid states. The ability of the liquid to provide rich high-dimensional input representations can be attributed to the careful initialization of the connectivity matrices and weights (explained in subsection 3.1), which ensures balance between the excitatory and inhibitory currents to the liquid neurons and preserves fading memory of past liquid activity. However, the median accumulated reward after 100 training epochs saturates around 125 and does not reach the maximum value of 200. We hypothesize that the game score saturation comes from the quantized representation of the environment state, and demonstrate in the following experiment with Pacman that the LSM can learn optimally given a better state representation. Finally, in order to emphasize the importance of LSM initialization, we also show the median accumulated reward per training epoch for training in which the LSM is initialized to have few synaptic connections. Figure 3(C) indicates that the median accumulated reward is around 90 when the LSM initialization is suboptimal.
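The state-to-firing-rate mapping described above can be sketched as follows; the exact bin-assignment rule is an assumption beyond what the text specifies.

```python
import numpy as np

def encode_state(values, bounds, n_bins=10, max_rate=100.0):
    """Map each real-valued state variable to n_bins input neurons with
    one-hot firing rates: the neuron whose bin contains the (clipped)
    value fires at max_rate Hz, the remaining neurons stay silent."""
    rates = []
    for v, b in zip(values, bounds):
        clipped = np.clip(v, -b, b)                               # keep value in [-b, b]
        bin_idx = min(int((clipped + b) / (2 * b) * n_bins), n_bins - 1)
        one_hot = np.zeros(n_bins)
        one_hot[bin_idx] = max_rate                               # one active neuron per variable
        rates.append(one_hot)
    return np.concatenate(rates)  # e.g. 4 variables -> 40 input neurons
```

For the cartpole state, this yields 40 firing rates with exactly one active neuron per 10-neuron group, which are then converted to Poisson spike trains as described in subsection 3.2.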
To visualize the learnt action-value function guiding action selection, we compare the Q-values produced by the LSM during evaluation in three different scenarios depicted in Figure 3(D). Note that each Q-value represents how good the corresponding action is for a given environment state. In scenario 1 (see Figure 3(D)-1), which corresponds to the beginning of the gameplay wherein the pole is almost balanced, the values of both actions are identical. This implies that either action (moving the cart left or right) will lead to a similar outcome. In scenario 2 (see Figure 3(D)-2), wherein the pole is unbalanced to the left side, the difference between the predicted Q-values increases. Specifically, the Q-value for applying a unit force on the right side of the cart is higher, which causes the cart to move to the left. Pushing the cart to the left in turn causes the pole to swing back right towards the balanced position. Similarly, in scenario 3 (see Figure 3(D)-3), wherein the pole is unbalanced to the right side, the Q-value is higher for applying a unit force on the left side of the cart, which causes the cart to move right and enables the pole to swing left towards the balanced position. This visually demonstrates the ability of the LSM (agent) to successfully balance the pole by pushing the cart appropriately to the left or right based on the learnt Q-values. In order to comprehensively validate the efficacy of the high-dimensional environment representations provided by the liquid, we train the LSM to play a game of Pacman [23]. The objective of the game is to control Pacman (yellow in color) to capture all the food (represented by small white dots) in a grid without being eaten by the ghosts, as illustrated in Figure 4. The ghosts always hunt Pacman; however, cherries (represented by large white dots) make the ghosts temporarily scared of Pacman so that they run away. The game environment returns a unit reward whenever Pacman consumes food, a cherry, or a scared ghost (white in
color). The game environment also returns a unit reward and restarts when all foods are captured. We use the locations of Pacman, food, cherry, ghost, and scared ghost as the environment state representation. The location of each object is encoded as a two-dimensional binary array whose dimensions match those of the Pacman grid, as shown in Figure 4. The binary intermediate representations of all the objects are then concatenated and flattened into a single vector that is fed to the input layer of the LSM. The LSM configurations and game settings used for the Pacman experiments are summarized in Table 1, where each game setting has a different degree of complexity with regards to the Pacman grid size and the number of foods, ghosts, and cherries. In the first experiment, we use a 7 × 7 grid with 3 foods for Pacman to capture and a single ghost to prevent it from achieving its objective. Thus, the maximum possible accumulated reward at the end of a successful game is 4. Figure 5(A) shows that the median accumulated reward gradually increases with the number of training epochs and converges close to the maximum possible reward, thereby validating the capability of the liquid to provide useful high-dimensional representations of the environment state necessary for efficient training of the readout weights using the presented methodology. Interestingly, in the second experiment using a larger 7 × 17 grid, we find that the median reward converges to 12, which is greater than the number of foods available in the grid. This indicates that the LSM not only learns to capture all the foods but also learns to capture the cherry and the scared ghosts, further increasing the accumulated reward since consuming a scared ghost results in a unit immediate reward. In the final experiment, we train the LSM to control Pacman in a 17 × 19 grid with sparsely dispersed foods. We find that the larger grid requires more exploration and training steps for the agent to perform well and 
achieve the maximum possible reward, resulting in a learning curve that is less steep compared to those obtained for the smaller grid sizes in the earlier experiments, as shown in Figure 5(C).
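The binary state encoding described above (one two-dimensional binary grid per object class, flattened and concatenated) can be sketched as follows. The class names, grid size, and object positions are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

# Sketch of the Pacman state encoding: each object class gets its own binary
# grid matching the board, and the grids are flattened and concatenated into
# the LSM input vector. Names and positions below are illustrative.
GRID_H, GRID_W = 7, 7
OBJECT_CLASSES = ["pacman", "food", "cherry", "ghost", "scared_ghost"]

def encode_state(object_positions):
    """object_positions: dict mapping class name -> list of (row, col)."""
    planes = []
    for cls in OBJECT_CLASSES:
        plane = np.zeros((GRID_H, GRID_W), dtype=np.uint8)
        for r, c in object_positions.get(cls, []):
            plane[r, c] = 1          # mark each occupied cell
        planes.append(plane.flatten())
    return np.concatenate(planes)    # length = 5 * GRID_H * GRID_W = 245

state = encode_state({"pacman": [(3, 3)],
                      "food": [(0, 0), (0, 6), (6, 6)],
                      "ghost": [(5, 1)]})
```

For the 7 × 7 grid this yields a 245-dimensional sparse binary input vector; absent classes (here, cherries and scared ghosts) simply contribute all-zero planes.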
Finally, we plot the average of the Q-values produced by the LSM as Pacman navigates the grid to visualize the correspondence between the learnt Q-values and the environment state. As discussed in subsection 3.2, each Q-value produced by the LSM provides a measure of how good a particular action is for a given environment state. The Q-value averaged over the set of all possible actions (known as the state-value function) thus indicates the value of being in a certain state. Figure 5(D) illustrates the state-value function while playing the Pacman game in a 7 × 17 grid. The predicted state-value starts at a relatively high level because foods are abundant in the grid and the ghosts are far away from Pacman (see Figure 5(D)-1). The state-value gradually decreases as Pacman navigates through the grid and gets closer to the ghosts. The predicted state-value then shoots up after Pacman consumes a cherry and makes the ghosts temporarily consumable (see Figure 5(D)-2), leading to potential additional reward. The predicted state-value drops after the ghosts are reborn (see Figure 5(D)-3). Finally, we observe a slight increase in the state-value towards the end of the game when Pacman is closer to the last food after consuming a cherry (see Figure 5(D)-4). It is interesting to note that although the scenario in Figure 5(D)-4 is similar to that in Figure 5(D)-2, the state-value is smaller since the expected accumulated reward at this step is at most 3, assuming that Pacman can capture both the scared ghost and the last food. On the other hand, in the environment state shown in Figure 5(D)-2, the expected accumulated reward is greater than 3 since 4 foods and 2 scared ghosts are available for Pacman to capture.
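The state-value quantity used in this visualization is simply the Q-value averaged over all possible actions, which can be written in two lines (the example Q-values here are made up):

```python
import numpy as np

# State-value as the Q-value averaged over the set of all possible actions,
# the quantity plotted in Figure 5(D). Example values are illustrative.
def state_value(q_values):
    return float(np.mean(q_values))

v = state_value(np.array([2.0, 4.0, 3.0]))  # average of the three action values
```

Note that this averaged definition differs from the greedy value max_a Q(s, a); the mean is what the visualization described above uses.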

Learning to Play Atari Games
Finally, we train the LSM using the presented methodology to play Atari games [24], which are widely used to benchmark deep reinforcement learning networks. We arbitrarily select 4 games for evaluation, namely, Boxing, Gopher, Freeway, and Krull. We use the RAM of the Atari machine, which stores 128 bytes of information about an Atari game, as a representation of the environment [24]. During training, we modify the reward structure of the game by clipping all positive immediate rewards to 1 and all negative immediate rewards to −1. However, we do not clip the immediate reward during testing and measure the actual accumulated reward following [19]. For all selected Atari games, we model the agent using an LSM containing 500 liquid neurons and 128 hidden neurons. The number of output neurons varies for each game since the number of possible actions differs. The maximum Poisson firing rate for the input neurons is set to 100 Hz. The LSM is trained for 5 × 10^3 steps. Figure 6 illustrates that the LSM learnt the optimal strategies to play Boxing and Krull without any prior knowledge of the rules, leading to high accumulated reward towards the end of the training. However, learning in Gopher and Freeway progresses relatively slowly. For detailed evaluation, we compare the median accumulated reward obtained from playing with the trained LSM to the average accumulated reward obtained from playing with random actions for 1 × 10^5 steps. We also compare the accumulated reward with that reported for human players in [19]. Table 2 shows that the LSM achieves a better score than human players on Boxing and Krull, and comparable, albeit lower, scores on Freeway and Gopher.
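The reward shaping used during Atari training reduces to a small helper: every positive immediate reward becomes +1 and every negative one becomes −1, while evaluation keeps the raw reward. The function name is our own.

```python
# Reward shaping as described above: positive immediate rewards are clipped
# to +1, negative ones to -1; at test time the raw reward is kept.
def clip_reward(r, training=True):
    if not training:
        return r                 # evaluation uses the unclipped reward
    if r > 0:
        return 1.0
    if r < 0:
        return -1.0
    return 0.0
```

This keeps the magnitude of the temporal-difference error bounded across games with very different raw score scales, which is why it is only applied during training.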

Discussion
LSM, an important class of biologically plausible recurrent SNNs, has thus far been primarily demonstrated for pattern (speech/image) recognition [12,13], gesture recognition [14,15], and sequence generation tasks [16,17,18] using standard datasets. To the best of our knowledge, our work is the first demonstration of LSMs, trained using a Q-learning based methodology, on complex RL tasks like Pacman and Atari games commonly used to evaluate deep reinforcement learning networks. The benefits of the proposed LSM-based RL framework over state-of-the-art deep learning models are two-fold. First, the LSM entails fewer trainable parameters as a result of using fixed input-to-liquid and recurrent-liquid synaptic connections. However, this requires careful initialization of the respective matrices for efficient training of the liquid-to-readout weights, as experimentally validated in section 4. We note that the stability of LSMs could be further enhanced by training the recurrent weights using localized Spike Timing Dependent Plasticity (STDP) based learning rules [40,41,42], which incur lower computational complexity compared to the backpropagation-through-time algorithm [43,12] used for training recurrent SNNs. Second, LSMs can be efficiently implemented on event-driven neuromorphic hardware like IBM TrueNorth [28] or Intel Loihi [29], leading to potentially much improved energy efficiency while achieving comparable performance to deep learning models on the chosen benchmark tasks. Note that the readout layer in the presented LSM needs to be implemented outside the neuromorphic fabric since it is composed of artificial rate-based neurons that are typically not supported in neuromorphic hardware realizations. Alternatively, a readout layer composed of spiking neurons could be used and trained using spike-based error backpropagation algorithms [44,45,46,47,48,18]. Future work could also explore STDP-based reinforcement learning rules [49,50,51,52] to render the training algorithm amenable to neuromorphic hardware implementations.

Conclusion
Liquid State Machine (LSM) is a bio-inspired recurrent spiking neural network composed of an input layer sparsely connected to a randomly interlinked liquid of spiking neurons for the real-time processing of spatio-temporal inputs. In this work, we proposed LSMs, trained using the presented Q-learning based methodology, for solving complex Reinforcement Learning (RL) tasks like playing Pacman and Atari games that have hitherto been used to benchmark deep reinforcement learning networks. We presented initialization strategies for the fixed input-to-liquid and recurrent-liquid synaptic connectivity matrices and weights to enable the liquid to produce useful high-dimensional representations of the environment state necessary for efficient training of the liquid-to-readout weights. We demonstrated the significance of the liquid's inherent capability to produce rich representations by training the LSM to successfully balance a cartpole. Our experiments on the Pacman game showed that the LSM learns the optimal strategies for different game settings and grid sizes.
Our analyses on a subset of Atari games indicated that the LSM achieves scores comparable to those reported for human players in existing works.

Weight decay for RMSProp algorithm: 0. Smoothing constant for RMSProp algorithm: 0.99.

Figure 2 :
Figure 2: Metrics for guiding hyperparameter tuning: (A) Eigenvalue spectrum of the recurrent-liquid connectivity matrix for an LSM containing 500 liquid neurons. The LSM is initialized with the synaptic weights listed in Table 3 based on hyperparameter C=4. All eigenvalues in the spectrum lie within the unit circle. (B) Eigenvalue spectrum of the recurrent-liquid connectivity matrix initialized with synaptic weights β_E→E = 0.4, β_E→I = 0.1, and β_I→E = 0.1. Many eigenvalues in the spectrum lie outside the unit circle. (C) Evolution of the membrane potentials of 10 randomly picked excitatory neurons in the liquid initialized with the synaptic weights listed in Table 3 based on hyperparameter C=4. A random state representation from the cartpole-balancing problem is used as the input.
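The stability criterion in this figure (all eigenvalues inside the unit circle, so that liquid activity fades rather than diverges) can be checked numerically. The sparse random excitatory/inhibitory matrix below is an illustrative construction under our own assumptions (an 80/20 E/I split, ~10% connectivity, and weights chosen to balance excitation and inhibition), not the paper's exact initialization from Table 3.

```python
import numpy as np

# Sketch of the stability check in Figure 2(A): compute the spectral radius
# of a sparse random E/I recurrent connectivity matrix. The construction is
# illustrative, not the paper's exact Table 3 initialization.
rng = np.random.default_rng(1)
n_exc, n_inh = 400, 100              # assumed 80/20 excitatory/inhibitory split
n = n_exc + n_inh

W = np.zeros((n, n))
W[rng.random((n, n)) < 0.1] = 1.0    # ~10% random connectivity
W[:, :n_exc] *= 0.01                 # excitatory synaptic weights
W[:, n_exc:] *= -0.04                # inhibitory weights scaled to balance E/I

spectral_radius = max(abs(np.linalg.eigvals(W)))
```

If `spectral_radius` exceeds 1, as in panel (B)'s initialization, recurrent activity can grow without bound and the liquid loses its fading-memory property.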

Figure 3 :
Figure 3: (A) Illustration of the cartpole-balancing task wherein the agent has to balance a pole attached to a wheeled cart that moves freely on a rail of certain length. (B) The median accumulated reward per epoch obtained by the LSM trained across 10 different random seeds for the cartpole-balancing task. The shaded region in the plot represents the 25th to 75th percentile of the accumulated reward over multiple random seeds. (C) The median accumulated reward per epoch from cartpole training across 10 different random seeds in which the LSM is initialized to have sparser connectivity between the liquid neurons compared to that used for the experiment in (B). (D) Visualization of the learnt Q (action-value) function for the cartpole-balancing task at three different game-steps designated as 1, 2, and 3. The angle of the pole is written on the left side of each figure. A negative angle represents a pole unbalanced to the left and a positive angle a pole unbalanced to the right. The black arrow corresponds to a unit force on the left or right side of the cart depending on which Q-value is larger.

Figure 4 :
Figure 4: Illustration of a snapshot from the Pacman game that is translated into 5 two-dimensional binary representations corresponding to the locations of Pacman, foods, cherries, ghosts, and scared ghosts. The binary intermediate representations are then flattened and concatenated to obtain the environment state representation.

Figure 5 :
Figure 5: Median accumulated reward per epoch obtained by training and evaluating the LSM in 3 different game settings: (A) grid size 7 × 7, (B) grid size 7 × 17, and (C) grid size 17 × 19. The LSM is initialized and trained with 7 different initial random seeds. The shaded region represents the 25th to 75th percentile of the accumulated reward over multiple seeds. (D) The plot on the left shows the predicted state-value function for 80 consecutive Pacman game steps. The four snapshots from the Pacman game shown on the right correspond to the game steps designated as 1, 2, 3, and 4, respectively, in the state-value plot.

Figure 6 :
Figure 6: Median accumulated reward per epoch obtained by training and evaluating the LSM on 4 selected Atari games: (A) Boxing, (B) Freeway, (C) Gopher, and (D) Krull. For each game, the LSM is initialized and trained with 5 different initial random seeds. The shaded region represents the 25th to 75th percentile of the accumulated reward over multiple seeds.

Table 1 :
LSM configuration and game settings for the different Pacman experiments reported in this work. Columns: grid size, number of ghosts, foods, and cherries, training steps, liquid neurons, and hidden neurons.

Table 2 :
Comparison between the median accumulated reward over multiple random seeds, the average accumulated reward from playing with random actions, and the accumulated reward from the human game tester reported in [19]. The best median accumulated reward over the last 10 training epochs is reported for each game.

Table 3 :
Synaptic weight initialization parameters for the fixed LSM connections.

Table 4 :
Leaky-Integrate-and-Fire (LIF) model parameters for the liquid neurons.