
Edited by: Emre O. Neftci, University of California, Irvine, United States

Reviewed by: Sadique Sheik, AiCTX AG, Switzerland; Arash Ahmadi, University of Windsor, Canada

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

†These authors have contributed equally to this work

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We propose reinforcement learning on simple networks consisting of random connections of spiking neurons (both recurrent and feed-forward) that can learn complex tasks with very few trainable parameters. Such sparse and randomly interconnected recurrent spiking networks exhibit highly non-linear dynamics that transform the inputs into rich high-dimensional representations based on the current and past context. These random input representations can be efficiently interpreted by an output (or readout) layer with trainable parameters. Systematic initialization of the random connections and training of the readout layer using the Q-learning algorithm enable such small random spiking networks to learn optimally and achieve the same learning efficiency as humans on complex reinforcement learning (RL) tasks like Atari games. In fact, the sparse recurrent connections cause these networks to retain a fading memory of past inputs, thereby enabling them to perform temporal integration across successive RL time-steps and learn with partial state inputs. This spike-based approach using small random recurrent networks provides a computationally efficient alternative to state-of-the-art deep reinforcement learning networks with several layers of trainable parameters.

A high degree of recurrent connectivity among neuronal populations is a key attribute of neural microcircuits in the cerebral cortex and many other brain regions (Douglas et al.,

In this work, we propose such sparse randomly-interlinked low-complexity LSMs for solving complex Reinforcement Learning (RL) tasks, which involve an autonomous agent (modeled using the LSM) trained to select actions in a manner that maximizes the expected future rewards received from the environment. For instance, a robot (agent) learning to navigate a maze (environment) based on the reward and punishment received from the environment is an example RL task. The environment state (converted to spike trains) is fed to the liquid, which produces a high-dimensional representation based on current and past inputs. The sparse recurrent connections enable the liquid to retain a decaying memory of past input representations and perform temporal integration across different RL time-steps. We present an optimal initialization strategy for the fixed input-to-liquid and recurrent-liquid connectivity matrices and weights to enable the liquid to produce high-dimensional representations that lead to efficient training of the liquid-to-readout weights. Artificial rate-based neurons in the readout layer take the liquid activations and produce

A Liquid State Machine (LSM) consists of an input layer sparsely connected via fixed synaptic weights to a randomly interlinked liquid of spiking neurons, followed by a readout layer, as depicted in

where v_{i} is the membrane potential of the i-th liquid neuron, v_{rest} is the resting potential to which v_{i} decays, with time constant τ, in the absence of input current, and I_{i}(t) is the net instantaneous current into the neuron. N_{P}, N_{E}, and N_{I} are the number of input, excitatory, and inhibitory neurons, respectively. The instantaneous current is a sum of three terms: current from input neurons, current from excitatory neurons, and current from inhibitory neurons. The first term integrates the sum of pre-synaptic input spikes, denoted by δ(t − t_{l}), where t_{l} is the time instant of the pre-spikes, with the corresponding synaptic weights (w_{li} in Equation 3). Likewise, the second (third) term integrates the sum of pre-synaptic spikes from the excitatory (inhibitory) neurons, denoted by δ(t − t_{j}) (δ(t − t_{k})), with the respective weights w_{ji} (w_{ki}) in Equation 3. The neuronal membrane potential is updated with the sum of the input, excitatory, and negative inhibitory currents as shown in Equation 1. When the membrane potential reaches a certain threshold v_{thres}, the neuron fires an output spike. The membrane potential is thereafter reset to v_{reset}, and the neuron is restrained from spiking for an ensuing refractory period by holding its membrane potential constant. The LIF model hyperparameters for the excitatory and inhibitory neurons are listed in
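The discretized LIF update described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name and array layout are our own, and the defaults follow the LIF parameter table (v_rest = v_reset = 0, v_thres = 0.5, τ = 20 ms, a 1 ms refractory period, and a 1 ms time-step).

```python
import numpy as np

def lif_step(v, i_in, i_exc, i_inh, refrac_count,
             v_rest=0.0, v_reset=0.0, v_thres=0.5,
             tau=20.0, t_refrac=1, dt=1.0):
    """One 1 ms update of a population of LIF neurons.

    v: membrane potentials; i_in / i_exc / i_inh: summed weighted input,
    excitatory, and inhibitory currents at this time-step;
    refrac_count: remaining refractory steps per neuron.
    """
    # A neuron only integrates when it is not refractory.
    active = refrac_count <= 0
    # Leak toward v_rest, then add net current (inhibition enters negatively).
    dv = (-(v - v_rest) / tau + (i_in + i_exc - i_inh)) * dt
    v = np.where(active, v + dv, v)
    # Threshold crossing fires a spike, then reset + refractory hold.
    spikes = active & (v >= v_thres)
    v = np.where(spikes, v_reset, v)
    refrac_count = np.where(spikes, t_refrac, refrac_count - 1)
    return v, spikes, refrac_count
```

Holding the membrane potential constant during the refractory period (rather than letting it leak) mirrors the description in the text.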

Synaptic weight initialization parameters for the fixed LSM connections for learning to balance cartpole, play Pacman, and play Atari game.

[0, 0.6]

[0, 0.05]

[0, 0.25]

[0, 0.3]

[0, 0.01]

Illustration of the LSM architecture, consisting of an input layer sparsely connected via fixed synaptic weights to a randomly and recurrently connected reservoir (or liquid) of excitatory and inhibitory spiking neurons, followed by a readout layer composed of artificial rate-based neurons.

Leaky-Integrate-and-Fire (LIF) model parameters for the liquid neurons.

v_{rest} | 0

v_{reset} | 0

v_{thres} | 0.5

τ | 20 ms

τ_{refrac} | 1 ms

Δt | 1 ms

There are four types of recurrent synaptic connections in the liquid, namely, excitatory→excitatory (E→E), excitatory→inhibitory (E→I), inhibitory→excitatory (I→E), and inhibitory→inhibitory (I→I). We create E→E connections (specified by the connectivity matrix C_{EE}) only at locations where elements in the product of the connectivity matrices for the E→I and I→E connections (C_{EI} and C_{IE}, respectively) are non-zero. This ensures that excitatory synaptic connections are created only for those neurons that also receive inhibitory synaptic connections, which mitigates the possibility of continuous positive drift in the respective membrane potentials. To circumvent the second situation, we force the diagonal elements of C_{EE} to be zero and eliminate the possibility of repeated self-excitation. Throughout this work, we create the recurrent connectivity matrix for the liquid starting from randomly generated C_{EI} and C_{IE}. Similarly, the connectivity matrix for the I→I connections (C_{II}) is initialized based on the product of C_{IE} and C_{EI}. The connection weights are initialized from a uniform distribution between 0 and β as shown in
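A minimal sketch of this connectivity construction, assuming Bernoulli-sampled E→I and I→E masks; the function name, connection probabilities, and β defaults below are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_liquid_weights(n_exc, n_inh, p_ei=0.3, p_ie=0.3,
                        beta_ee=0.05, beta_ei=0.25,
                        beta_ie=0.3, beta_ii=0.01):
    """Build the four recurrent weight matrices of the liquid.

    E->E (I->I) connections are allowed only where the product
    C_EI @ C_IE (C_IE @ C_EI) is non-zero, and self-connections
    on the E->E diagonal are removed.
    """
    c_ei = rng.random((n_exc, n_inh)) < p_ei           # E -> I mask
    c_ie = rng.random((n_inh, n_exc)) < p_ie           # I -> E mask
    c_ee = (c_ei.astype(int) @ c_ie.astype(int)) > 0   # E -> E mask
    np.fill_diagonal(c_ee, False)                      # no self-excitation
    c_ii = (c_ie.astype(int) @ c_ei.astype(int)) > 0   # I -> I mask

    def draw(mask, beta):
        # Weights uniform in [0, beta] on the allowed connections only.
        return np.where(mask, rng.uniform(0.0, beta, mask.shape), 0.0)

    return draw(c_ee, beta_ee), draw(c_ei, beta_ei), \
           draw(c_ie, beta_ie), draw(c_ii, beta_ii)
```

Because C_EE is derived from the product C_EI·C_IE, every excitatory neuron that excites its peers is guaranteed to sit on at least one disynaptic inhibitory loop, which is exactly the drift-mitigation argument made in the text.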

The liquid-excitatory neurons are fully-connected to artificial rate-based neurons in the readout layer for inference. The readout layer, which consists of as many output neurons as the number of actions for a given RL task, uses the average firing rate/activation of the excitatory neurons to predict the Q-value for every state-action pair. We translate the liquid spiking activity to average rates by accumulating the excitatory neuronal spikes over the time period for which the input (current environment state) is presented. We then normalize the spike counts with the maximum possible spike count over the LSM-simulation period, which is computed as the LSM-simulation period divided by the simulation time-step, to obtain the average firing rates of the excitatory neurons that are fed to the readout layer. Since the number of excitatory neurons is larger than the number of output neurons in the readout layer, we gradually reduce the dimension by introducing an additional fully-connected hidden layer between the liquid and the output layer. We use ReLU non-linearity (Nair and Hinton,
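The spike-count-to-rate normalization above amounts to one line; a sketch with hypothetical names, where the maximum possible count equals the number of simulation steps (T_sim / Δt):

```python
import numpy as np

def liquid_rates(spike_record):
    """Average firing activations fed to the readout layer.

    spike_record: boolean array [T_steps, n_exc] of excitatory spikes
    collected over the LSM-simulation period. Counts are normalized by
    the maximum possible spike count, T_sim / dt = T_steps.
    """
    counts = spike_record.sum(axis=0)
    max_count = spike_record.shape[0]  # at most one spike per time-step
    return counts / max_count          # activations in [0, 1]
```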

At any time instant t, the agent observes the environment state s_{t} and picks an action a_{t} from the set of all possible actions. After the environment receives the action a_{t}, it transitions to the next state based on the chosen action and feeds back an immediate reward r_{t+1} and the new environment state s_{t+1}. As mentioned in the beginning, the goal of the agent is to maximize the accumulated reward in the future, which is mathematically expressed as R_{t} = ∑_{k=0}^{∞} γ^{k} r_{t+k+1},

where γ ∈ [0, 1] is the discount factor that determines the relative significance attributed to immediate and future reward. If γ is chosen to be 0, the agent maximizes only the immediate reward. However, as γ approaches unity, the agent learns to maximize the accumulated reward in the future. Q-learning (Watkins and Dayan,
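The effect of γ can be made concrete with a small helper (illustrative, not from the paper) that evaluates the discounted return R_t = Σ_k γ^k r_{t+k+1} for a finite episode:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1} for a finite reward sequence."""
    r_total = 0.0
    # Fold the sum from the end of the episode backward:
    # R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        r_total = r + gamma * r_total
    return r_total

# gamma = 0 keeps only the immediate reward;
# gamma near 1 weighs the whole future almost equally.
```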

where Q_{π}(s, a) is the expected accumulated reward for taking action a in state s and following policy π thereafter. The goal of Q-learning is to find the optimal policy (π_{*}) such that

where Q_{π*}(s_{t}, a_{t}) is the optimal Q-value for taking action a_{t} in state s_{t}, r_{t+1} is the immediate reward received from the environment, and Q_{π*}(s_{t+1}, a_{t+1}) is the Q-value for selecting action a_{t+1} from the next environment state s_{t+1}. Learning the Q-values for all possible state-action pairs is intractable for practical RL applications. Popular approaches approximate the Q-function using deep convolutional neural networks (Lillicrap et al.,

In this work, we model the agent using an LSM, wherein the liquid-to-readout weights are trained to approximate the Q-function as described below. At any time instant, we map the environment state s_{t} to input neurons firing at a rate constrained between 0 and ϕ Hz over a certain time period (denoted by T_{LSM}) following a Poisson process. The maximum Poisson firing rate ϕ is tuned to ensure sufficient input spiking activity for a given RL task. We follow the method outlined in Heeger to generate Poisson spike trains over T_{LSM} with 1 ms separation between consecutive LSM-simulation time-steps. The probability of producing a spike at any LSM-simulation time-step is obtained by dividing the corresponding firing rate (in Hz) by 1,000. We generate a random number drawn from a uniform distribution between 0 and 1, and produce a spike if the random number is less than the neuronal spiking probability. At every LSM-simulation time-step, we feed the spike map of the current environment state and record the spiking outputs of the liquid-excitatory neurons. We accumulate the excitatory neuronal spikes and normalize the individual neuronal spike counts with the maximum possible spike count over the LSM-simulation period to obtain the high-dimensional representation (activation) of the environment state, as discussed in Section 2.1. Note that the liquid state variables, such as the neuronal membrane potentials, are not reset between successive RL time-steps, so that some information from the past environment representations is still retained. The capability of the liquid to retain a decaying memory of past representations enables it to perform temporal integration over different RL time-steps, such that the high-dimensional representation provided by the liquid for the current environment state also depends on the decaying memory of past environment representations. However, it is important to note that appropriate initialization of the LSM (detailed in Section 2.1) is necessary to obtain useful high-dimensional representations for efficient training of the liquid-to-readout weights, as experimentally validated in Section 3.
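The Poisson encoding just described (per-step spike probability = rate/1,000 at a 1 ms time-step) can be sketched as follows; the function name and defaults are our own, and the state is assumed to be pre-normalized to [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_spikes(state, phi_hz=100.0, t_lsm_ms=100, dt_ms=1.0):
    """Encode a normalized state vector as Poisson spike trains.

    Firing rate = state * phi_hz (so rates are capped at phi_hz);
    the per-time-step spike probability is rate * dt / 1000, and a
    spike is emitted when a uniform draw falls below it.
    """
    rates = np.asarray(state, dtype=float) * phi_hz   # Hz
    p_spike = rates * (dt_ms / 1000.0)                # probability per step
    steps = int(t_lsm_ms / dt_ms)
    return rng.random((steps, rates.size)) < p_spike  # [T, n_inputs] bool
```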

The high-dimensional liquid activations are fed to the readout layer that is trained using backpropagation to approximate the Q-function by minimizing the mean square error between the Q-values predicted by the readout layer and the target Q-values following (Mnih et al.,

where θ_{t+1} and θ_{t} are the updated and previous synaptic weights in the readout layer, respectively, η is the learning rate, Q(s_{t}, a_{t}|θ_{t}) is the vector representing the Q-values predicted by the readout layer for all possible actions given the current environment state s_{t} using the previous readout weights, ∇_{θt}Q(s_{t}, a_{t}|θ_{t}) is the gradient of the Q-values with respect to the readout weights, and y_{t} is the vector containing the target Q-values, which is obtained by feeding the next environment state s_{t+1} to the LSM while using the previous readout weights. To encourage exploration during training, we follow an ϵ-greedy policy (Watkins). We store the experience tuples (s_{t}, a_{t}, r_{t}, and s_{t+1}) in a large table and later train the LSM by sampling mini-batches of experiences in a random manner over multiple training epochs, leading to improved generalization performance. For all the experiments reported in this work, we use the RMSProp algorithm (Tieleman and Hinton,
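A framework-free sketch of this update: for brevity it uses a purely linear readout and plain gradient descent in place of the hidden ReLU layer and RMSProp used in the paper, and takes the max over next-state Q-values as the Q-learning target. All names are hypothetical.

```python
import numpy as np

def q_target(theta, x_next, reward, gamma=0.95, done=False):
    """Scalar target for the taken action:
    r_{t+1} + gamma * max_a Q(s_{t+1}, a), computed with the
    previous readout weights theta."""
    q_next = x_next @ theta
    return reward + (0.0 if done else gamma * np.max(q_next))

def train_step(theta, batch, eta=2e-4, gamma=0.95):
    """One mean-squared-error gradient step on a linear readout
    Q(s, .) = x @ theta, where x is the liquid activation vector.

    batch: iterable of (x, action, reward, x_next, done) experience
    tuples sampled at random from the replay table.
    """
    grad = np.zeros_like(theta)
    for x, action, reward, x_next, done in batch:
        q = x @ theta
        y = q.copy()                 # non-taken actions get zero error
        y[action] = q_target(theta, x_next, reward, gamma, done)
        grad += np.outer(x, q - y) / len(batch)  # dMSE/dtheta (up to 2x)
    return theta - eta * grad
```

Copying the network's own predictions into y for the non-taken actions is the standard trick for making the vector-valued update equivalent to a per-action scalar regression.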

Q-learning simulation parameters.

Readout weights update frequency | Once every game-step |

Warm up steps before training begins | 100 |

Batch size for experience replay | 32 |

Experience replay buffer size | 1 × 10^{6} |

Discount factor | 0.95 |

Initial exploration probability during training | 1 |

Final exploration probability during training (Cartpole) | 1 × 10^{−3} |

Final exploration probability during training (Pacman & Atari) | 1 × 10^{−1} |

Exploration probability during evaluation (Cartpole & Atari) | 5 × 10^{−2} |

Exploration probability during evaluation (Pacman) | 0 |

Learning rate for RMSProp algorithm | 2 × 10^{−4} |

Term added to denominator for RMSProp algorithm | 1 × 10^{−6} |

Weight decay for RMSProp algorithm | 0 |

Smoothing constant for RMSProp algorithm | 0.99 |

We first present results motivating the importance of careful LSM initialization for obtaining rich high-dimensional state representations, which are necessary for efficient training of the liquid-to-readout weights. We then demonstrate the utility of the sparse recurrent-liquid synaptic connections and of careful LSM initialization using the classic cartpole-balancing RL task (Sutton and Barto,

Initializing LSM with appropriate hyperparameters is an important step to construct a model that produces useful high-dimensional representations. Since the input-to-liquid and recurrent-liquid connectivity matrices of the LSM are fixed

Spiking activity of the liquid is said to be stable if every finite stream of inputs results in a finite period of response. Sustained activity indicates that small input noise can perturb the liquid state and lead to chaotic activity that is no longer dependent on the input stimuli. It is impractical to analyze the stability of the liquid for all possible input streams within a finite time. We investigate the liquid stability by feeding in random input stimuli and sampling the excitatory neuronal spike counts at regular time intervals over the LSM-simulation period for different values of

Analyzing the eigenvalue spectrum of the recurrent connectivity matrix is a common tool for assessing the stability of the liquid. Each eigenvalue in the spectrum represents an individual mode of the liquid. The real part indicates the decay rate of the mode, while the imaginary part corresponds to the frequency of the mode (Rajan et al.,
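The eigenvalue-spectrum check can be sketched as follows, assuming the signed recurrent weight matrix of the liquid (with inhibitory weights negated) has been assembled into a single array; the function name and stability criterion (spectral radius at most 1) are our own shorthand for the unit-circle test described above.

```python
import numpy as np

def spectral_check(w_recurrent):
    """Eigenvalue spectrum of the full recurrent weight matrix.

    Eigenvalues outside the unit circle correspond to modes that grow
    rather than decay, a warning sign of sustained/chaotic activity.
    Returns the spectral radius and a boolean stability flag.
    """
    eig = np.linalg.eigvals(w_recurrent)
    radius = np.abs(eig).max()
    return radius, radius <= 1.0
```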

Metrics for guiding hyperparameter tuning: β_{E→E} = 0.4, β_{E→I} = 0.1, and β_{I→E} = 0.1. Many eigenvalues in the spectrum are outside the unit circle.

Cartpole-balancing is a classic control problem wherein the agent has to balance a pole attached to a wheeled cart that can move freely on a rail of certain length as shown in

We model the agent using an LSM containing 150 liquid neurons, 32 hidden neurons in the fully-connected layer between the liquid and output layer, and two output neurons. The maximum firing rate for the input neurons representing the environment state is set to 100 Hz, and each input is presented for 100 ms. The LSM is trained for 10^{5} time-steps, which are equally divided into 100 training epochs containing 1,000 time-steps per epoch. After each epoch, the LSM is evaluated for 1,000 time-steps with the probability of choosing a random action ϵ set to 0.05. Note that the LSM is evaluated for 1,000 time-steps (multiple gameplays) even though a single gameplay lasts a maximum of only 200 time-steps, as mentioned in the previous paragraph. We use the accumulated reward averaged over multiple gameplays as the true indicator of the LSM (agent) performance to account for the randomness in action-selection resulting from the ϵ-greedy policy. We train the LSM initialized with 10 different random seeds and obtain the median accumulated reward as shown in

To visualize the learnt action-value function guiding action selection, we compare Q-values produced by the LSM during evaluation in three different scenarios depicted in

In this sub-section, we demonstrate the capability of the LSM to learn without complete state information, thereby validating its ability to perform temporal integration across different RL game steps, enabled by the sparse random recurrent connections. Specifically, we modify the previous cartpole-balancing task such that the agent only receives the cart position and the pole angle, designated by the tuple (χ, φ), as input, while the velocity information is withheld. The objective is to determine whether the decaying memory of the past cart position and pole angle retained by the liquid, as a result of the recurrent-liquid connectivity, enables the LSM to make better decisions without the velocity information. We clip (χ, φ) to be within the range specified by (±2.5, ±0.28), as in the previous experiment; however, each real-valued state input is mapped to only one input neuron whose firing rate is proportional to the normalized state value. A positive state input causes the corresponding neuron to fire unit positive spikes. On the other hand, if the state input is negative, the input neuron fires unit negative spikes at a rate proportional to the absolute value of the input, as described in Sengupta et al. (

Synaptic weight initialization parameters for learning to balance cartpole without complete state information.

[0, 0.4]

[0, 0.4]

[0, 0.4]

[0, 0.4]

[0, 0.01]

We model the agent using an LSM with 150 liquid neurons followed by a fully-connected layer with 32 hidden neurons and a final output layer with two neurons, similar to the architecture used for the previous cartpole-balancing experiment. Additional feedback connections with a large delay of 20 ms are introduced between excitatory neurons to achieve long-term temporal integration over RL time-steps. In this experiment, we also reduce the LSM simulation period to 20 ms from the 100 ms used in the previous experiment to precisely validate the long-term temporal integration capability of the liquid. The LSM is trained for a total of 5 × 10^{6} time-steps, which is sufficiently long to guarantee no further improvement in performance. Without complete state information, the LSM achieves a best median accumulated reward of 70.93 over the last 10 epochs, as illustrated in

In order to comprehensively validate the efficacy of the high-dimensional environment representations provided by the liquid, we train the LSM to play a game of Pacman (DeNero et al.,

Illustration of a snapshot from the Pacman game that is translated into 5 two-dimensional binary representations corresponding to the location of Pacman, foods, cherries, ghosts, and scared ghosts. The binary intermediate representations are then flattened and concatenated to obtain the environment state representation.

The LSM configurations and game settings used for Pacman experiments are summarized in

LSM configuration and game settings for different Pacman experiments reported in this work.

7 × 7 | 1 | 3 | 0 | 5 × 10^{5} | 500 | 128

7 × 17 | 2 | 6 | 2 | 5 × 10^{5} | 2,000 | 512

17 × 19 | 1 | 6 | 0 | 3 × 10^{6} | 3,000 | 512

Median accumulated reward per epoch obtained by training and evaluating the LSM on three different game settings:

Finally, we plot the average of the Q-values produced by the LSM as the Pacman navigates the grid to visualize the correspondence between the learnt Q-values and the environment state. As discussed in Section 2.2, each Q-value produced by the LSM provides a measure of how good a particular action is for a given environment state. The Q-value averaged over the set of all possible actions (known as the state-value function) thus indicates the value of being in a certain state.

Finally, we train the LSM using the presented methodology to play Atari games (Brockman et al., ^{3} steps.

10^{5} steps. Note that the median accumulated reward used for comparison is the highest reward achieved during the evaluation phase over the last 10 training epochs.

Median accumulated reward per epoch obtained by training and evaluating the LSM on 4 selected Atari games:

Median accumulated reward for each game is chosen from the highest median accumulated reward over the last 10 training epochs across five different initial random seeds.

Boxing | 20.2 | 0.8 | 4.3 | 68.2 |

Freeway | 19.8 | 0.0 | 29.6 | 21.6 |

Gopher | 611.1 | 279.3 | 2,321 | 1,443 |

Krull | 3,686 | 1,590 | 2,395 | 4,672 |

10^{5} steps, which is a sufficiently large number for the average accumulated reward to be stable. Accumulated reward from human players reported in Mnih et al. (

Convolutional deep learning network architecture used in Atari experiments.

One-dimensional convolutional | 4 | 4 | 2 | 1 | |

One-dimensional convolutional | 16 | 4 | 2 | 1 | |

Fully connected | 128 | ||||

Fully connected | 3 for Freeway; 8 for Gopher; 18 for Boxing and Krull |

LSM, an important class of biologically plausible recurrent SNNs, has thus far been primarily demonstrated for pattern (speech/image) recognition (Bellec et al.,

A Liquid State Machine (LSM) is a bio-inspired recurrent spiking neural network composed of an input layer sparsely connected to a randomly interlinked liquid of spiking neurons for the real-time processing of spatio-temporal inputs. In this work, we proposed LSMs, trained using a Q-learning based methodology, for solving complex Reinforcement Learning (RL) tasks like playing Pacman and Atari games, which have hitherto been benchmarks for deep reinforcement learning networks. We presented initialization strategies for the fixed input-to-liquid and recurrent-liquid connectivity matrices and weights to enable the liquid to produce useful high-dimensional representations of the environment, based on the current and past input states, necessary for efficient training of the liquid-to-readout weights. We demonstrated the significance of the sparse recurrent connections, which enable the liquid to retain a decaying memory of past input representations and perform temporal integration across RL time-steps, by training the LSM using partial input state information, which yielded a higher accumulated reward than that provided by a liquid without recurrent connections. Our experiments on the Pacman game showed that the LSM learns the optimal strategies for different game settings and grid sizes. Our analyses on a subset of Atari games indicated that the LSM achieves scores comparable to those reported for human players in existing works.

Publicly available datasets were analyzed in this study. This data can be found here:

GS and WP wrote the paper. WP performed the simulations. All authors helped with developing the concepts, conceiving the experiments, and writing the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.