Realistic Actor-Critic: A framework for balance between value overestimation and underestimation

Introduction: Value approximation bias is known to lead to suboptimal policies or to catastrophic accumulation of overestimation bias that prevents the agent from correctly trading off exploration and exploitation. Algorithms have been proposed to mitigate this contradiction. However, we still lack an understanding of how value bias impacts performance, as well as a method that explores efficiently while keeping updates stable. This study aims to clarify the effect of value bias and to improve reinforcement learning algorithms so as to enhance sample efficiency.

Methods: This study designs a simple episodic tabular MDP to investigate value underestimation and overestimation in actor-critic methods. It proposes a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn, with the same neural network, policies with different value confidence bounds, each corresponding to a different under-/overestimation trade-off.

Results: This study highlights that, under a fixed hyperparameter setting, agents can over-explore low-value states because the under-/overestimation trade-off is inflexible; this is a particular form of the exploration-exploitation dilemma. RAC performs directed exploration without over-exploration using the upper bounds, while still avoiding overestimation using the lower bounds. Through carefully designed experiments, this study empirically verifies that RAC achieves 10x sample efficiency and a 25% performance improvement over Soft Actor-Critic in the most challenging Humanoid environment. All source code is available at https://github.com/ihuhuhu/RAC.

Discussion: This research not only provides insights for research on the exploration-exploitation trade-off by studying how often policies visit low-value states under the guidance of different value confidence bounds, but also proposes a unified framework that can be combined with current actor-critic methods to improve sample efficiency in the continuous control domain.

For all training instances, the policies are evaluated every R_eval = 10^3 time steps. At each evaluation phase, the agent fixes its policy and interacts deterministically with a separate evaluation environment to obtain 10 episodic rewards. The mean and standard deviation of these 10 episodic rewards are the performance metric of the agent at that evaluation phase.
In the case of RAC, we employ a discrete number H of values {β_i}_{i=1}^{H} to obtain H policies. Each of the H policies is fixed at the evaluation phase and interacts deterministically with the environment to obtain 10 episodic rewards. First, the 10 episodic rewards are averaged for each policy; then the maximum of these 10-episode averages over the H policies is taken as the performance at that evaluation phase.
We repeat this procedure for 8 different random seeds of the computational packages (NumPy (Van Der Walt et al., 2011), PyTorch (Paszke et al., 2019)) and the environments (OpenAI Gym (Brockman et al., 2016)). The mean and standard deviation of the learning curve are computed from these 8 runs.
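As a concrete illustration of this evaluation protocol, the sketch below aggregates a hypothetical (H × 10) array of deterministic evaluation returns into the reported score and the seed-level statistics; the function names are ours, not taken from the released code.

```python
import numpy as np

def rac_evaluation_score(episodic_returns: np.ndarray) -> float:
    """episodic_returns: (H, 10) array of evaluation returns, one row per discrete policy.
    Average the 10 returns of each policy, then report the best policy's average."""
    per_policy_mean = episodic_returns.mean(axis=1)   # shape (H,)
    return float(per_policy_mean.max())

def learning_curve_point(scores_per_seed: np.ndarray):
    """Mean and standard deviation over the 8 random seeds at one evaluation phase."""
    return float(scores_per_seed.mean()), float(scores_per_seed.std())
```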

A.2 The normalized value bias estimation
Given a state-action pair (s, a), the normalized value bias is defined as

bias(s, a) = (Q̂_θ(s, a) − Q^π(s, a)) / |E_{s̄,ā∼π}[Q^π(s̄, ā)]|,

where
• Q^π(s, a) is the action-value function of policy π under the standard infinite-horizon discounted Monte Carlo return definition;
• Q̂_θ(s, a) is the estimated Q-value, defined as the mean of Q_{θ_i}(s, a), i = 1, . . . , N.
For RAC, the normalized value bias is defined analogously as

bias(s, a, β*) = (Q̂_θ(s, a, β*) − Q^{π*}(s, a)) / |E_{s̄,ā∼π*}[Q^{π*}(s̄, ā)]|,

where
• π* is the best-performing policy in the evaluation among the H policies (Appendix A.1);
• Q^{π*}(s, a) is the action-value function of policy π* under the standard infinite-horizon discounted Monte Carlo return definition;
• Q̂_θ(s, a, β*) is the estimated Q-value for the β* corresponding to the policy π*, defined as the mean of Q_{θ_i}(s, a, β*), i = 1, . . . , N.
To obtain diverse target state-action pairs, we first execute the policy in the environment to collect 100 state-action pairs, and then sample target state-action pairs from them without repetition. Starting from each target state-action pair, we run Monte Carlo rollouts until the maximum step limit is reached.
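A minimal sketch of how this estimate could be computed, assuming the REDQ-style normalization by the absolute average Monte Carlo return as written above; array names and shapes are illustrative.

```python
import numpy as np

def normalized_bias(q_ensemble: np.ndarray, mc_returns: np.ndarray) -> np.ndarray:
    """q_ensemble: (N, M) critic values for M sampled state-action pairs;
    mc_returns: (M,) discounted Monte Carlo returns Q^pi(s, a)."""
    q_mean = q_ensemble.mean(axis=0)                  # ensemble mean, \hat{Q}_theta(s, a)
    return (q_mean - mc_returns) / np.abs(mc_returns.mean())
```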

B HYPERPARAMETERS AND IMPLEMENTATION DETAILS
We implement all RAC algorithms with PyTorch (Paszke et al., 2019) and use Ray[tune] (Liaw et al., 2018) to build and run distributed applications. For all algorithms and variants, we first collect 5000 data points by sampling actions uniformly at random from the action space without making any parameter updates. Then, to stabilize the early learning of the critics, a linear learning-rate warm-up is applied to the critics at the start of training for RAC and its variants:

l = l_init + (l_target − l_init) · clip((t − t_start) / (t_target − t_start), 0, 1),

where t is the current time step, l is the current learning rate, l_init is the initial value of the learning rate, l_target is the target value of the learning rate, t_start is the time step at which the learning-rate adjustment starts, and t_target is the time step at which l reaches l_target.
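A minimal sketch of this warm-up schedule, assuming a clamped linear ramp between t_start and t_target; the default values follow the hyperparameter table, while the function name is ours.

```python
def critic_warmup_lr(t: int, l_init: float = 3e-5, l_target: float = 3e-4,
                     t_start: int = 5000, t_target: int = 10_000) -> float:
    """Linearly ramp the critic learning rate from l_init to l_target."""
    if t <= t_start:
        return l_init
    if t >= t_target:
        return l_target
    frac = (t - t_start) / (t_target - t_start)
    return l_init + frac * (l_target - l_init)
```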
For all RAC algorithms and variants, we parameterize both the actor and the critics with feed-forward neural networks with two hidden layers of 256 units each and rectified linear units (ReLU) (Nair and Hinton, 2010) between layers. β is log-scaled before being input to the actors and critics. To prevent the sampled β from being zero, a small value ε = 10^−7 is added to the left (lower) endpoints of U_1 and U_2. Weights of all networks are initialized with Kaiming uniform initialization (He et al., 2015), and biases are initialized to zero. We normalize actions to the range [−1, 1] for all environments.
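The sketch below shows one way the log-scaled β could be fed to a 2 × 256 ReLU critic as described above; the module and its interface are our illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class BetaConditionedCritic(nn.Module):
    """Q(s, a, beta): concatenates state, action, and log-scaled beta."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, beta):
        # beta is log-scaled before being input to the network (Appendix B).
        return self.net(torch.cat([state, action, torch.log(beta)], dim=-1))
```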

B.1 RAC-SAC algorithm
Here, the policy is modeled as a Gaussian with mean and covariance given by neural networks to handle continuous action spaces. RAC optimizes the policy using the reparameterization trick (Kingma and Welling, 2013; Haarnoja et al., 2018), in which a sample is drawn by computing a deterministic function of the state, the policy parameters, and independent noise:

ã = μ_ϕ(s) + σ_ϕ(s) ⊙ ξ,  ξ ∼ N(0, I).

The actor network outputs the Gaussian's mean and log-scaled covariance, and the log-scaled covariance is clipped to the range [−10, 2] to avoid extreme values. The actions are then bounded to a finite interval by applying an invertible squashing function (tanh) to the Gaussian samples, and the log-likelihood of the actions is computed with the squashed Gaussian trick (Haarnoja et al., 2018).
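A sketch of this reparameterized, tanh-squashed sampling step, assuming a policy head that outputs the mean and log-scaled (diagonal) covariance; the 1e-6 constant in the tanh log-likelihood correction is a common numerical-stability choice, not taken from the paper.

```python
import torch

def squashed_gaussian_sample(mean: torch.Tensor, log_std: torch.Tensor):
    log_std = torch.clamp(log_std, -10.0, 2.0)       # clip log-scaled covariance
    std = log_std.exp()
    noise = torch.randn_like(mean)                   # independent noise xi ~ N(0, I)
    pre_tanh = mean + std * noise                    # reparameterized Gaussian sample
    action = torch.tanh(pre_tanh)                    # squash to [-1, 1]
    normal = torch.distributions.Normal(mean, std)
    # squashed Gaussian log-likelihood (Haarnoja et al., 2018)
    log_prob = normal.log_prob(pre_tanh) - torch.log(1.0 - action.pow(2) + 1e-6)
    return action, log_prob.sum(dim=-1, keepdim=True)
```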
The temperature is parameterized by a one-layer feed-forward neural network T_ψ with 64 hidden units and rectified linear units (ReLU). To prevent the temperature from becoming negative, the network output is mapped through a strictly positive transform, with a constant ξ controlling the initial temperature; here log(β) is the log-scaled β and T_ψ(log(β)) is the output of the neural network.
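A sketch of the β-conditioned temperature network. The one-hidden-layer, 64-unit ReLU architecture follows the text; the exponential transform with a constant offset ξ is only an assumed way to keep the temperature positive and set its initial scale, not necessarily the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class TemperatureNet(nn.Module):
    """alpha(beta) > 0, conditioned on log-scaled beta (positivity transform is assumed)."""
    def __init__(self, xi: float = 0.0, hidden: int = 64):
        super().__init__()
        self.xi = xi
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, beta: torch.Tensor) -> torch.Tensor:
        # exp(.) keeps the temperature strictly positive; xi controls its initial scale.
        return torch.exp(self.net(torch.log(beta)) + self.xi)
```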
B.2 RAC-TD3 algorithm
The pseudocode for RAC-TD3 is shown in Algorithm 1.

B.3 Vanilla RAC algorithm
UVFA is not needed for vanilla RAC because β is a constant. The actor is updated by minimizing the following objective:

L_actor^{vanilla RAC}(ϕ) = E_{s∼B, a∼π_ϕ}[α log π_ϕ(a | s) − Q̂_θ(s, a)].   (S11)

The pseudocode for Vanilla RAC is shown in Algorithm 2.
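A direct PyTorch transcription of Eq. (S11), assuming a hypothetical `policy.rsample(states)` that returns reparameterized actions together with their log-probabilities, and a `q_ensemble` callable returning the stacked N critic values.

```python
def vanilla_rac_actor_loss(policy, q_ensemble, states, alpha):
    """L_actor(phi) = E[ alpha * log pi(a|s) - mean_i Q_theta_i(s, a) ]."""
    actions, log_probs = policy.rsample(states)       # a ~ pi_phi(.|s), with log-probs
    q_values = q_ensemble(states, actions)            # shape (N, batch, 1)
    q_mean = q_values.mean(dim=0)                     # \hat{Q}_theta(s, a)
    return (alpha * log_probs - q_mean).mean()
```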

B.4 RAC with in-target minimization
We implement RAC with in-target minimization referring to the authors' code at https://github.com/watchernyu/REDQ. The critics and actor are extended as Q_{θ_i}(s, a, k) and π_ϕ(· | s, k), where U_1 is a uniform training distribution U[1, a], a > 1, and k ∼ U_1 determines the size of the random subset M. When k is not an integer, the size of M is sampled between floor(k) and floor(k + 1) according to a Bernoulli distribution B(p) with parameter p = k − floor(k), where floor is a round-towards-zero operator.
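A small sketch of this Bernoulli rounding of a real-valued k into an integer subset size |M|; the helper name is ours.

```python
import math
import random

def subset_size(k: float) -> int:
    """Return floor(k) with probability 1 - (k - floor(k)), else floor(k) + 1."""
    lower = math.floor(k)
    p = k - lower                     # P(round up) = k - floor(k)
    return lower + (1 if random.random() < p else 0)
```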
An independent temperature network α_ψ, parameterized by ψ, is updated with the following objective:

L_temp(ψ) = E_{s∼B, k∼U_1, a∼π_ϕ(·|s,k)}[−α_ψ(k) (log π_ϕ(a | s, k) + H̄)],   (S12)

where H̄ is the target entropy. In-target minimization is used to calculate the target y:

y = r + γ (min_{i∈M} Q_{θ̄_i}(s′, ã′, k) − α_ψ(k) log π_ϕ(ã′ | s′, k)),  ã′ ∼ π_ϕ(· | s′, k),   (S13)

where M is the random subset of critic indices whose size is determined by k. Then each Q_{θ_i}(s, a, k) is updated with the same target:

L_critic(θ_i) = E_{(s,a,r,s′)∼B}[(Q_{θ_i}(s, a, k) − y)^2].   (S14)

The extended policy π_ϕ is updated by minimizing the following objective:

L_actor(ϕ) = E_{s∼B, k∼U_1, a∼π_ϕ(·|s,k)}[α_ψ(k) log π_ϕ(a | s, k) − (1/N) Σ_{i=1}^{N} Q_{θ_i}(s, a, k)].   (S15)

When interacting with the environment, exploration behavior is obtained by sampling k from the exploration distribution U_2. The pseudocode for RAC with in-target minimization is shown in Algorithm 3.
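To make the target construction concrete, here is a hedged sketch of Eq. (S13), following the REDQ-style in-target minimization that the implementation is based on; `policy.rsample` and the `target_critics` callables are illustrative stand-ins, not the authors' API.

```python
import math
import random
import torch

def compute_target(reward, done, next_state, k, policy, target_critics,
                   temperature, gamma: float = 0.99):
    """y = r + gamma * (min over random subset M of target critics - alpha * log pi)."""
    with torch.no_grad():
        next_action, next_log_prob = policy.rsample(next_state, k)
        # Bernoulli rounding of k to an integer subset size |M| (see B.4).
        m = math.floor(k) + (1 if random.random() < k - math.floor(k) else 0)
        idx = random.sample(range(len(target_critics)), m)
        q_min = torch.stack(
            [target_critics[i](next_state, next_action, k) for i in idx]
        ).min(dim=0).values
        return reward + gamma * (1.0 - done) * (q_min - temperature * next_log_prob)
```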

C VISUALISATIONS
Learned temperatures. Figure S1 visualizes the learned temperatures for different values of β during training. The learned temperatures differ substantially across β, so a single shared temperature cannot adequately cover all values of β.

Hyperparameters. The hyperparameters used for RAC and its variants are:
replay buffer capacity: 3 × 10^5 / 10^5 / 2 × 10^5 / 1 × 10^6 / 1 × 10^6 / 1 × 10^6
learning rate: 3 × 10^−4
initial critic learning rate (l_init): 3 × 10^−5
target critic learning rate (l_target): 3 × 10^−4
time steps to start learning-rate adjustment (t_start): 5000
time steps to reach the target learning rate (t_target): 10^4
number of hidden layers (for ϕ and θ_i): 2
number of hidden units per layer (for ϕ and θ_i): 256
number of hidden layers (for T_ψ): 1
number of hidden units per layer (for T_ψ): 64
discount (γ): 0.99
nonlinearity: ReLU
evaluation frequency: 10^3
minibatch size: 256
target smoothing coefficient (ρ): 0.005
Update-To-Data (UTD) ratio (G): 20
ensemble size (N): 10
number of evaluation episodes: 10
initial random time steps: 5000
frequency of delayed policy updates: 1
log-scaled covariance clip range: [−10, 2]
number of discrete policies for evaluation (H): 12