Target entropy in the discrete SAC implementation

Hello all!

I have noticed a few recurring issues with the discrete SAC implementation appearing here and there. Apologies if this exact topic has come up before, but I haven't found any such discussions or issues.

Based on my initial experiments, it appears that there is a problem with the target entropy, which is initialized as proposed in [2] as follows:

    target_entropy = 0.98 * np.array(
        -np.log(1.0 / action_space.n), dtype=np.float32
    )

Is the default target entropy for discrete SAC too large?

For a two-action environment, this gives a target entropy of 0.67, while the entropy of the uniform distribution (p = 0.5 for both actions) is 0.69. If I understood the authors of [1] correctly, the target entropy is meant to be a lower bound on the policy entropy. Here we are effectively setting that lower bound to (almost) the entropy of the uniform distribution, i.e. to the maximum achievable entropy. In my experiments, this results in \alpha growing with every iteration and makes the performance of the algorithm unstable.
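A quick sanity check of these numbers for a two-action environment (plain numpy, just to illustrate the point):

    import numpy as np

    # For n = 2 actions, the default target is 98% of the maximum entropy,
    # i.e. almost the entropy of the uniform policy.
    n = 2
    uniform_entropy = np.log(n)          # ~0.693
    default_target = 0.98 * np.log(n)    # ~0.679
    print(uniform_entropy, default_target)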

A fix?

I think there is a fix, which follows the ideas in [1], where the authors consider the continuous case and set the target entropy to -1 * action_dim, where action_dim is the dimension of the action.
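For reference, this continuous-case heuristic is commonly written in code roughly like this (assuming a gym Box action space; shown only for illustration, not taken from the PR):

    # Continuous-case heuristic from [1]: target entropy of -1 per action dimension.
    # Assumes action_space is a gym Box space.
    target_entropy = -float(np.prod(action_space.shape))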

For a Gaussian action distribution, the entropy is 0.5 * \log(2 * \pi * e * \sigma^2), which equals -0.88 for \sigma = 0.1. A target of -1 per action dimension therefore corresponds roughly to the entropy of a narrow Gaussian: the lower bound on the policy entropy is that of a fairly deterministic policy, not a maximum-entropy one. Following the same idea, we can set the target entropy to

    epsilon = 0.975
    target_entropy = -np.array(
        # contribution of the greedy action
        epsilon * np.log(epsilon) +
        # contribution of the random actions
        (1 - epsilon) * np.log((1 - epsilon) / (action_space.n - 1)),
        dtype=np.float32,
    )

This way, we ensure that our stochastic policy is at least as stochastic as an epsilon-greedy policy.
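Plugging in the numbers from above (just an illustrative check, not part of the implementation):

    import numpy as np

    # Entropy of a Gaussian with sigma = 0.1 (continuous case):
    sigma = 0.1
    print(0.5 * np.log(2 * np.pi * np.e * sigma**2))   # ~ -0.88

    # Entropy of an epsilon-greedy policy with n = 2 actions, epsilon = 0.975:
    n, epsilon = 2, 0.975
    print(-(epsilon * np.log(epsilon)
            + (1 - epsilon) * np.log((1 - epsilon) / (n - 1))))  # ~ 0.117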

This idea works well on the very simple CartPole-v0, but I don't have the means to test it on more complex environments. Interestingly, the current SAC implementation fails to solve this environment.

My implementation of REDQ as SAC

My fix is implemented in the REDQ algorithm PR. REDQ uses an ensemble of models but reduces to SAC if the ensemble size is equal to 2. The list of changes in comparison to SAC is as follows:

  1. Added an ensemble of model parameters. We can now choose N target value functions and Q functions, as well as the way to aggregate them.
  2. The target values are updated using several critics instead of one. In the policy loss function, the variable q_t is now also chosen as the minimum of the two Q functions when the ensemble size is equal to two (a minimal sketch follows this list).
  3. Changed the target entropy to the entropy of an epsilon-greedy policy as above.
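Here is a minimal sketch of the aggregation step from point 2, with assumed array shapes; it is illustrative only and not the exact code from the PR:

    import numpy as np

    def aggregate_q(q_ensemble):
        """Aggregate an ensemble of discrete Q-value estimates.

        q_ensemble: list of [batch, n_actions] arrays, one per critic.
        Taking the elementwise minimum reduces to the usual SAC twin-Q
        minimum when the ensemble size is 2.
        """
        return np.min(np.stack(q_ensemble, axis=0), axis=0)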

Experiments on CartPole-v0.

  1. REDQ with an ensemble of 2 (should be equivalent to SAC), target entropy 0.67
  2. SAC - the current SAC implementation, target entropy 0.67
  3. REDQ with an ensemble of 2 (should be equivalent to SAC), target entropy 0.11 (epsilon-greedy with a 2.5% chance of random actions)

REDQ with the target entropy 0.11 learns CartPole well and fast. At the same time, it is apparent that with the target entropy 0.67 the gradient of the alpha loss is almost always negative, which forces the alpha parameter to grow exponentially. This in turn makes the Q functions grow exponentially. Note that after 40k samples the alpha loss hovers around zero for the runs with the target entropy 0.67, but the return drops back to the initial values obtained when the policy was random. Also note that the gradient of the alpha loss with the target entropy 0.11 is initially positive (until about 10k steps), since the gradient is equal to the alpha loss times the sign of log_alpha; therefore alpha decreases until 10k steps.

Increasing the entropy learning rate speeds up the convergence to the random policy.
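To see why the 0.67 target pushes alpha up while 0.11 does not, here is a minimal sketch of the standard SAC temperature objective for a discrete policy (my own illustration; RLlib's actual alpha loss is parameterised via log_alpha, as noted above):

    import numpy as np

    # J(alpha) = E_{a ~ pi}[ -alpha * (log pi(a|s) + target_entropy) ],
    # so dJ/dalpha = policy_entropy - target_entropy.
    def alpha_grad(pi_probs, target_entropy):
        policy_entropy = -np.sum(pi_probs * np.log(pi_probs))
        return policy_entropy - target_entropy

    # A mildly greedy two-action policy (p = [0.8, 0.2]) has entropy ~0.50,
    # below the 0.67 target, so the gradient is negative and gradient descent
    # keeps increasing alpha; with a 0.11 target the gradient is positive.
    print(alpha_grad(np.array([0.8, 0.2]), 0.67))  # < 0 -> alpha grows
    print(alpha_grad(np.array([0.8, 0.2]), 0.11))  # > 0 -> alpha shrinks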

These experiments seem to support my hypothesis about the target entropy, but more experiments are needed to be sure. I was wondering if anyone could try out this idea on more complex environments.

[1] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., Levine, S. "Soft Actor-Critic Algorithms and Applications." arXiv preprint arXiv:1812.05905 (2018).

[2] Christodoulou, P. "Soft Actor-Critic for Discrete Action Settings." arXiv preprint arXiv:1910.07207 (2019).

An update.

I additionally ran SAC with the target entropy 0.11.

It appears that with this entropy we can solve the task with the current implementation.

Code and hyperparameters

    # Imports assumed for this snippet (Ray 2.x / RLlib API).
    import numpy as np
    import ray
    from ray import air, tune
    from ray.rllib.algorithms.sac import SAC

    max_concurrent_trials, num_samples, num_gpus = 1, 1, 1
    ray.init(num_gpus=num_gpus, local_mode=True)
    stop = {"timesteps_total": 50000}
    params = {
        "num_gpus": num_gpus / float(max_concurrent_trials),
        "env": "CartPole-v0",
        "gamma": 0.99,
        "tau": 0.005,
        "train_batch_size": 32,
        "target_network_update_freq": 1,
        "num_steps_sampled_before_learning_starts": 500,
        "optimization": {
            "actor_learning_rate": 0.005,
            "critic_learning_rate": 0.005,
            "entropy_learning_rate": 0.0005,
        },
        "seed": tune.choice([42, 43, 44, 45, 46, 47, 48, 49, 50]),
    }

    epsilon = 0.975
    target_entropies = [
        # proposed target: entropy of an epsilon-greedy policy (~0.11 for two actions)
        -np.array(
            # contribution of the greedy action
            epsilon * np.log(epsilon) +
            # contribution of the random actions
            (1 - epsilon) * np.log((1 - epsilon) / (2 - 1)),
            dtype=np.float32,
        ),
        # current default: 0.98 * log(n), close to the maximum entropy
        -0.98 * np.array(np.log(1 / 2.0), dtype=np.float32),
    ]
    for target_entropy in target_entropies:
        params.update(
            {"target_entropy": target_entropy}
        )
        tuner = tune.Tuner(
            SAC,
            tune_config=tune.TuneConfig(
                metric="episode_reward_mean",
                mode="max",
                scheduler=None,
                num_samples=num_samples,
                max_concurrent_trials=max_concurrent_trials,
            ),
            param_space=params,
            run_config=air.RunConfig(stop=stop),
        )
        results = tuner.fit()