Tuning entropy in PPO

Hi,

I’m trying to tune exploration settings in PPO. In the default config, the entropy-related values are entropy_coeff = 0.0 and entropy_coeff_schedule = None. This doesn’t make sense to me: the way I read those defaults, the agent has no incentive to explore. However, when I’ve experimented with increasing the entropy coefficient or scheduling decaying entropy values, the models generally perform worse than with the defaults. Few of the tuned examples modify these entropy-related settings, which just seems odd to me. I’m pretty sure I don’t understand the full picture here, so would anyone care to explain?

Here is the loss calculation in PPO for reference:

total_loss = reduce_mean_valid(
    -surrogate_loss                                 # clipped policy objective
    + policy.kl_coeff * action_kl                   # KL penalty
    + policy.config["vf_loss_coeff"] * vf_loss      # value function loss
    - policy.entropy_coeff * curr_entropy           # entropy bonus (0.0 by default)
)

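For concreteness, this is roughly the kind of override I’ve been experimenting with (values purely illustrative, using the Tune-style config dict; as far as I understand, entropy_coeff_schedule, when set, takes over from the constant coefficient):

import ray
from ray import tune

config = {
    "env": "CartPole-v1",  # stand-in for my custom env
    "framework": "tf",
    # constant entropy bonus added to the loss ...
    "entropy_coeff": 0.01,
    # ... or a decaying bonus, given as [timestep, value] endpoints
    "entropy_coeff_schedule": [[0, 0.01], [1_000_000, 0.0]],
}

ray.init()
tune.run("PPO", config=config, stop={"timesteps_total": 2_000_000})
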
PS: The background for asking this is that I’m comparing SAC and PPO on a custom-made environment. In SAC, my understanding is that the higher the entropy target is set, the more exploration happens. I know that entropy signifies different things in SAC and PPO, but what I’m trying to do is compare exploration rates between the two algorithms.
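
Concretely, my plan is to watch the policy entropy that RLlib reports in the training results for each algorithm. Below is a sketch for the PPO side (the nesting of the learner stats seems to differ between RLlib versions, so this just searches the info dict for an "entropy" entry; I’m assuming SAC exposes a comparable stat):

import ray
from ray.rllib.agents.ppo import PPOTrainer

def find_entropy(stats):
    # Recursively look for an "entropy" entry in a nested stats dict.
    if isinstance(stats, dict):
        if "entropy" in stats and not isinstance(stats["entropy"], dict):
            return stats["entropy"]
        for value in stats.values():
            found = find_entropy(value)
            if found is not None:
                return found
    return None

ray.init()
trainer = PPOTrainer(config={"framework": "tf"}, env="CartPole-v1")
for i in range(5):
    result = trainer.train()
    print(i, "policy entropy:", find_entropy(result["info"]))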

Hey @ulrikah, thanks for the question. It’s true that by default (entropy_coeff=0.0) we don’t incentivize the algorithm to produce high-entropy action distributions. I think it depends on the task you want to learn. For CartPole or Atari, for example, you may not want to put too much focus on high entropy while learning (I usually choose small initial_alpha values when testing SAC on CartPole, to keep the entropy low in general), but for HalfCheetah, this is completely different.
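
To make "small initial_alpha" concrete, here is roughly what I mean for SAC on CartPole (keys follow RLlib’s SAC defaults; the values are just examples):

import ray
from ray.rllib.agents.sac import SACTrainer

sac_config = {
    "framework": "tf",
    "initial_alpha": 0.05,     # start with a weak entropy bonus
    "target_entropy": "auto",  # alpha still gets auto-tuned from this starting point
}

ray.init()
trainer = SACTrainer(config=sac_config, env="CartPole-v1")
print(trainer.train()["episode_reward_mean"])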

Thanks for the response!

Is this heuristic something you have developed empirically, or is it related to the size of the action space or some other aspect of the environment?