Tuning entropy in PPO


I’m trying to tune exploration settings in PPO. In the default config, the entropy-related values are entropy_coeff = 0.0 and entropy_coeff_schedule = None. This doesn’t make sense to me: the way I read those defaults, the agent has no incentive to explore. However, when I’ve experimented with increasing the entropy coefficient or with scheduling decaying entropy values, the models generally perform worse than with the defaults. Few of the tuned examples modify these entropy-related settings, which seems odd to me. I’m pretty sure I don’t understand the full picture here, so does anyone care to explain?
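For context, a decaying schedule is usually given as a list of [timestep, value] points that get interpolated linearly. Here is a minimal, self-contained sketch of that interpolation (plain Python, not RLlib's actual implementation):

```python
def entropy_coeff_at(timestep, schedule):
    """Piecewise-linear interpolation between [timestep, value] points,
    mimicking the [[t0, v0], [t1, v1], ...] schedule format."""
    if timestep <= schedule[0][0]:
        return schedule[0][1]
    for (t0, v0), (t1, v1) in zip(schedule, schedule[1:]):
        if timestep <= t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    # Past the last point, hold the final value.
    return schedule[-1][1]

# Decay the entropy bonus from 0.01 to 0 over the first 1M timesteps.
schedule = [[0, 0.01], [1_000_000, 0.0]]
print(entropy_coeff_at(0, schedule))          # 0.01
print(entropy_coeff_at(500_000, schedule))    # 0.005
print(entropy_coeff_at(2_000_000, schedule))  # 0.0
```

A schedule like this lets the policy explore early while still converging to a (near-)deterministic policy later, which is one reason a constant nonzero coefficient can underperform.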

Here is the loss calculation in PPO for reference:

total_loss = reduce_mean_valid(
    -surrogate_loss
    + policy.kl_coeff * action_kl
    + policy.config["vf_loss_coeff"] * vf_loss
    - policy.entropy_coeff * curr_entropy
)
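Since the entropy term is subtracted from the loss, a positive entropy_coeff makes higher-entropy (more stochastic) policies cheaper, which is where the exploration incentive comes from. A small numeric illustration with made-up distributions:

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A near-deterministic policy vs. a uniform one over 4 actions.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

entropy_coeff = 0.01
# The bonus is subtracted from the loss, so a higher-entropy policy
# gets a lower loss, i.e. gradient pressure toward more exploration.
bonus_peaked = entropy_coeff * categorical_entropy(peaked)
bonus_uniform = entropy_coeff * categorical_entropy(uniform)

print(bonus_uniform > bonus_peaked)  # True: the uniform policy is rewarded
```

With entropy_coeff = 0.0 both bonuses vanish, which matches the observation that the default config gives no explicit exploration incentive.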

PS: The background for asking this is that I’m comparing SAC and PPO on a custom-made environment. In SAC, my understanding is that the higher the maximum_entropy is set, the more exploration happens. I know that entropy signifies different things in SAC and PPO, but what I’m trying to do is compare exploration rates between the two algorithms.
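For that comparison, one option is to log the average entropy of the action distributions each policy actually outputs during rollouts, rather than comparing config values across algorithms. A sketch of the two entropy formulas typically involved (categorical for discrete action spaces, diagonal Gaussian for continuous ones), assuming you can read the distribution parameters off the policy:

```python
import math

def categorical_entropy(probs):
    """Entropy (nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def diag_gaussian_entropy(log_stds):
    """Differential entropy (nats) of a diagonal-Gaussian action
    distribution: sum over dims of 0.5 * log(2*pi*e) + log_std."""
    return sum(0.5 * math.log(2 * math.pi * math.e) + ls for ls in log_stds)

# Example: a uniform 4-action policy vs. a 2-D Gaussian with std = 1.
print(categorical_entropy([0.25] * 4))    # ≈ 1.386 (= ln 4)
print(diag_gaussian_entropy([0.0, 0.0]))  # ≈ 2.838
```

Averaging these over visited states gives a per-algorithm exploration curve that is at least measured on a common scale.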

Hey @ulrikah, thanks for the question. It’s true: by default (entropy_coeff=0.0), we don’t incentivize the algo to produce high-entropy actions. I think it depends on the task you want to learn. For example, for CartPole or Atari, you may not want too much focus on high entropy while learning (I usually choose small initial_alphas when testing SAC on CartPole, to lower the entropy in general), but for HalfCheetah, this is completely different.
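To make the initial_alpha point concrete, here is a toy sketch (plain Python, not RLlib code) of the per-sample SAC actor objective, alpha * log_pi - Q(s, a): a larger alpha penalizes confident (high log-probability) actions more, pushing the policy toward higher entropy, while a small alpha lets the Q-term dominate.

```python
# Per-sample SAC actor objective (to be minimized):
#   alpha * log_pi - Q(s, a)
# log_pi is the log-probability of the sampled action; a confident
# (low-entropy) policy has high log_pi, which a large alpha penalizes.
def sac_actor_loss(alpha, log_pi, q_value):
    return alpha * log_pi - q_value

q = 10.0
confident_log_pi = -0.1  # near-deterministic action choice
spread_log_pi = -2.0     # more stochastic action choice

# With a small alpha, the Q-term dominates and confidence is cheap:
print(sac_actor_loss(0.05, confident_log_pi, q))  # ≈ -10.005
# With a large alpha, the same confidence costs noticeably more
# than the spread-out alternative:
print(sac_actor_loss(1.0, confident_log_pi, q)
      > sac_actor_loss(1.0, spread_log_pi, q))    # True
```

So choosing a small initial alpha, as described above for CartPole, starts training with weak entropy pressure.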

Thanks for the response!

Is this heuristic something you have developed empirically, or is it related to the size of the action space or some other aspect of the environment?