Hi,
I’m trying to tune exploration settings in PPO. In the default config, the entropy-related values are entropy_coeff = 0.0 and entropy_schedule = None. That doesn’t make sense to me: the way I read those defaults, the agent has no incentive to explore. However, when I’ve experimented with increasing the entropy coefficient or with scheduling a decaying entropy coefficient, the models generally perform worse than with the default settings. Very few of the tuned examples modify these entropy-related settings either, which just seems odd to me. I’m pretty sure I don’t understand the full picture here, so does anyone care to explain?
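For reference, here is roughly how I’ve been setting the coefficient and the decaying schedule. This is just a sketch against the older ray.rllib.agents.ppo API; the exact keys may differ between RLlib versions, and CartPole-v1 is only a stand-in for my custom environment:

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

config = {
    "env": "CartPole-v1",  # stand-in for my custom environment
    # Constant entropy bonus instead of the 0.0 default:
    "entropy_coeff": 0.01,
    # ...or a decaying schedule given as [[timestep, value], ...] pairs:
    "entropy_coeff_schedule": [[0, 0.01], [1_000_000, 0.0]],
}

trainer = PPOTrainer(config=config)
result = trainer.train()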
Here is the loss calculation in PPO for reference:
total_loss = reduce_mean_valid(
    -surrogate_loss
    + policy.kl_coeff * action_kl
    + policy.config["vf_loss_coeff"] * vf_loss
    - policy.entropy_coeff * curr_entropy
)
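The way I read that last term: with entropy_coeff > 0, a higher policy entropy lowers the total loss, so the optimizer is nudged toward more random policies, and with the 0.0 default the term simply vanishes. Here is a toy sketch of the term’s magnitude for a categorical policy (the probabilities and the 0.01 coefficient are made up for illustration):

import torch
from torch.distributions import Categorical

entropy_coeff = 0.01  # made-up value for illustration

# Nearly deterministic policy -> low entropy -> barely reduces the loss
peaked = Categorical(probs=torch.tensor([0.97, 0.01, 0.01, 0.01]))
# Near-uniform policy -> high entropy -> larger reduction of the loss
spread = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))

print(-entropy_coeff * peaked.entropy())  # ~ -0.0017
print(-entropy_coeff * spread.entropy())  # ~ -0.0139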
PS: The background for asking this is that I’m comparing SAC and PPO on a custom-made environment. In SAC, there’s a sense that the higher the maximum_entropy is set, the more exploration happens. I know that entropy signifies different things in SAC and PPO, but what I’m trying to do is compare exploration rates between the two algorithms.
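In case it helps, this is the kind of crude, algorithm-agnostic proxy I had in mind for “exploration rate”: the entropy of the empirical action distribution over evaluation rollouts, assuming a discrete action space. The sac_actions / ppo_actions arrays are hypothetical placeholders for actions collected from each trained agent on the same episodes:

import numpy as np

def empirical_action_entropy(actions, num_actions):
    # Entropy (in nats) of the empirical action histogram from a rollout;
    # higher means the agent spreads its actions more evenly.
    counts = np.bincount(actions, minlength=num_actions).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# Hypothetical usage:
# print(empirical_action_entropy(sac_actions, num_actions=4))
# print(empirical_action_entropy(ppo_actions, num_actions=4))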