@Lars_Simon_Zehnder Yeah, that’s my thought as well - I guess my question is if I have to implement a custom policy for this, or if there’s an easier way?

@mgerstgrasser if you can be hacky, then you might implement it by using the on_create_policy() callback and overwriting the loss function so that it also contains an entropy term.

@Lars_Simon_Zehnder Ah, that’s an interesting idea - I’ll see if that works; I suppose doing a custom policy is also not that much harder.

While I have your attention, another quick question for my understanding: In PPO, we do curr_entropy = curr_action_dist.entropy()
So I assume the bigger curr_entropy is, the more entropy in the policy, i.e. the more exploration.
And then in the loss calculation we do total_loss = ... - self.entropy_coeff * curr_entropy
Am I reading that correctly that if self.entropy_coeff is positive, then larger entropy is penalized more? Or in other words, if I want to encourage exploration, I should set entropy_coeff < 0? Or am I mixing something up with the signs here somewhere?

@mgerstgrasser no, it isn't. Just inherit from PGPolicy and override loss(). If you then need a coefficient schedule, you can use a mixin.

Yup, that is tricky. If you look into the paper, you see that PPO maximizes a surrogate objective. Since we minimize in the Policy, the surrogate objective is negated, and so is the entropy term: with a positive entropy_coeff, more entropy lowers the loss, so more entropy is better. Hope that clarifies it.
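To make the sign convention concrete, here is a minimal sketch in plain Python. It is not RLlib's actual loss: `total_loss`, the fixed `surrogate` value, and the two example distributions are made up for illustration, and the real PPO loss has more terms. It only shows that with a positive `entropy_coeff`, the higher-entropy policy ends up with the lower loss.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def total_loss(surrogate, probs, entropy_coeff):
    # Hypothetical simplified loss: negate the surrogate objective (which we
    # want to maximize) and subtract the weighted entropy bonus.
    return -surrogate - entropy_coeff * entropy(probs)

peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic policy
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory policy

# Same surrogate value for both, so only the entropy term differs.
loss_peaked = total_loss(surrogate=1.0, probs=peaked, entropy_coeff=0.01)
loss_uniform = total_loss(surrogate=1.0, probs=uniform, entropy_coeff=0.01)

print("entropy peaked :", entropy(peaked))
print("entropy uniform:", entropy(uniform))
print("loss peaked    :", loss_peaked)
print("loss uniform   :", loss_uniform)
```

So a positive coefficient rewards entropy; a negative one would penalize it.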

Perhaps this will help explain why the entropy coefficient should be positive.

The goal of an entropy maximization regularizer is to increase the entropy of the action distribution produced by the policy. If we were optimizing with gradient ascent, then a larger entropy would be better.

But the optimizers we use in DL libraries (Adam, SGD, etc.) perform gradient descent, not ascent. So instead we attempt to minimize a loss function.

This of course means that larger values are worse, but we have a mismatch because we want higher entropy to be better. The easiest way to make larger entropy values worse is to negate them: 5 > 0.2, but -5 < -0.2.

You will see this used regularly in RL, and DL for that matter, where we convert gradient ascent into gradient descent by negating the term.
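As a rough illustration of the negation trick, the sketch below runs plain gradient descent on the negated entropy of a softmax distribution and checks that the entropy goes up. Everything in it is an assumption for the example: the starting logits, the step size, and the finite-difference gradient stand in for what an autodiff library and a real optimizer would do.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(logits):
    """Entropy (in nats) of the categorical distribution softmax(logits)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def loss(logits):
    # Negated entropy: minimizing this loss maximizes the entropy.
    return -entropy(logits)

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

logits = [2.0, 0.0, -1.0]  # arbitrary, fairly peaked starting point
start_entropy = entropy(logits)

for _ in range(200):
    g = numerical_grad(loss, logits)
    # Ordinary gradient *descent* step on the negated entropy.
    logits = [x - 0.5 * gi for x, gi in zip(logits, g)]

end_entropy = entropy(logits)
print("entropy before:", start_entropy)
print("entropy after :", end_entropy)
print("maximum (ln 3):", math.log(3))
```

The optimizer only ever descends, yet the entropy climbs toward its maximum for three actions, because the quantity being minimized is its negative.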