How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Is there an easy built-in way in RLlib to do policy regularization in a vanilla policy gradient (PG) algorithm?
Hi @mgerstgrasser , maybe use the same approach as in PPO?
@Lars_Simon_Zehnder Yeah, that’s my thought as well - I guess my question is if I have to implement a custom policy for this, or if there’s an easier way?
@mgerstgrasser if you can be hacky, then you might implement it by using the callback
on_create_policy() and overwriting the loss function so that it also contains an entropy term.
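The callback approach could look roughly like the following. This is a hedged sketch only: `StubPolicy` and `patch_entropy_bonus` are stand-ins to illustrate the wrapping pattern, not the real RLlib classes or the actual `on_create_policy()` signature.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p)))

class StubPolicy:
    """Stand-in for an RLlib policy (NOT the real class)."""
    def loss(self, action_probs):
        return 1.0  # placeholder policy-gradient loss

def patch_entropy_bonus(policy, entropy_coeff=0.01):
    """What you would do inside on_create_policy(): keep the original
    loss and wrap it so an entropy bonus is subtracted from it."""
    original_loss = policy.loss
    def loss_with_entropy(action_probs):
        return original_loss(action_probs) - entropy_coeff * entropy(action_probs)
    policy.loss = loss_with_entropy

policy = StubPolicy()
patch_entropy_bonus(policy)
# After patching, high-entropy action distributions yield a lower (better) loss.
```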
@Lars_Simon_Zehnder Ah, that’s an interesting idea - I’ll see if that works; I suppose doing a custom policy is also not that much harder.
While I have your attention, another quick question for my understanding: in PPO, we do `curr_entropy = curr_action_dist.entropy()`. So I assume the bigger `curr_entropy` is, the more entropy in the policy, i.e. the more exploration.
And then in the loss calculation we do `total_loss = ... - self.entropy_coeff * curr_entropy`. Am I reading that correctly that if `self.entropy_coeff` is positive, then larger entropy is penalized more? Or in other words, if I want to encourage exploration, should I set `entropy_coeff < 0`? Or am I mixing up the signs somewhere?
@mgerstgrasser no it isn’t. Just inherit from `PGPolicy` and override `loss()`. If you then need a coefficient schedule, you can use a mixin.
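The subclass-and-override pattern could look like this. Again a hedged sketch: `BasePGPolicy` here is a hypothetical stand-in for RLlib's `PGPolicy`, with a simplified `loss()` signature, just to show where the entropy term goes.

```python
import numpy as np

class BasePGPolicy:
    """Hypothetical stand-in for RLlib's PGPolicy (NOT the real class)."""
    def loss(self, logp, advantages, action_probs):
        # vanilla policy-gradient loss: -E[log pi(a|s) * A]
        return float(-np.mean(np.asarray(logp) * np.asarray(advantages)))

class EntropyRegularizedPG(BasePGPolicy):
    def __init__(self, entropy_coeff=0.01):
        self.entropy_coeff = entropy_coeff

    def loss(self, logp, advantages, action_probs):
        pg_loss = super().loss(logp, advantages, action_probs)
        p = np.asarray(action_probs, dtype=float)
        ent = float(-np.sum(p * np.log(p)))
        # Subtract: with a positive coefficient, higher entropy lowers the loss.
        return pg_loss - self.entropy_coeff * ent
```

A coefficient schedule could then be mixed in by making `entropy_coeff` a function of the training iteration instead of a constant.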
Yup, that is tricky. If you look into the paper, you see that PPO maximizes a surrogate objective. Since we minimize in the `Policy`, the surrogate objective is negated, and so is the entropy term - more entropy is better. Hope that clarifies it.
@Lars_Simon_Zehnder Not entirely - should the `entropy_coeff` be positive then?
@mgerstgrasser Yes, it should.
Perhaps this will help explain why the entropy coefficient should be positive.
The goal of an entropy maximization regularizer is to increase the entropy of the action distribution produced by the policy. If we were optimizing with gradient ascent, then a larger entropy would be better.
But the optimizers we use in DL libraries (Adam, SGD, etc.) perform gradient descent, not ascent, so instead we minimize a loss function.
This of course means that smaller values are better, but we have a mismatch because we want higher entropy to be better. The easiest way to make larger entropy values look better to a minimizer is to negate them: 5 > 0.2, but -5 < -0.2.
You will see this used regularly in RL, and DL for that matter, where we convert gradient ascent into gradient descent by negating the term.
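A tiny numeric check of the sign argument (the loss values here are made up for illustration): with a positive `entropy_coeff`, the negated entropy term makes the high-entropy policy the one with the smaller loss, so gradient descent pushes entropy up.

```python
# With a positive coefficient, subtracting the entropy term means that
# a higher-entropy policy produces a SMALLER total loss, which is what
# a gradient-descent minimizer prefers.
entropy_coeff = 0.01
pg_loss = 1.0  # made-up policy-gradient loss, same for both policies
high_entropy, low_entropy = 5.0, 0.2

loss_high = pg_loss - entropy_coeff * high_entropy  # ≈ 0.95
loss_low = pg_loss - entropy_coeff * low_entropy    # ≈ 0.998
assert loss_high < loss_low  # high entropy wins under minimization
```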