Entropy Regularization in PG?

mgerstgrasser · September 15, 2022, 5:33pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Is there an easy built-in way in RLlib to do policy regularization in a vanilla policy gradient (PG) algorithm?

Lars_Simon_Zehnder · September 15, 2022, 7:09pm

Hi @mgerstgrasser , maybe using the same as in PPO?

mgerstgrasser · September 15, 2022, 8:24pm

@Lars_Simon_Zehnder Yeah, that’s my thought as well - I guess my question is if I have to implement a custom policy for this, or if there’s an easier way?

Lars_Simon_Zehnder · September 16, 2022, 5:28pm

@mgerstgrasser if you can be hacky, then you might implement it by using the on_create_policy() callback to overwrite the loss function so that it also contains an entropy term.

mgerstgrasser · September 16, 2022, 6:18pm

@Lars_Simon_Zehnder Ah, that’s an interesting idea - I’ll see if that works; I suppose doing a custom policy is also not that much harder.

While I have your attention, another quick question for my understanding: In PPO, we do
curr_entropy = curr_action_dist.entropy()
So I assume the bigger curr_entropy is, the more entropy in the policy, i.e. the more exploration.
And then in the loss calculation we do
total_loss = ... - self.entropy_coeff * curr_entropy
Am I reading that correctly that if self.entropy_coeff is positive, then larger entropy is penalized more? Or in other words, if I want to encourage exploration, I should set entropy_coeff < 0? Or am I mixing something up with the signs here somewhere?

Thank you!!!

Lars_Simon_Zehnder · September 16, 2022, 8:03pm

@mgerstgrasser No, it isn’t much harder. Just inherit from PGPolicy and override loss(). If you then need a coefficient schedule, you can use a mixin.

Yup, the signs are tricky. If you look at the paper, you will see that PPO maximizes a surrogate objective. Since the Policy minimizes a loss, the surrogate objective is negated, and so is the entropy term: with a positive coefficient, more entropy gives a lower loss, so more entropy is better. Hope that clarifies it.
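To make the sign convention concrete, here is a minimal sketch in plain Python. It is not RLlib code: the function names and the numbers are made up for illustration, and the policy-gradient term is held fixed so that only the entropy term changes.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pg_loss_with_entropy(log_prob, advantage, probs, entropy_coeff):
    """Loss to MINIMIZE for one sample (hypothetical helper, not an RLlib API).

    The policy-gradient objective log_prob * advantage is negated because
    we minimize, and the entropy bonus is subtracted for the same reason.
    """
    policy_loss = -log_prob * advantage
    return policy_loss - entropy_coeff * entropy(probs)

# Two illustrative action distributions: one nearly deterministic, one uniform.
peaked = [0.98, 0.01, 0.01]
uniform = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

# Same (made-up) policy term for both, so any difference comes from entropy.
loss_peaked = pg_loss_with_entropy(-1.0, 0.5, peaked, entropy_coeff=0.01)
loss_uniform = pg_loss_with_entropy(-1.0, 0.5, uniform, entropy_coeff=0.01)

# With a positive coefficient the higher-entropy policy has the lower loss.
print(loss_uniform < loss_peaked)
```

With a negative entropy_coeff the comparison flips, which is why a positive value is the one that encourages exploration.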

mgerstgrasser · September 16, 2022, 8:19pm

@Lars_Simon_Zehnder Not entirely - should the entropy_coeff be positive then?

Lars_Simon_Zehnder · September 16, 2022, 8:21pm

@mgerstgrasser Yes it should

mgerstgrasser · September 16, 2022, 8:21pm

Got it, thank you!

mannyv · September 17, 2022, 2:51am

@mgerstgrasser,

Perhaps this will help explain why the entropy coefficient should be positive.

The goal of an entropy maximization regularizer is to increase the entropy of the action distribution defined by the logits the policy produces. If we were optimizing with gradient ascent, then a larger entropy would be better.

But the optimizers we use in DL libraries (Adam, SGD, etc.) perform gradient descent, not ascent. So instead we attempt to minimize a loss function.

This of course means that larger loss values are worse, but we have a mismatch because we want higher entropy to be better. The easiest way to make larger entropy values produce a smaller loss is to negate them: 5 > 0.2, but -5 < -0.2.

You will see this used regularly in RL, and DL for that matter, where we convert gradient ascent into gradient descent by negating the term.
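As a rough illustration of that negation trick, the sketch below runs plain gradient descent on the loss -H(softmax(logits)) using only the standard library. The starting logits, learning rate, and step count are arbitrary choices for the example, and the gradient is worked out by hand rather than taken from any DL framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def descend_on_negated_entropy(logits, lr=0.5, steps=200):
    """Gradient DESCENT on loss = -H(softmax(logits)).

    For H = -sum_j p_j log p_j, dH/dz_i = -p_i * (log p_i + H),
    so d(-H)/dz_i = p_i * (log p_i + H).
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        h = entropy(probs)
        grad = [p * (math.log(p) + h) for p in probs]
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits

start = [3.0, 0.0, -1.0]  # arbitrary, fairly peaked starting policy
end = descend_on_negated_entropy(start)

# Minimizing the negated entropy has increased the entropy itself.
print(entropy(softmax(start)), entropy(softmax(end)))
```

The optimizer only ever goes downhill on the loss, yet the entropy goes up, which is the same effect a positive entropy coefficient has inside the total loss.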
