Does KL loss make sense when using action masking in PPO?

Hi, I’m training a custom model with a discrete action space using PPO. In my understanding, the RLlib implementation uses both the KL penalty and clipping. I apply action masking as shown in the action masking example, and it seems to work in my environment. However, in TensorBoard I saw that the KL became infinite, and so did the total loss. I suppose this is due to action masking, since it changes the distribution severely.
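
To illustrate what I mean by the KL becoming infinite, here's a toy numpy sketch (not RLlib code, just the usual trick of giving invalid actions a huge negative logit): once one distribution puts ~zero probability on an action the other still covers, the log-ratio for that action is infinite and the KL blows up.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * np.log(p / q))

logits = np.array([1.0, 0.5, 0.2])
mask = np.array([True, True, False])          # action 2 is invalid
masked_logits = np.where(mask, logits, -1e9)  # mask by large negative logit

p = softmax(logits)           # distribution before masking
p_r = softmax(masked_logits)  # re-normalized distribution after masking

# p still puts probability mass on action 2, but p_r assigns it ~0,
# so the p * log(p / p_r) term for that action is infinite.
print(kl(p, p_r))  # -> inf (with a divide-by-zero warning)
```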

So my question is: should we rely only on the clip range (i.e., set kl_coeff=0.0) when applying action masking?
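
For reference, the knobs I mean are kl_coeff / kl_target vs. clip_param. A sketch in the classic config-dict style (exact keys/API may differ by RLlib version; the env and model names here are placeholders):

```python
config = {
    "env": "MyMaskedEnv",                            # placeholder env name
    "model": {"custom_model": "action_mask_model"},  # placeholder registered model
    # Disable the KL penalty and rely on PPO's ratio clipping only:
    "kl_coeff": 0.0,
    "kl_target": 0.01,  # has no effect once kl_coeff is 0
    "clip_param": 0.2,
}
```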

About the KL explosion with action masking: as I understand it, the action is sampled from the re-normalized distribution after masking (say, p_r). But I can't confirm whether the policy gradient is computed from p_r or from the distribution before masking, p.
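
If the model follows the pattern from the action masking example (adding the clamped log of the mask to the logits before the action distribution is built), then sampling, the log-probs used in the PPO ratio, and the entropy should all come from p_r, since the distribution never sees the unmasked logits. A toy sketch of that pattern (identifiers are illustrative, not RLlib internals):

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([1.0, 0.5, 0.2])
action_mask = torch.tensor([1.0, 1.0, 0.0])   # action 2 is invalid

# The usual masking trick: invalid actions get a huge negative logit.
inf_mask = torch.clamp(torch.log(action_mask), min=-1e9)
masked_logits = logits + inf_mask

dist = Categorical(logits=masked_logits)  # this distribution is p_r
action = dist.sample()                    # never picks the masked action
logp = dist.log_prob(action)              # what the PG update would use
```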


Came across this and I'm interested to know the answer too.