Hi, I’m training a custom model with a discrete action space using PPO. As I understand it, the RLlib implementation uses both the KL penalty and the clipped surrogate objective. I apply action masking as shown in the action masking example, and it seems to work in my environment. However, in TensorBoard I saw that the KL divergence became infinite, and so did the total loss. I suspect this is caused by the action masking, since it changes the action distribution severely; a quick sketch of what I think is happening is below.
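Here is a minimal, self-contained illustration (hypothetical logits, not from my actual model) of the kind of thing I suspect: if the two distributions that enter the KL term don't mask the same actions, some action can have non-zero probability under one policy but ~zero under the other, and the KL blows up. I'm using a plain `-inf` mask here just to make the effect obvious; the RLlib example clamps the log-mask to `FLOAT_MIN` instead, if I read it correctly.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-action discrete space (not from my env).
logits_old = torch.tensor([1.0, 0.5, -0.5, 0.2])  # distribution without the mask
mask = torch.tensor([1.0, 1.0, 0.0, 1.0])          # action 2 is invalid
logits_new = logits_old + torch.log(mask)          # masked logits: -inf on action 2

p_old = F.softmax(logits_old, dim=-1)
p_new = F.softmax(logits_new, dim=-1)

# KL(old || new) = sum_i p_old[i] * log(p_old[i] / p_new[i])
kl = (p_old * (p_old.log() - p_new.log())).sum()
print(kl)  # -> tensor(inf), because p_new is 0 where p_old > 0
```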
So my question is: should we rely only on the clip range (i.e. set kl_coeff=0.0) when applying action masking?
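Concretely, I mean something like the following (just a sketch using the PPOConfig builder API; "MyMaskedEnv" is a placeholder for my registered environment):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("MyMaskedEnv")  # placeholder for my env with action masking
    .training(
        kl_coeff=0.0,    # disable the KL penalty term entirely
        clip_param=0.3,  # keep only the clipped surrogate objective
    )
)
```

Is this the recommended way to avoid the infinite KL, or is there a better fix?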