How to handle non-finite gradients in RLlib?

I am constantly running into this issue, which is preventing me from training my network effectively. I know that the parameters, including the learning rate, can cause this, but I need a way to get past the error, potentially by replacing non-finite gradient values with finite ones.

I am using the same data and parameters with Stable Baselines3 without running into this issue.

RuntimeError: The total norm of order 2.0 for gradients from parameters is non-finite, so it cannot be clipped.

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig()

config = config.training(
    lr=0.003,
    grad_clip=1.0,
    clip_param=0.2,
    num_sgd_iter=10,
    gamma=0.99,
    lambda_=0.95,
    entropy_coeff=0.0,
)
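
To make it concrete, this is roughly what I mean by replacing non-finite values. In a plain PyTorch training loop I could sanitize the gradients before clipping with something like the sketch below, but I don't know where the equivalent hook is in RLlib (the model/optimizer here are placeholders, not RLlib objects):

import torch

def sanitize_gradients(model):
    # Replace NaN with 0 and +/-inf with large finite values in every gradient
    # so that clip_grad_norm_ no longer sees a non-finite total norm.
    for param in model.parameters():
        if param.grad is not None:
            param.grad.nan_to_num_(nan=0.0, posinf=1e6, neginf=-1e6)

# After loss.backward() and before the optimizer step:
# sanitize_gradients(model)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()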

Hi @Pier-Olivier_Marquis,

The first step I would take in this situation would be to figure out whether the problematic gradients are coming from the action policy network or the value network.
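
A quick way to check is to dump which parameters end up with non-finite gradients right after a backward pass. Rough sketch below; how you get hold of the underlying torch model depends on which RLlib API stack you are on, and the "vf"/"value" substrings are just a guess at how your model names its value branch:

import torch

def report_nonfinite_grads(model):
    # Print every parameter whose gradient contains NaN/inf, and guess from the
    # parameter name which head it belongs to.
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if not torch.isfinite(param.grad).all():
            head = "value branch" if ("vf" in name or "value" in name) else "policy branch"
            print(f"non-finite gradient in {name} ({head})")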

Also, there are two big differences in PPO between SB3 and RLlib:

  1. SB3 does not use a KL term in the loss; it only uses KL as an early-stopping heuristic.
  2. They do value function clipping very differently. SB3 treats it like the actor clipping, where the clip is relative to the value predictions from before the update, while RLlib treats it as an absolute clip value (see the config sketch after this list).
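
To rule those two differences out, you can push RLlib's PPO closer to SB3's behavior from the config, e.g. drop the KL term and loosen the value-function clip (the values below are just a starting point to experiment with, not recommendations):

config = config.training(
    kl_coeff=0.0,        # no KL penalty in the loss, similar to SB3
    vf_clip_param=10.0,  # RLlib's absolute clip on the value loss; raise it so it rarely bites
)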

Do you have a continuous action space? If so, you can run into a problem where the variance of the actions becomes infinitesimally small and blows up the log probability.
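
You can reproduce that failure mode in isolation: a Gaussian log-probability has a -log(std) term and a 1/std^2 term, so a near-zero standard deviation makes the log-prob and its gradient explode (standalone torch sketch, the numbers are made up purely for illustration):

import math
import torch

action = torch.tensor(0.5)
for std in (1.0, 1e-2, 1e-20):
    # Learnable log-std, as a stand-in for what the policy head outputs.
    log_std = torch.tensor(math.log(std), requires_grad=True)
    dist = torch.distributions.Normal(torch.tensor(0.0), log_std.exp())
    logp = dist.log_prob(action)
    logp.backward()
    print(f"std={std:g}  log_prob={logp.item():.3e}  dlogp/dlog_std={log_std.grad.item():.3e}")

If that turns out to be the culprit, the usual mitigation is to bound the log-std coming out of the policy model (SB3, as far as I know, sidesteps some of this by using a state-independent log-std parameter for continuous actions by default).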