1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.42.1
- Python version: 3.10.6
- OS: Linux
- Other libs/tools (if relevant): Julia
3. What happened vs. what you expected:
I am having difficulties training an agent in a rather complex environment. I briefly describe it below for reference (a minimal sketch of the spaces follows the list).
- Obs: 12 (between ±1)
- Act: 5 (mean) (between ±1), + 5 (log_std)
- Short episodes (an expert agent would solve it in about 7 steps)
- Rather complex environment dynamics (space trajectory)
- Per-step rewards of roughly -3 to +5, and the last step gives either +10 (success) or -2 (impact). More or less, the minimum episode return is -13 and the maximum is 30.
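For reference, a minimal sketch of the spaces described above (the env class name is hypothetical and reset/step are omitted):

```python
import gymnasium as gym
import numpy as np

class SpaceTrajectoryEnvSketch(gym.Env):
    """Sketch of the observation / action spaces only (reset/step omitted)."""

    def __init__(self, config=None):
        # 12 observations, each normalized to [-1, 1]
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(12,), dtype=np.float32)
        # 5 continuous actions (the means), each in [-1, 1];
        # the 5 log_std outputs come from the policy network, not from the env
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)
```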
I have tried different configurations for my custom net. The one I am using now is composed of 3 nets (a code sketch follows the list):
- Mean net: 2 hidden layers of 128 neurons (tanh) + 1 output layer of 5 neurons (tanh)
- Log_std net: 2 hidden layers of 128 neurons (tanh) + 1 output layer of 5 neurons (tanh)
- Vf net: 2 hidden layers of 256 neurons (tanh) + 1 output layer of 1 neuron (linear)

I also tried a version with a shared mean / log_std structure, and different numbers of neurons / layers (for example 256 for mean / log_std and 1024 for the vf, with 3 hidden layers).
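In PyTorch terms, the separate-nets variant above looks roughly like this (a sketch; the mlp helper and variable names are mine, not RLlib API):

```python
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden, out_act=None):
    """n_hidden tanh hidden layers of width `hidden`, then a linear output layer
    with an optional output activation."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, act_dim = 12, 5
policy_net_mean = mlp(obs_dim, 128, act_dim, 2, nn.Tanh())     # mean head, tanh output
policy_net_log_std = mlp(obs_dim, 128, act_dim, 2, nn.Tanh())  # log_std head, tanh output
vf_net = mlp(obs_dim, 256, 1, 2)                               # value head, linear output
```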
To get a reasonable std at the beginning of training, I set the biases of the output layer of the log_std net to 10, so that the tanh returns almost 1. In the forward method, I compute low_bound + 0.5 * (up_bound - low_bound) * (self._policy_net_log_std(features) + 1), with low_bound = -5.0 and up_bound = 0.0. This means that, at the beginning of training, the std is almost 1 and should decrease as training proceeds. I also tried different low / up bounds, and different biases, so as not to start the training pinned at either limit. A sketch of this is shown below.
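Concretely, the bias initialization and the bounding look like this (a sketch, reusing the nets defined above):

```python
import torch

LOW_BOUND, UP_BOUND = -5.0, 0.0

# Set the biases of the log_std net's output layer to 10, so tanh(.) is almost 1
# at the start of training and the initial log_std sits near UP_BOUND = 0 (std ~ 1).
with torch.no_grad():
    policy_net_log_std[-2].bias.fill_(10.0)  # [-2] is the last nn.Linear before the final Tanh

def bounded_log_std(features):
    # Map the tanh output from [-1, 1] into [LOW_BOUND, UP_BOUND]
    squashed = policy_net_log_std(features)  # in [-1, 1]
    return LOW_BOUND + 0.5 * (UP_BOUND - LOW_BOUND) * (squashed + 1.0)
```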
As for hyperparameters, here too I have tested different configurations (a config sketch follows the list):
- gamma: 0.99
- lr: 3e-3, 1e-3, 3e-4, 1e-4, and so on
- lambda_: 0.97, 0.95, 0.90
- entropy_coeff: 3e-3, 3e-4, 3e-5, 3e-6
- clip_param: 3e-1, 2e-1, 1e-1, 3e-2, 2e-2, 1e-2
- vf_loss_coeff = 1
- vf_clip_param = 10, None
- train_batch_size_per_learner = 1024, 2048
- minibatch_size = 64, 128, 256, 512
- num_epochs = 1, 3, 5, 10
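For reference, a sketch of one of the tested combinations on the new API stack (the env name is a placeholder, and the custom RLModule spec for the nets above is omitted):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("SpaceTrajectoryEnv")  # placeholder name
    .training(
        gamma=0.99,
        lr=3e-4,
        lambda_=0.95,
        entropy_coeff=3e-4,
        clip_param=3e-1,
        vf_loss_coeff=1.0,
        vf_clip_param=10.0,
        train_batch_size_per_learner=2048,
        minibatch_size=128,
        num_epochs=5,
    )
)
algo = config.build()
```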
Here comes the problem. For some configurations, everything seems to work up to a certain point of the training. In particular, a high initial std and a low lr seem to be the most promising combination. However, all of a sudden instabilities appear in curr_kl_coeff and mean_kl_loss. Sometimes the policy_loss explodes as well.
Only one configuration “worked” (2 hidden layers of 256 neurons for mean and log_std (separated), with log_std biases set to 10, and 2 hidden layers of 1024 neurons for the vf) with lr = 3e-4, lambda_ = 0.95, entropy_coeff = 3e-4, clip_param = 3e-1, vf_clip_param = None, train_batch_size_per_learner = 2048, minibatch_size = 128, num_epochs = 5. However, even though it learned properly, the entropy remained constant and did not decrease, which makes me think the std did not decrease either. So training seemed successful, but the policy remained entirely stochastic.
In all unsuccessful cases, at some point during training the gradients seem to explode:
WARNING torch_learner.py:260 – Skipping this update. If updates with nan/inf gradients should not be skipped entirely and instead nan/inf gradients set to zero set torch_skip_nan_gradients to False.
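In case it is relevant: gradient clipping was not part of the sweeps above; for reference, this is where it would be set (the clip value below is just a placeholder), and the torch_skip_nan_gradients flag mentioned in the warning is a separate AlgorithmConfig setting:

```python
# Placeholder value, not something I have tested yet.
config = config.training(
    grad_clip=40.0,              # clip gradients ...
    grad_clip_by="global_norm",  # ... by global L2 norm ("value" and "norm" also exist)
)
```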
Has anyone faced similar problems / found a solution to this?
Thank you