You can see that the entropy goes to 0 in ray=1.2.0 but stays stuck around 2 in ray=1.11.0.
The PPO hyperparameters and the environment are strictly identical in both trials. Summary (see the config sketch after the list):
sgd_minibatch_size: 512
train_batch_size: 1600 (since the number of agents in the MARL env varies from 20 to 40, the ACTUAL batch size might range from 20K to 50K)
rollout_fragment_length: 200
entropy_coeff: 0
lr: 3e-4
num_sgd_iter: 5
num_workers: 4
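Roughly, this corresponds to a trainer config like the following sketch (the env name `MyMultiAgentEnv` and the `framework` choice are placeholders, not part of the original setup):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # ray 1.x API

ray.init()

config = {
    "env": "MyMultiAgentEnv",        # placeholder for the actual (registered) MARL env
    "sgd_minibatch_size": 512,
    "train_batch_size": 1600,
    "rollout_fragment_length": 200,
    "entropy_coeff": 0.0,
    "lr": 3e-4,
    "num_sgd_iter": 5,
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "framework": "torch",            # placeholder; the framework is not stated above
}

trainer = PPOTrainer(config=config)
for i in range(100):
    result = trainer.train()
    # Policy entropy is reported under result["info"]["learner"];
    # the exact nesting of the learner stats differs between 1.2.0 and 1.11.0.
    print(i, result["info"]["learner"])
```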
I have identified some differences, but according to my experiments they are not the major cause:
In MultiGPUTrainOneStep the batch is not shuffled as it was in ray=1.2.0, but my experiment using simple_optimizer yields the same result, so the batch shuffling is not the cause (see the sketch after this list).
The auto-adjustment of rollout_fragment_length does not affect my result, since train_batch_size 1600 is divisible by num_workers 4 * num_envs_per_worker 1 * rollout_fragment_length 200 = 800.
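For concreteness, a minimal sketch of these two checks, reusing `config` and `PPOTrainer` from the sketch above (`simple_optimizer` is the stock RLlib flag that switches back to the non-multi-GPU training path):

```python
# Divisibility check: 1600 % (4 * 1 * 200) == 0, so RLlib should not
# auto-adjust rollout_fragment_length here.
assert 1600 % (4 * 1 * 200) == 0

# A/B test of the training path in ray=1.11.0: identical hyperparameters,
# but with simple_optimizer=True so MultiGPUTrainOneStep (and its unshuffled
# minibatching) is bypassed. `config` and `PPOTrainer` are from the sketch above.
config_simple = dict(config, simple_optimizer=True)
trainer_simple = PPOTrainer(config=config_simple)
```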
Please note that this is not a strict comparison. The figures above only show that the entropy of the action distribution behaves differently under different versions of Ray. I want to get some insight into which particular part of the update might affect the entropy dynamics. Thanks!
How many times did you repeat the experiment? Maybe it is due to a bad initialization; that's very common with policy gradient methods. As far as I know, they didn't change anything about PPO or the default trainer settings, which means you are running exactly the same code.
I repeated the experiment 4 times. I don't think this is due to randomness in the weight initialization. Maybe some implicit changes in the training workflow caused this.
I am now grid-searching different versions of Ray using the same config and environment, and hope to figure out which version of Ray I can trust
(though this task is far from my research…)