PPO entropy not decreasing in Ray 1.11.0 as it does in Ray 1.2.0?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi, I am training PPO agents in the MetaDrive environment and find that the training dynamics diverge significantly between ray==1.2.0 and ray==1.11.0.

You can see that the entropy goes to 0 in ray==1.2.0 but gets stuck around 2 in ray==1.11.0.

The PPO hyperparameters are strictly identical in both trials, and the environment is identical in both experiments. Summary (a minimal config sketch follows the list):

  • sgd_minibatch_size: 512
  • train_batch_size: 1600 (since in the MARL env the number of agents varies from 20 to 40, the ACTUAL batch size might range from 20K to 50K)
  • rollout_segment_length: 200
  • entropy_coeff: 0
  • lr: 3e-4
  • num_sgd_iters: 5
  • num_workers: 4
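
For concreteness, here is a minimal sketch of the setup using the Ray 1.x config-dict API. The environment name and its registration are placeholders, the framework choice is an assumption, and the keys use the standard RLlib names (rollout_fragment_length, num_sgd_iter), so treat it as an illustration rather than an exact training script:

```python
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "env": "MyMetaDriveEnv",           # placeholder for the registered MetaDrive MARL env
    "framework": "torch",              # assumption; the framework is not stated above
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "rollout_fragment_length": 200,
    "train_batch_size": 1600,          # the effective size grows with the number of live agents
    "sgd_minibatch_size": 512,
    "num_sgd_iter": 5,
    "lr": 3e-4,
    "entropy_coeff": 0.0,
}

if __name__ == "__main__":
    ray.init()
    tune.run(PPOTrainer, config=config, stop={"timesteps_total": 1_000_000})
```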

I have identified some differences, but according to my experiments they are not the major causes:

  • In MultiGPUTrainOneStep the batch is not shuffled as it was in ray==1.2.0. But my experiment using simple_optimizer yields the same result, so batch shuffling is not the cause (see the sketch after this list).
  • The auto-adjusted rollout_segment_length does not affect my result, since I can confirm that train_batch_size 1600 is divisible by num_workers 4 * envs_per_worker 1 * rollout_segment_length 200 = 800.
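
A sketch of the two checks above, reusing the config dict from the earlier sketch (simple_optimizer is the standard RLlib flag that falls back to the old non-multi-GPU SGD path, which, as far as I understand, shuffles minibatches the way ray==1.2.0 did):

```python
# 1) Fall back to the simple (non-multi-GPU) training path, which shuffles minibatches:
config["simple_optimizer"] = True

# 2) Verify that train_batch_size is a multiple of the samples collected per round,
#    so RLlib never needs to auto-adjust the fragment length:
samples_per_round = (
    config["num_workers"]                 # 4
    * config["num_envs_per_worker"]       # 1
    * config["rollout_fragment_length"]   # 200
)                                         # = 800
assert config["train_batch_size"] % samples_per_round == 0   # 1600 % 800 == 0
```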

Please note that this is not a strict comparison. The figures above only show that the entropy of the action distribution behaves differently under different versions of Ray. I want to get some insight into which particular part of the update might affect the entropy dynamics. Thanks!

Oops, the figure is bad.

This is ray=1.11.0 entropy dynamics:

And this is ray=1.2.0 entropy dynamics:

How many times did you repeat the experiment? Maybe it is because of bad initialization; that's very common with policy gradient methods. As far as I know, they didn't change anything about PPO or the default trainer settings, which means you are running exactly the same code.

The experiment was repeated 4 times. I don't think this is due to randomness in the weight initialization. Maybe some implicit changes in the training workflow caused this.

I am now grid-searching different versions of Ray using the same config and environment, hoping to figure out which version of Ray I can trust
(though this task is far from my research…)
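
Since the exact nesting of the learner stats in the train() result dict has moved around between RLlib releases, one version-agnostic way to compare runs is to simply search the result dict for entries named "entropy". A rough sketch (the example path in the comment is only indicative):

```python
def find_metric(result, name="entropy", prefix=""):
    """Recursively collect all scalar entries called `name` from a nested result dict."""
    hits = {}
    if isinstance(result, dict):
        for key, value in result.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            if key == name and isinstance(value, (int, float)):
                hits[path] = value
            else:
                hits.update(find_metric(value, name, path))
    return hits

# Usage after each training iteration:
#   result = trainer.train()
#   print(find_metric(result))
#   # e.g. {"info/learner/default_policy/learner_stats/entropy": 2.1}
```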

In ray==1.10.0, the entropy also decreases very slowly.

Super interesting! I have tried many Ray versions and find that ray==1.4.0 already yields the strange entropy behavior. Here are the plots:

ray=1.4.0:

ray=1.3.0:

ray=1.2.0:

The conclusion is that a change between ray==1.3.0 and ray==1.4.0 causes the difference. I will dive in to see why.

I have identified that the strange PPO entropy behavior emerged in ray==1.4.0.

Could anyone help identify the possible cause of this? Thanks a lot in advance!

I can report more details here:

In ray=1.3.0:

  • the value loss increases quickly (in the first 15 iterations) to 40
  • the episode length is relatively short, since agents explore and die quickly, and the number of episodes per iteration is 1.5

In ray=1.4.0:

  • the value loss increases slowly to 20
  • the episode length is large, since agents are almost not moving (taking random actions), and the number of episodes per iteration is 1

We all know that "entropy is too high" means the exploration is too strong. This might be because the value function is not learned as rapidly in ray==1.4.0.
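
As a rough back-of-the-envelope check of what these entropy values mean, assume the default diagonal Gaussian action distribution over MetaDrive's 2-D continuous actions (the distribution assumption and the std values below are only illustrative):

```python
import math

def diag_gaussian_entropy(stds):
    # H = sum_i [0.5 * ln(2 * pi * e) + ln(sigma_i)]
    return sum(0.5 * math.log(2 * math.pi * math.e) + math.log(s) for s in stds)

print(diag_gaussian_entropy([1.0, 1.0]))    # ~2.84: roughly a freshly initialized policy (log_std = 0)
print(diag_gaussian_entropy([0.66, 0.66]))  # ~2.0:  about where the entropy gets stuck in ray>=1.4.0
print(diag_gaussian_entropy([0.24, 0.24]))  # ~0.0:  the std must shrink a lot before entropy reaches 0
```

In other words, an entropy stuck around 2 means the per-dimension std has barely moved from its initial value, which is consistent with the agents still taking almost random actions.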

I am not sure which component in the system might affect this.

The issue is still present even in ray==2.2.0.

An independent PPO trainer in a dense multi-agent environment ends up with too high entropy.
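
For completeness, this is roughly what the multi-agent ("independent PPO") part of the setup looks like in RLlib's multiagent config. The policy name, the observation/action shapes, and the choice of a single shared policy (versus one policy per agent) are placeholders here, not necessarily the exact setup used in the runs above:

```python
from gym.spaces import Box

# Placeholder spaces; MetaDrive's real observation space is larger.
obs_space = Box(-1.0, 1.0, shape=(120,))
act_space = Box(-1.0, 1.0, shape=(2,))

config["multiagent"] = {
    # A single shared PPO policy; one policy per agent is the other common choice.
    "policies": {
        "ppo_policy": (None, obs_space, act_space, {}),
    },
    # Map every MetaDrive agent ("agent0", "agent1", ...) to that policy.
    "policy_mapping_fn": lambda agent_id, *args, **kwargs: "ppo_policy",
}
```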