You can see that the entropy goes to 0 in ray=1.2.0 but stays stuck around 2 in ray=1.11.0.
The PPO hyperparameters and the environment are strictly identical in both trials. Summary (see the config sketch after the list):
sgd_minibatch_size: 512
train_batch_size: 1600 (since the number of agents in the MARL env varies from 20 to 40, the ACTUAL batch size might range from 20K to 50K)
rollout_fragment_length: 200
entropy_coeff: 0
lr: 3e-4
num_sgd_iter: 5
num_workers: 4
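Roughly, this corresponds to a trainer config like the following sketch (the env name `MyMultiAgentEnv` and the `framework` choice are placeholders, not part of the original setup):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # ray 1.x API

ray.init()

config = {
    "env": "MyMultiAgentEnv",        # placeholder for the actual (registered) MARL env
    "sgd_minibatch_size": 512,
    "train_batch_size": 1600,
    "rollout_fragment_length": 200,
    "entropy_coeff": 0.0,
    "lr": 3e-4,
    "num_sgd_iter": 5,
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "framework": "torch",            # placeholder; the framework is not stated above
}

trainer = PPOTrainer(config=config)
for i in range(100):
    result = trainer.train()
    # Policy entropy is reported under result["info"]["learner"];
    # the exact nesting of the learner stats differs between 1.2.0 and 1.11.0.
    print(i, result["info"]["learner"])
```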
I have identified some differences, but according to my experiments they are not the major cause:
In MultiGPUTrainOneStep the batch is not shuffled as it was in ray=1.2.0, but my experiment using simple_optimizer yields the same result, so the batch shuffling is not the cause (see the sketch after this list).
The auto-adjustment of rollout_fragment_length does not affect my result, since train_batch_size 1600 is divisible by num_workers 4 * num_envs_per_worker 1 * rollout_fragment_length 200 = 800.
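For concreteness, a minimal sketch of these two checks, reusing `config` and `PPOTrainer` from the sketch above (`simple_optimizer` is the stock RLlib flag that switches back to the non-multi-GPU training path):

```python
# Divisibility check: 1600 % (4 * 1 * 200) == 0, so RLlib should not
# auto-adjust rollout_fragment_length here.
assert 1600 % (4 * 1 * 200) == 0

# A/B test of the training path in ray=1.11.0: identical hyperparameters,
# but with simple_optimizer=True so MultiGPUTrainOneStep (and its unshuffled
# minibatching) is bypassed. `config` and `PPOTrainer` are from the sketch above.
config_simple = dict(config, simple_optimizer=True)
trainer_simple = PPOTrainer(config=config_simple)
```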
Please note that this is not a strict comparison. The figures above only show that the entropy of the action distribution behaves differently under different versions of Ray. I want to get some insight into which particular part of the update might affect the entropy dynamics. Thanks!
How many times did you repeat the experiment? Maybe it is due to a bad initialization; that's very common with policy gradient methods. As far as I know, they didn't change anything about PPO or the default trainer settings, which means you are running exactly the same code.
I repeated the experiment 4 times. I don't think this is due to randomness in the weight initialization. Maybe some implicit changes in the training workflow caused this.
I am now grid-searching different versions of Ray using the same config and environment, and hope to figure out which version of Ray I can trust
(though this task is far from my research…)