I recently upgraded Ray to 1.8.0 (I was previously on 1.2.0) and noticed a mismatch in PPO training results between the two versions. Are there any known changes that could cause this?
I binary-searched the versions, and the mismatch first appears when upgrading from 1.5.2 to 1.6.0.
To reproduce it, I initialize a network and run one iteration of PPO updates. With the same seed, the initialization is identical in 1.5.2 and 1.6.0, but the L2 norms of the network weights differ after that one iteration.
I am running Ray locally for this, not on a cluster (there is only one worker).
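For reference, this is roughly how I compare the weight norms. The `weight_l2_norm` helper is my own; the commented-out part sketches the pre-2.0 RLlib `PPOTrainer` API (run it once under each Ray version and compare the printed norms):

```python
import numpy as np

def weight_l2_norm(weights):
    """Global L2 norm over a dict of numpy weight arrays,
    like the dict returned by policy.get_weights() in RLlib."""
    return float(np.sqrt(sum(np.sum(np.square(w)) for w in weights.values())))

# Repro sketch under Ray 1.x (single local worker, fixed seed):
#
#   import ray
#   from ray.rllib.agents.ppo import PPOTrainer
#   ray.init()
#   trainer = PPOTrainer(env="CartPole-v0",
#                        config={"seed": 0, "num_workers": 0})
#   print("init norm:", weight_l2_norm(trainer.get_policy().get_weights()))
#   trainer.train()  # one iteration of PPO updates
#   print("post-iter norm:", weight_l2_norm(trainer.get_policy().get_weights()))
```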
Thank you so much.