Hi, I was told that when using the PyTorch backend for PPO on a GPU, the training batch is copied to the GPU multiple times (once per minibatch update), thereby slowing down training. I just wanted to know whether the latest version of RLlib has fixed the issue.
In PPO, the minibatch passes over the training data happen here: ray/train_ops.py at 3e010c5760c99be5a9940001f33db087c52eb8e7 · ray-project/ray · GitHub
Based on the code, it looks like the batch is already loaded onto the GPU.
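For reference, the fixed behavior boils down to moving the whole training batch to the GPU once and then slicing minibatches out of the device-resident tensors, instead of copying each minibatch from host memory on every SGD iteration. Here is a minimal PyTorch sketch of that pattern; the function name, batch format, and loss are illustrative assumptions, not RLlib's actual implementation:

```python
import torch


def train_minibatches(model, optimizer, batch, num_sgd_iter=3, minibatch_size=32):
    """Run several SGD passes over one training batch.

    The full batch is moved to the model's device ONCE up front
    (hypothetical helper, not RLlib code); each minibatch is then
    just an index into the already device-resident tensors, so no
    extra host-to-device copies happen inside the loop.
    """
    device = next(model.parameters()).device
    obs = batch["obs"].to(device)          # single copy to device
    targets = batch["targets"].to(device)  # single copy to device

    n = obs.shape[0]
    losses = []
    for _ in range(num_sgd_iter):
        # Reshuffle minibatch order each pass; indices stay on-device.
        perm = torch.randperm(n, device=device)
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(obs[idx]), targets[idx])
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
    return losses
```

The same loop with `batch["obs"][idx].to(device)` inside the inner loop would instead trigger one host-to-device transfer per minibatch per SGD iteration, which is the overhead the original question is about.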
Thank you, the latest version of RLlib does seem to have fixed that problem.
Hey @psxz , yes, we fixed that problem in the latest master. All torch algos now use the unified multi-GPU exec. op (which was previously only available for tf).
This change will be included in the upcoming Ray 1.6 release.
Because of this change, we measured a speed increase of roughly 35% for PPO + 1 GPU on Atari.
Thank you, that will be quite helpful.