PPO training takes twice as long on GPU as on CPU

Hello everyone,

I am running the cartpole-ppo example on Ray 1.12.1 and PyTorch 1.11.0. I observed significant differences in learn throughput between GPU and CPU. I experimented with different sgd_minibatch_size values, with and without a GPU, keeping train_batch_size at 4000 and num_workers at 1. The results are as follows:

Mini-batch size   Learn throughput (num_gpus=1)   Learn throughput (num_gpus=0)
64                1310                            3458
128               2867                            70007
256               6159                            13731
512               12952                           27711
1000              21824                           43353

As a result, training without a GPU is faster than training with 1 GPU.

I observed similar behavior in my own project, where I use a PPO agent with a custom model and a custom environment on Ray 1.4.1 with PyTorch 1.7.0.

Can someone please help identify why training on GPU is slower? What can I do to reduce my training time?

Any help is appreciated!
Thanks in advance!

There is a certain amount of overhead that comes with using a GPU. For example, tensors need to be copied to and from the GPU.
Therefore, the forward and backward computation needs to be significant enough before it's worth the overhead.
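To make the trade-off concrete, here is a minimal sketch of a toy cost model, assuming the GPU pays a fixed per-step overhead and then runs the compute some constant factor faster. The function names and all numbers are hypothetical and purely illustrative, not measurements from RLlib.

```python
def gpu_step_time(cpu_compute_s, overhead_s, speedup):
    """Toy model: fixed transfer/launch overhead plus accelerated compute."""
    return overhead_s + cpu_compute_s / speedup


def break_even_compute(overhead_s, speedup):
    """CPU compute time per step above which the GPU wins in this toy model."""
    if speedup <= 1:
        raise ValueError("speedup must be > 1 for the GPU to ever win")
    return overhead_s / (1.0 - 1.0 / speedup)


# Assumed values, chosen only for illustration.
OVERHEAD_S = 0.002  # fixed per-step cost of moving data and launching kernels
SPEEDUP = 10.0      # raw compute speedup of GPU over CPU

threshold = break_even_compute(OVERHEAD_S, SPEEDUP)
print(f"break-even CPU compute per step: {threshold:.6f}s")

# A tiny network (little compute) versus a large one (lots of compute).
for cpu_compute in (0.0005, 0.05):
    gpu = gpu_step_time(cpu_compute, OVERHEAD_S, SPEEDUP)
    winner = "GPU" if gpu < cpu_compute else "CPU"
    print(f"cpu={cpu_compute:.4f}s gpu={gpu:.4f}s -> {winner} is faster")
```

In this model, a step that takes less CPU time than the break-even threshold is always slower on the GPU, no matter how large the raw speedup is.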
I can reproduce the slowdown with the default network of a single hidden layer of size 32.
If you simply change the hidden layers to [1024, 1024, 1024], for example, you will notice that the GPU is 2x to 4x faster than the CPU.
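A rough sketch of that change, written as a plain config dict so it runs without Ray installed. The keys (num_gpus, num_workers, train_batch_size, sgd_minibatch_size, model.fcnet_hiddens) are standard RLlib options. The helper names, the env name CartPole-v0, and the minibatch size of 128 are assumptions for illustration. The parameter count covers a single fully connected network with CartPole's 4 observation inputs and 2 action outputs, and ignores any separate value branch.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim):
    """Count weights and biases of a fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def make_ppo_config(hidden_sizes, num_gpus):
    """Hypothetical helper that builds an RLlib-style PPO config dict."""
    return {
        "env": "CartPole-v0",       # assumed env name
        "framework": "torch",
        "num_workers": 1,
        "num_gpus": num_gpus,
        "train_batch_size": 4000,
        "sgd_minibatch_size": 128,  # one of the sizes from the table above
        "model": {"fcnet_hiddens": list(hidden_sizes)},
    }


small = [32]
large = [1024, 1024, 1024]
print("params, small net:", mlp_param_count(4, small, 2))
print("params, large net:", mlp_param_count(4, large, 2))

config = make_ppo_config(large, num_gpus=1)
print(config["model"])

# With Ray 1.x installed, the dict could be passed to the trainer, e.g.:
#   from ray.rllib.agents.ppo import PPOTrainer
#   trainer = PPOTrainer(config=config)
#   result = trainer.train()
```

The larger network has several orders of magnitude more parameters, which is what gives the GPU enough work per step to amortize its overhead.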
Hope this helps.

I agree with @gjoliver. RL is also quite CPU-intensive. In the CartPole example, the neural network is not where most of the time goes. By contrast, in image-based RL examples such as Atari games, where you might need a bigger network like a CNN, the GPU is certain to be faster.