RLlib IMPALA multi-GPU performance

Hello! I have a question about the multi-GPU training performance of RLlib. I appreciate your answers in advance!

RLlib IMPALA supports multi-GPU training. I trained with the config provided in the tuned example pong-impala-fast.yaml and got the expected training throughput (around 33k transitions per second). However, when I doubled the resources to 256 workers and 4 GPUs and changed nothing else, training throughput only reached ~35k, whereas I expected it to roughly double (~66k), since sampling and training in IMPALA are completely asynchronous.
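For concreteness, here is a minimal sketch of the override I mean, assuming the stock pong-impala-fast.yaml from RLlib's tuned_examples as the base (the exact env name and baseline values may differ slightly by Ray version); everything else was left at the values shipped in that file:

```yaml
# Sketch only: scaled-up run on top of the tuned example
# rllib/tuned_examples/impala/pong-impala-fast.yaml.
# Only the resource settings were changed; everything else kept as shipped.
pong-impala-fast-4gpu:
    env: PongNoFrameskip-v4    # env used by the tuned Pong example
    run: IMPALA
    config:
        num_workers: 256       # doubled from the example's 128
        num_gpus: 4            # doubled from the example's 2
```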

What’s the problem here? Are there any settings that could be tuned to get a better result? Do we have a standard multi-GPU training benchmark for RLlib algorithms that scales up to 8 GPUs?

What framework do you use? Does the same happen with APPO?

I tried both PyTorch and TensorFlow, and the results were close; neither scales with multiple GPUs. And yes, the same happens with APPO. I thought RLlib reused the IMPALA implementation for APPO. Is there any significant difference?

In RLlib, APPO adds a target network and a KL loss on top of IMPALA (see the sketch below).
We do have a multi-GPU (2 GPU) release test for APPO.
RLlib is undergoing major changes around multi-GPU training, though.
I’m sure @avnishn has more to say about this.
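As a rough illustration (this is not the actual release-test file, and key names may vary a bit across Ray versions), the APPO-specific knobs on top of the shared IMPALA setup look something like this in tuned-example YAML form:

```yaml
# Sketch only: APPO-specific keys added on top of an IMPALA-style config.
# Key names follow the old-style config dict and may differ between Ray versions.
pong-appo-sketch:
    env: PongNoFrameskip-v4
    run: APPO
    config:
        vtrace: true          # same asynchronous V-trace learner as IMPALA
        use_kl_loss: true     # KL penalty against the target network
        kl_coeff: 1.0
        kl_target: 0.01
        num_workers: 256
        num_gpus: 4
```

Everything else (the asynchronous sampling and the multi-GPU learner path) is shared with IMPALA.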