Hello! I have a question about multi-GPU training performance in RLlib. Thanks in advance for your answers!
RLlib's IMPALA supports multi-GPU training. I trained with the config provided in the tuned example pong-impala-fast.yaml and got the expected training throughput (around 33k transitions per second). However, when I doubled the resources to 256 workers and 4 GPUs and changed nothing else, the training throughput only reached ~35k. I expected it to roughly double (to ~66k), since sampling and training in IMPALA are completely asynchronous.
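For concreteness, here is a minimal sketch of the override I mean. The `num_workers` and `num_gpus` keys are standard RLlib trainer config options; the baseline values of 128 workers / 2 GPUs and the `PongNoFrameskip-v4` env are my assumptions about what the tuned example uses, so treat the exact numbers as illustrative.

```yaml
# Sketch of the scaled-up run, derived from pong-impala-fast.yaml.
# Baseline values are assumed; only the resource keys were changed.
pong-impala-scaled:
    env: PongNoFrameskip-v4
    run: IMPALA
    config:
        num_workers: 256   # doubled (assumed baseline: 128)
        num_gpus: 4        # doubled (assumed baseline: 2)
```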
What's the problem here? Are there any settings that could be tuned to get a better result? Is there a standard multi-GPU training benchmark for RLlib algorithms that scales up to 8 GPUs?