[RLlib] Ray trains extremely slow when learner queue is full

In the IMPALA trainer, when the learner queue is full (I am using an expensive model, so GPU training is slow), the algorithm seems to move training to the CPU, making training unbearably slow. The desired behavior would be for the CPUs to go idle while the GPU catches up. The effect can be postponed by setting learner_queue_size to a higher value, but that only delays the problem: the queue eventually fills up and training becomes incredibly slow. Is there a config variable I am missing that stops this behavior?
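For reference, this is roughly how I am configuring things (a minimal sketch; the env, worker count, and queue size here are placeholders rather than my real setup):

```python
import ray
from ray.rllib.agents.impala import ImpalaTrainer

ray.init()

trainer = ImpalaTrainer(
    env="CartPole-v0",  # placeholder env; my real setup uses an expensive custom model
    config={
        "framework": "torch",
        "num_gpus": 1,
        "num_workers": 4,
        # Raising this only postpones the slowdown: once the queue is full,
        # learner grad time spikes and everything crawls.
        "learner_queue_size": 16,
    },
)

for _ in range(10):
    results = trainer.train()
```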

EDIT - despite training being extremely slow (the forward pass is roughly 100x slower), all tensors still seem to be on the GPU, making the issue even more perplexing.
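(For what it's worth, the check was nothing fancy, just a quick device print inside the custom model's forward pass, along these lines:)

```python
# Quick sanity check inside the model's forward() -- where do the batch
# and the weights actually live? In my case both report cuda:0.
print(input_dict["obs"].device, next(self.parameters()).device)
```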


EDIT - the fix below is not a fix. Setting the learner queue size to 0 just makes it effectively infinite, so the learner ends up training on data that is far off-policy, and memory usage keeps growing until an out-of-memory error occurs.

For anyone else experiencing extremely slow training when the learner queue fills up: setting its size to 0 seems to completely fix the issue.

Interesting, I’m not sure why RLlib would move training to the CPU. I don’t think that can actually happen, since the trained policy is created once on the local worker and is therefore placed only once on the GPU (or on the CPU if no GPU is available).
Are you on tf or torch?
Is there an easy setup that would reproduce this issue? I guess we could use a custom model that’s fast for inference (call), but artificially slow on train batches (from_batch).

General advice (you have probably tried all of this already): have you tried multi-GPU (tf only), or decreasing num_envs_per_worker or num_workers?
IMPALA is all about finding the right balance between the rate of incoming data and consumer (learner) speed.
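Just to be concrete about which knobs I mean (the values here are illustrative, not recommendations):

```python
config = {
    "framework": "torch",
    # Produce data more slowly so the learner has a chance to keep up
    # and the learner queue does not saturate:
    "num_workers": 2,
    "num_envs_per_worker": 1,
    # Speeding up the consumer side is the other lever; the multi-GPU
    # learner is tf-only at the moment:
    # "num_gpus": 2,
}
```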

Thanks for the reply. I am using torch and would rather not switch, since I am using a custom model written in PyTorch and, in other parts of the project, a custom trainer also written for PyTorch. I suspect what you described will work for reproducing it. I have tried a number of ways to work around the issue, and it seems that whenever experience is produced faster than the learner can consume it and the learner queue fills up, training becomes extremely slow.

I have played with num_envs_per_worker and num_workers, and setting these low enough makes things work fine, but then experience is generated quite slowly, most likely much slower than if the learner queue were functioning correctly and could hold ready-to-go data without training slowing to a crawl when it fills up.

Ok, I would like to debug this. Do you think it is reproducible with e.g. CartPole and a very simple model on the GPU, by making the action-compute pass much faster than the batched train pass (for example by inserting an artificial sleep)? I guess one would set the queue to something large (e.g. 10k), do a few initial train steps to check that training is fast while the queue is not yet full, and then check the training times again after the queue has reached its capacity limit.

Yes, I think this should work. In a custom model, assuming you do not have batched inference set up, you can add a sleep that only activates when the batch size is bigger than one. The queue should not have to be very large (since I believe its size is measured in rollout fragments) if the sleep is set to a reasonable amount. As long as the learner queue mean is increasing in TensorBoard, it should max out, and at that point learner grad time should spike.
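A rough sketch of what I mean (the class name and the 0.5 s sleep are made up for illustration, and it assumes each rollout worker runs a single env so inference stays at batch size 1):

```python
import time

import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class SlowTrainModel(TorchModelV2, nn.Module):
    """Wraps the default FC net but sleeps whenever the batch is > 1.

    With one env per rollout worker (no batched inference), only the
    learner's train batches trigger the sleep, so sampling stays fast
    while the learner becomes the artificial bottleneck.
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.base = FullyConnectedNetwork(obs_space, action_space, num_outputs,
                                          model_config, name + "_base")

    def forward(self, input_dict, state, seq_lens):
        if input_dict["obs"].shape[0] > 1:
            time.sleep(0.5)  # arbitrary "expensive" train pass
        return self.base.forward(input_dict, state, seq_lens)

    def value_function(self):
        return self.base.value_function()


ModelCatalog.register_custom_model("slow_train_model", SlowTrainModel)
```

Then point the IMPALA config at it via `"model": {"custom_model": "slow_train_model"}` and watch the learner queue mean in TensorBoard until it hits the cap.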

I also think I’m running into this bug and would like to know if there is any fix or workaround.

Other things I have noticed:

The “sample_throughput” metric skyrockets; I get values of something like 600k samples per second.

Graphs in TensorBoard show the point at which training becomes 100x slower.

Cross-reference: Very slow gradient descent on remote workers