How to set "train_batch_size" appropriately?

I have a multi-agent setup with a custom environment and model. I am using a single GPU for learning and multiple rollout workers. As I increase `train_batch_size`, `num_agent_steps_trained` also increases, but beyond a certain threshold training starts failing with a "worker died" error. Looking at the debug logs, I suspect the SGD epochs are taking so long that the worker process is being killed.
I want to know the correct way to set `train_batch_size` so that training fully utilizes my GPU and achieves maximum throughput without the worker process failing. Also, is there a timeout for the worker process above which it dies? If so, how can it be increased so that `train_batch_size` can also be increased?

Hi @Siddharth_Jain ,

Increasing `train_batch_size` should indeed result in a higher `num_agent_steps_trained`, since each batch is now larger and contains more agent steps. That the worker dies when you keep increasing the batch size is probably due to the memory size of your GPU: at some point it simply fills up and the process crashes. Compare your GPU memory size against the length of each sample in the batch as well as the batch size, and check whether these sizes fit together.
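As a rough sanity check, you can verify up front that your batch-related settings are mutually consistent before launching training. The sketch below uses config key names from RLlib's classic dict-based API (e.g. for PPO: `train_batch_size`, `sgd_minibatch_size`, `num_sgd_iter`, `rollout_fragment_length`); the specific numbers are placeholder assumptions, not recommendations:

```python
# Hedged sketch: checking that batch settings line up before training.
# Key names follow RLlib's classic dict-based PPO config; values are
# illustrative placeholders you would tune for your own hardware.
config = {
    "num_workers": 4,                # rollout worker processes
    "num_gpus": 1,                   # single learner GPU
    "rollout_fragment_length": 200,  # env steps collected per worker per round
    "train_batch_size": 4000,        # total steps per training iteration
    "sgd_minibatch_size": 128,       # chunk that must fit in GPU memory
    "num_sgd_iter": 10,              # SGD epochs over the train batch
}

# The train batch is assembled from worker fragments, so it is cleanest
# if it is a multiple of (num_workers * rollout_fragment_length).
samples_per_round = config["num_workers"] * config["rollout_fragment_length"]
assert config["train_batch_size"] % samples_per_round == 0, (
    "train_batch_size is not a multiple of the per-round sample count"
)

# The SGD minibatch, not the full train batch, is what sits on the GPU
# during a backward pass, so it must be no larger than the train batch.
assert config["sgd_minibatch_size"] <= config["train_batch_size"]

print("batch settings are consistent")
```

If the "worker died" errors appear only at large batch sizes, it is usually cheaper to grow `train_batch_size` while keeping `sgd_minibatch_size` fixed at whatever already fits on the GPU, since the minibatch size (times per-step observation size) is what drives peak GPU memory during SGD.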