I am using A3C, and from my reading of the documentation I thought the GPU would process batches of size train_batch_size on each training step. But this does not appear to be the case, because when I increase train_batch_size, GPU memory use does not increase. I suspect no batching is happening at all, since training with a GPU is mildly slower than without one, even though I am using a conv net.
Furthermore, when I use PPO, I can set sgd_minibatch_size, which is clearly working. As I increase sgd_minibatch_size, I can see that more GPU memory is used, and the training times decrease.
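For reference, this is the kind of PPO config I mean. A minimal sketch; the parameter names are from RLlib's config dict, but the values are made up for illustration:

```python
# Illustrative RLlib-style PPO config (values are arbitrary examples).
# train_batch_size samples are collected per iteration, then split into
# SGD minibatches of sgd_minibatch_size, which is what lands on the GPU.
ppo_config = {
    "num_workers": 4,           # CPU rollout workers
    "num_gpus": 1,              # GPU for the local trainer
    "train_batch_size": 4000,   # samples collected per training iteration
    "sgd_minibatch_size": 512,  # per-SGD-step batch; raising this raises GPU memory use
}
```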
Could someone please explain to me whether I am wrong, or if not, how I can set the batch size for a training step when using A3C? Thank you.
Update: dropping down to A2C and setting microbatch_size also uses my GPU. But I can see no similar parameter that I can use for A3C.
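For concreteness, the A2C setup that does use my GPU looks roughly like this. A sketch with illustrative values; my understanding is that gradients are computed on microbatches and accumulated until train_batch_size samples have been processed, then a single optimizer step is applied:

```python
# Illustrative A2C config sketch: microbatch_size controls the batch that
# actually goes through the GPU; gradients accumulate up to train_batch_size.
a2c_config = {
    "num_workers": 2,
    "num_gpus": 1,
    "train_batch_size": 1000,
    "microbatch_size": 100,  # should evenly divide train_batch_size
}
```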
This does not directly address your question, but I thought you might be interested anyway. I usually use A2C when I want synchronous training and IMPALA when I want asynchronous training. I first started doing this because of the blog post OpenAI Baselines: ACKTR & A2C, where they comment that:
Our synchronous A2C implementation performs better than our asynchronous implementations — we have not seen any evidence that the noise introduced by asynchrony provides any performance benefit. This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.
P.S. You may need to set num_gpus_per_worker to a fractional value like num_gpus / num_workers to use the GPUs during training. This is just a guess on my part.
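In config terms, that guess would look something like this (a sketch, assuming num_gpus_per_worker splits a GPU across workers the way I think it does):

```python
num_gpus = 1
num_workers = 4

# Split the single GPU fractionally across the rollout workers
# (again, just a guess that this is what enables GPU use).
worker_gpu_config = {
    "num_workers": num_workers,
    "num_gpus_per_worker": num_gpus / num_workers,  # 0.25 of a GPU each
}
```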
One detail I am not sure about is whether each worker will collect train_batch_size samples or train_batch_size // num_workers samples. @sven1977 should know.
Thank you, that is useful. I have 1 GPU, so perhaps A2C is the way to go. I thought IMPALA was for multi-task learning, though, not for single-task?
Regarding your Qs: each worker will collect approximately train_batch_size / num_workers samples. I should not need to use fractional GPUs, because I do not want the rollout workers to use the GPU; I only want the local Trainer to use the GPU when performing a training step.
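Concretely, the resource split I am after looks like this (a sketch in RLlib's config style; the whole GPU goes to the local trainer and the rollout workers stay on CPU):

```python
# Give the GPU to the local trainer for gradient computation only;
# rollout workers collect samples on CPU.
trainer_only_gpu_config = {
    "num_workers": 4,
    "num_gpus": 1,             # local trainer uses the GPU
    "num_gpus_per_worker": 0,  # rollout workers stay on CPU
}
```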
Oh wow, that explains a lot then. I expect that if I want to use a GPU for computing gradients, I will probably need to give the rollout workers a fractional GPU when there is no local trainer. But anyway, you have talked me into using A2C.