Batch sizes on GPU

Muff2n · July 26, 2022, 10:11am

Low: It annoys or frustrates me for a moment.

I am using A3C, and from my reading of the documentation, I thought the GPU would be passed batches of size train_batch_size at each training step. This does not appear to be the case, because GPU memory use does not increase when I increase train_batch_size. I suspect that no batching is happening, because training with a GPU is slightly slower than training without one, even though I am using a conv net.

Furthermore, when I use PPO, I can set sgd_minibatch_size, which is clearly working. As I increase sgd_minibatch_size, I can see that more GPU memory is used, and the training times decrease.

Could someone please explain to me whether I am wrong, and if not, how I can set the batch size for a training step when using A3C? Thank you.

Update: dropping down to A2C and setting microbatch_size also uses my GPU. But I can see no similar parameter that I can use for A3C.
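For reference, here is a minimal sketch of the two settings that did affect GPU use for me, written as plain Python dicts. The keys train_batch_size, sgd_minibatch_size and microbatch_size are the ones named above; num_gpus is the usual RLlib key for the learner's GPU, and every numeric value is an illustrative assumption, not a recommendation. The helper chunks_per_train_batch is hypothetical and only does the arithmetic; how each key is actually consumed should be checked against the docs for your RLlib version.

```python
# Illustrative RLlib-style config dicts. All numeric values are made-up examples.
ppo_config = {
    "num_gpus": 1,              # assumption: one GPU for the learner process
    "train_batch_size": 4000,   # samples gathered per training iteration
    "sgd_minibatch_size": 512,  # PPO: size of each SGD minibatch
}

a2c_config = {
    "num_gpus": 1,              # assumption: one GPU for the learner process
    "train_batch_size": 2000,   # samples gathered per training iteration
    "microbatch_size": 250,     # A2C: size of each microbatch
}


def chunks_per_train_batch(train_batch_size: int, chunk_size: int) -> int:
    """Hypothetical helper: chunks of `chunk_size` needed to cover one train batch."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division without floating point.
    return -(-train_batch_size // chunk_size)


ppo_chunks = chunks_per_train_batch(
    ppo_config["train_batch_size"], ppo_config["sgd_minibatch_size"]
)
a2c_chunks = chunks_per_train_batch(
    a2c_config["train_batch_size"], a2c_config["microbatch_size"]
)
```

The point of the sketch is that in both cases the tensor sent to the GPU is sized by the smaller key, which would explain why changing train_batch_size alone does not move GPU memory use.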

mannyv · July 27, 2022, 12:31pm

@Muff2n,

This does not directly address your question, but I thought you might be interested anyway. I usually use A2C when I want synchronous training and IMPALA when I want asynchronous training. I started doing this because of the OpenAI blog post "OpenAI Baselines: ACKTR & A2C", where they comment that:

Our synchronous A2C implementation performs better than our asynchronous implementations — we have not seen any evidence that the noise introduced by asynchrony provides any performance benefit. This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.

P.S. You may need to set num_gpus_per_worker to a fractional value like num_gpus / num_workers to use the GPUs during training. This is just a guess on my part.

One detail I am not sure about is whether each worker will collect train_batch_size samples or train_batch_size // num_workers samples. @sven1977 should know.

Muff2n · July 27, 2022, 1:02pm

Thank you, that is useful. I have 1 GPU, so perhaps A2C is the way to go. I thought IMPALA was for multi-task learning, however, not for a single task?

Regarding your questions: each worker will collect approximately train_batch_size / num_workers samples. I will not need fractional GPUs, because I do not want the rollout workers to use the GPU; I only want the local Trainer to use it when performing a training step.

mannyv · July 27, 2022, 1:03pm

@Muff2n, when you use A3C, no local trainer computes gradients on a full train batch. Each rollout worker computes gradients on the samples it collects and returns them; they are then applied one by one to the local worker's model, and that worker's weights are updated afterwards.

Muff2n · July 27, 2022, 1:09pm

Oh wow. That explains a lot then. I expect that if I want to use a GPU for computing gradients I will probably need to give the rollout workers a fractional GPU if there is no local trainer. But anyway, you have talked me into using A2C.
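As a rough sketch of the arithmetic in this exchange: with one GPU shared evenly among the rollout workers, each worker's share is num_gpus / num_workers, and each worker collects roughly train_batch_size / num_workers samples. The two helper functions below are hypothetical (they are not part of RLlib), and the numbers are illustrative assumptions; whether A3C workers actually use a fractional GPU this way is untested here.

```python
def gpu_fraction_per_worker(num_gpus: float, num_workers: int) -> float:
    """Hypothetical helper: even share of the available GPUs per rollout worker."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return num_gpus / num_workers


def approx_samples_per_worker(train_batch_size: int, num_workers: int) -> int:
    """Hypothetical helper: approximate per-worker share of one train batch."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return train_batch_size // num_workers


num_workers = 4  # illustrative value

# Config fragment using the key names from the thread; values are assumptions.
a3c_config = {
    "num_workers": num_workers,
    "num_gpus_per_worker": gpu_fraction_per_worker(1, num_workers),
}

samples_each = approx_samples_per_worker(4000, num_workers)
```

With 4 workers and 1 GPU this gives each worker a quarter of the GPU, so the per-worker fractions sum to exactly one device.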

mannyv · July 27, 2022, 1:13pm

That behavior is defined here, if you are curious.

github.com

ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/rllib/agents/a3c/a3c.py#L129-L168


      
    # Loop through all fetched worker-computed gradients (if any)
    # and apply them - one by one - to the local worker's model.
    # After each apply step (one step per worker that returned some gradients),
    # update that particular worker's weights.
    global_vars = None
    learner_info_builder = LearnerInfoBuilder(num_devices=1)
    for worker, results in async_results.items():
        for result in results:
            # Apply gradients to local worker.
            with self._timers[APPLY_GRADS_TIMER]:
                local_worker.apply_gradients(result["grads"])
            self._timers[APPLY_GRADS_TIMER].push_units_processed(
                result["agent_steps"]
            )

            # Update all step counters.
            self._counters[NUM_AGENT_STEPS_SAMPLED] += result["agent_steps"]
            self._counters[NUM_ENV_STEPS_SAMPLED] += result["env_steps"]
            self._counters[NUM_AGENT_STEPS_TRAINED] += result["agent_steps"]
            self._counters[NUM_ENV_STEPS_TRAINED] += result["env_steps"]

This file has been truncated.
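To make the control flow in that excerpt easier to follow, here is a toy, framework-free sketch of the same idea: gradients returned by workers are applied one by one to a local copy of the weights, and each contributing worker then receives the current weights. Everything here (function names, plain lists as weights, a fixed learning rate, syncing once per worker) is an assumption for illustration and is not the RLlib implementation.

```python
from typing import Dict, List


def apply_gradients(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD step on a list of weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


def apply_async_results(
    local_weights: List[float],
    async_results: Dict[str, List[List[float]]],
    worker_weights: Dict[str, List[float]],
    lr: float = 0.5,
) -> List[float]:
    """Apply each worker's returned gradients to the local weights, one by one.

    After a worker's gradients have been applied, only that worker is sent
    the current local weights (assumption: one sync per contributing worker).
    """
    for worker_id, results in async_results.items():
        for grads in results:
            local_weights = apply_gradients(local_weights, grads, lr)
        # Sync the updated weights back to this particular worker only.
        worker_weights[worker_id] = list(local_weights)
    return local_weights


worker_weights = {"w1": [1.0, 1.0], "w2": [1.0, 1.0]}
async_results = {"w1": [[1.0, 0.0]], "w2": [[0.0, 2.0]]}
final_weights = apply_async_results([1.0, 1.0], async_results, worker_weights)
```

Note that after this call worker w1 holds weights that are already one update behind the local copy, which is the kind of staleness the asynchronous scheme accepts. Each gradient is also computed from a single worker's samples, so train_batch_size never reaches the GPU as one large batch.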

Happy to help
