Minibatch for APPO


PPO (and DDPPO) has the option sgd_minibatch_size, which splits train_batch_size into smaller chunks for each SGD iteration. However, APPO doesn’t have this option, only minibatch_buffer_size.

So how does APPO do minibatching, and what controls the size of the minibatches?


From IMPALA (which should be the same for APPO):

    # == Overview of data flow in IMPALA ==
    # 1. Policy evaluation in parallel across `num_workers` actors produces
    #    batches of size `rollout_fragment_length * num_envs_per_worker`.
    # 2. If enabled, the replay buffer stores and produces batches of size
    #    `rollout_fragment_length * num_envs_per_worker`.
    # 3. If enabled, the minibatch ring buffer stores and replays batches of
    #    size `train_batch_size` up to `num_sgd_iter` times per batch.
    # 4. The learner thread executes data parallel SGD across `num_gpus` GPUs
    #    on batches of size `train_batch_size`.

So, assuming the replay and minibatch ring buffers are both enabled, train_batch_size in PPO roughly corresponds to rollout_fragment_length * num_envs_per_worker in APPO, and sgd_minibatch_size in PPO roughly corresponds to train_batch_size in APPO.
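
To make the sizes in the data-flow overview concrete, here is a minimal arithmetic sketch. The config values are made-up examples (not defaults from any RLlib release), and the helper names are hypothetical. It also assumes that worker batches are simply concatenated until a train batch is full, which the overview does not state explicitly.

```python
# Hypothetical example values, chosen only for illustration.
config = {
    "num_workers": 4,
    "num_envs_per_worker": 5,
    "rollout_fragment_length": 50,
    "train_batch_size": 500,
}


def sample_batch_size(cfg):
    """Timesteps in one batch produced by a single rollout worker (step 1)."""
    return cfg["rollout_fragment_length"] * cfg["num_envs_per_worker"]


def fragments_per_train_batch(cfg):
    """Worker batches needed to fill one train batch (ceiling division).

    Assumption: worker batches are concatenated until the train batch is full.
    """
    per_worker = sample_batch_size(cfg)
    return -(-cfg["train_batch_size"] // per_worker)


print(sample_batch_size(config))          # 250
print(fragments_per_train_batch(config))  # 2
```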

But the role of minibatch_buffer_size is still not fully clear to me.
Also, the config comment states:

# number of passes to make over each train batch
"num_sgd_iter": 1

But in practice, if num_sgd_iter=3, it samples 3 random batches (each of size train_batch_size) from the minibatch ring buffer instead of making 3 passes over each batch. Is that correct?

Hey @vakker00, great question! Agreed, this is a little confusing and should be cleaned up!

APPO is an asynchronous version of PPO, a kind of hybrid between IMPALA and PPO. In these asynchronous algos, sampling happens on n rollout workers in parallel (just like in most other algos), but the collected data is then sent right away (asynchronously) to a queue. In PPO, by contrast, the collected data is gathered synchronously into one large train batch.
The APPO learner (local worker) then takes whatever arrives in the queue and performs an update. In this async setting, it would not make sense to perform n SGD iters on each train batch. The point here is to maximize continuous (async) throughput.
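
The queue-based flow described above can be sketched with plain Python threads. This is only an illustration of the pattern, not RLlib code: the functions rollout_worker and run_async are hypothetical, and the "fragments" are just (worker_id, index) tuples standing in for sampled data.

```python
import queue
import threading


def rollout_worker(worker_id, out_queue, num_fragments):
    # Stand-in for environment sampling: push each fragment as soon as it is ready.
    for i in range(num_fragments):
        out_queue.put((worker_id, i))


def run_async(num_workers=3, num_fragments=4):
    q = queue.Queue()
    threads = [
        threading.Thread(target=rollout_worker, args=(w, q, num_fragments))
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()

    updates = []
    total = num_workers * num_fragments
    # Learner loop: consume fragments in arrival order, one "update" each,
    # without waiting for all workers to finish a round first.
    while len(updates) < total:
        updates.append(q.get(timeout=5))

    for t in threads:
        t.join()
    return updates


updates = run_async()
print(len(updates))  # 12
```

The arrival order depends on thread scheduling, which is the point: the learner never blocks on a synchronous barrier across workers.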

To answer your question: APPO does not do minibatching, and minibatch_buffer_size should always be 1 to reflect that. It should also have the same value as num_sgd_iter, as these two settings are the same. I remember there is an error in IMPALA when you set num_sgd_iter > 1, and I think it’s because of this. Again, we should fix this and make it clearer in the comments.

Thanks for the clarification. Just a quick follow-up question, as I have been digging more into APPO recently.

The tuned pong-appo example has minibatch_buffer_size: 4 and num_sgd_iter: 2. Is that intentional? You mentioned that both should be 1.

Also, the doc for IMPALA suggests that you can actually use larger num_sgd_iter settings:

    # How many train batches should be retained for minibatching. This conf
    # only has an effect if `num_sgd_iter > 1`.
    "minibatch_buffer_size": 1,
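
To see why the buffer size might only matter when num_sgd_iter > 1, here is a toy model of a minibatch ring buffer. It is a sketch under assumptions, not RLlib's implementation: the class ToyMinibatchRing and the helper served_order are hypothetical, slots are visited round-robin (not randomly), and each stored train batch is served num_passes times before being replaced.

```python
from collections import deque


class ToyMinibatchRing:
    """Toy ring buffer: `size` slots, each batch served `num_passes` times."""

    def __init__(self, source, size, num_passes):
        self.source = source          # deque of incoming train batches
        self.slots = [None] * size    # currently retained batches
        self.ttl = [0] * size         # remaining uses per slot
        self.num_passes = num_passes
        self.idx = 0

    def get(self):
        if self.ttl[self.idx] <= 0:
            # Slot is exhausted: replace it with a fresh train batch.
            self.slots[self.idx] = self.source.popleft()
            self.ttl[self.idx] = self.num_passes
        batch = self.slots[self.idx]
        self.ttl[self.idx] -= 1
        # Assumption: move to the next slot round-robin.
        self.idx = (self.idx + 1) % len(self.slots)
        return batch


def served_order(size, num_passes, num_batches):
    """Order in which batch ids 0..num_batches-1 reach the learner."""
    ring = ToyMinibatchRing(deque(range(num_batches)), size, num_passes)
    return [ring.get() for _ in range(num_batches * num_passes)]


print(served_order(size=1, num_passes=1, num_batches=4))  # [0, 1, 2, 3]
print(served_order(size=4, num_passes=1, num_batches=4))  # [0, 1, 2, 3]
print(served_order(size=2, num_passes=2, num_batches=4))  # [0, 1, 0, 1, 2, 3, 2, 3]
```

In this toy model, a single pass makes the buffer size irrelevant (every batch is used once and replaced), while multiple passes interleave the retained batches, which is consistent with the comment quoted above.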