PPO (and DDPPO) has the option
sgd_minibatch_size, which splits
train_batch_size into smaller chunks for each SGD iteration. However, APPO doesn't have this option.
So how does APPO do minibatching, and what controls the size of the minibatches?
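For reference, PPO's minibatching can be sketched with a bit of arithmetic (the numbers here are illustrative, not RLlib defaults):

```python
# Hedged sketch of how PPO's sgd_minibatch_size relates to train_batch_size.
# These values are made up for illustration, not RLlib defaults.
train_batch_size = 4000      # one synchronous train batch
sgd_minibatch_size = 500     # chunk size for each SGD step
num_sgd_iter = 10            # number of passes (epochs) over the train batch

# Each SGD epoch walks over the train batch in minibatch-sized chunks:
minibatches_per_epoch = train_batch_size // sgd_minibatch_size
total_sgd_updates = minibatches_per_epoch * num_sgd_iter
print(minibatches_per_epoch, total_sgd_updates)  # 8 80
```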
From IMPALA (which should be the same for APPO):
# == Overview of data flow in IMPALA ==
# 1. Policy evaluation in parallel across `num_workers` actors produces
# batches of size `rollout_fragment_length * num_envs_per_worker`.
# 2. If enabled, the replay buffer stores and produces batches of size
# `rollout_fragment_length * num_envs_per_worker`.
# 3. If enabled, the minibatch ring buffer stores and replays batches of
# size `train_batch_size` up to `num_sgd_iter` times per batch.
# 4. The learner thread executes data parallel SGD across `num_gpus` GPUs
# on batches of size `train_batch_size`.
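The batch sizes in steps 1-4 above can be sanity-checked with simple arithmetic (illustrative numbers, not tuned values):

```python
# Illustrative numbers only -- not tuned values.
rollout_fragment_length = 50
num_envs_per_worker = 5

# Steps 1/2: each rollout worker (and the optional replay buffer)
# produces batches of this size:
per_worker_batch = rollout_fragment_length * num_envs_per_worker

# Steps 3/4: the minibatch ring buffer and the learner thread operate
# on batches of `train_batch_size`, which is configured independently:
train_batch_size = 500

print(per_worker_batch)                      # 250
print(train_batch_size // per_worker_batch)  # 2 worker batches per train batch
```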
So that means (assuming that both the replay buffer and the minibatch ring buffer are enabled) that
train_batch_size in PPO roughly corresponds to
rollout_fragment_length * num_envs_per_worker in APPO.
But the role of
minibatch_buffer_size is still not fully clear to me.
Also, it states:
# number of passes to make over each train batch
but in practice, if
num_sgd_iter=3, then it randomly samples 3 batches (each of size
train_batch_size) from the
minibatch ring buffer, and it doesn't do 3 passes on each batch. Is that correct?
Hey @vakker00 , great question! Agree, this is a little confusing and should be cleaned up!
APPO is an asynchronous version of PPO, kind of like a hybrid between IMPALA and PPO. For these asynchronous algos, sampling happens on n rollout workers in parallel (just like in most other algos), but the collected data is then sent right away (asynchronously) to a queue (in PPO, on the other hand, the collected data is gathered synchronously into one large train batch).
The APPO learner (local worker) then takes whatever arrives in the queue and performs an update. In this async setting, it would not make sense to perform n SGD iters on each train batch. The point here is to reach as much continuous (async) throughput as possible.
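The async data flow described above can be sketched as a plain producer/consumer setup (this is NOT RLlib code, just a minimal illustration with threads and a queue standing in for rollout workers and the learner):

```python
import queue
import threading

# Minimal sketch of the async flow: rollout workers push sample batches
# into a queue; the learner thread consumes whatever has arrived and
# performs exactly one update per batch -- no repeated SGD epochs.
sample_queue = queue.Queue(maxsize=16)
NUM_WORKERS = 2
BATCHES_PER_WORKER = 4
updates_done = []

def rollout_worker(worker_id):
    for step in range(BATCHES_PER_WORKER):
        # Each tuple is a placeholder for a real SampleBatch.
        sample_queue.put((worker_id, step))

def learner():
    for _ in range(NUM_WORKERS * BATCHES_PER_WORKER):
        batch = sample_queue.get()
        # One SGD update per incoming batch, then move on.
        updates_done.append(batch)

workers = [threading.Thread(target=rollout_worker, args=(i,))
           for i in range(NUM_WORKERS)]
learner_thread = threading.Thread(target=learner)
for t in workers + [learner_thread]:
    t.start()
for t in workers + [learner_thread]:
    t.join()

print(len(updates_done))  # 8
```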
To answer your question: APPO does not do minibatching, and minibatch_buffer_size should always be 1 to reflect that. It should also have the same value as
num_sgd_iter, since these two settings need to match. I remember there is an error in IMPALA when you set num_sgd_iter > 1, and I think it's because of this. Again, we should fix this and make it clearer in the comments.
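In config terms, the recommendation above amounts to something like this (a plain-dict sketch showing only the two relevant keys; a real APPO config has many more):

```python
# Sketch of the recommended APPO settings per the answer above.
# Only the two keys under discussion are shown.
appo_config = {
    "minibatch_buffer_size": 1,  # APPO does not minibatch
    "num_sgd_iter": 1,           # should match minibatch_buffer_size
}

# The two settings should agree:
assert appo_config["minibatch_buffer_size"] == appo_config["num_sgd_iter"]
```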
Thanks for the clarification, just a quick follow-up question as I was digging more into APPO recently.
The tuned pong-appo example has
minibatch_buffer_size: 4 and
num_sgd_iter: 2, so is that intentional? You mentioned that both should be 1.
Also, the doc for IMPALA suggests that you can actually use larger values:
# How many train batches should be retained for minibatching. This conf
# only has an effect if `num_sgd_iter > 1`.