PPO (and DDPPO) has the option `sgd_minibatch_size`, which splits `train_batch_size` into smaller chunks for each SGD iteration. However, APPO doesn't have this option.
So how does APPO do minibatching, and what controls the size of the minibatches?
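For reference, this is the PPO mechanism I mean (a minimal sketch using the old trainer API; the env and values are made up):

```python
from ray.rllib.agents.ppo import PPOTrainer

# PPO collects `train_batch_size` samples per training iteration, then
# splits them into `sgd_minibatch_size` chunks and makes `num_sgd_iter`
# SGD passes over the whole batch.
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "train_batch_size": 4000,
        "sgd_minibatch_size": 128,
        "num_sgd_iter": 30,
    },
)
```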
From IMPALA (which should be the same for APPO):
```python
# == Overview of data flow in IMPALA ==
# 1. Policy evaluation in parallel across `num_workers` actors produces
#    batches of size `rollout_fragment_length * num_envs_per_worker`.
# 2. If enabled, the replay buffer stores and produces batches of size
#    `rollout_fragment_length * num_envs_per_worker`.
# 3. If enabled, the minibatch ring buffer stores and replays batches of
#    size `train_batch_size` up to `num_sgd_iter` times per batch.
# 4. The learner thread executes data parallel SGD across `num_gpus` GPUs
#    on batches of size `train_batch_size`.
```
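Tracing that overview with some assumed config values (the numbers are examples I picked, not defaults):

```python
# Assumed example values, just to trace the four steps above.
num_workers = 4
num_envs_per_worker = 5
rollout_fragment_length = 50
train_batch_size = 500

# Steps 1-2: each of the 4 workers produces (and the replay buffer, if
# enabled, stores) batches of this size:
rollout_batch_size = rollout_fragment_length * num_envs_per_worker  # 250

# Steps 3-4: the minibatch ring buffer and learner thread operate on
# batches of exactly `train_batch_size`:
learner_batch_size = train_batch_size  # 500, i.e. two rollout batches

print(rollout_batch_size, learner_batch_size)
```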
So that means (assuming that the replay buffer and the minibatch ring buffer are both enabled) that `train_batch_size` in PPO roughly corresponds to `rollout_fragment_length * num_envs_per_worker` in APPO, and `sgd_minibatch_size` in PPO roughly corresponds to `train_batch_size` in APPO.
But the role of `minibatch_buffer_size` is still not fully clear to me.
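In config terms, these are the knobs in question (an illustrative APPO setup; the env and specific values are made up):

```python
from ray.rllib.agents.ppo import APPOTrainer

trainer = APPOTrainer(
    env="CartPole-v0",
    config={
        # Step 1: rollout batches of 50 * 5 = 250 samples per worker.
        "num_workers": 4,
        "num_envs_per_worker": 5,
        "rollout_fragment_length": 50,
        # Steps 3-4: the learner thread runs SGD on 500-sample batches.
        "train_batch_size": 500,
        # The two options whose interplay is unclear to me:
        "minibatch_buffer_size": 1,
        "num_sgd_iter": 1,
    },
)
```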
Also, it states:
```python
# number of passes to make over each train batch
"num_sgd_iter": 1,
```
but in practice, if `num_sgd_iter=3`, it samples 3 batches at random (each of size `train_batch_size`) from the minibatch ring buffer, rather than making 3 passes over each batch. Is that correct?
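For concreteness, here is a simplified toy model of a minibatch ring buffer that matches the behavior I think I'm seeing (my own sketch, not RLlib's actual implementation):

```python
import random

class MinibatchRingSketch:
    """Toy model of a minibatch ring buffer: holds up to `size` train
    batches, and each stored batch may be handed out up to `num_passes`
    times before it is dropped."""

    def __init__(self, size, num_passes):
        self.size = size
        self.num_passes = num_passes
        self.slots = []  # each slot is [batch, remaining_passes]

    def put(self, batch):
        if len(self.slots) >= self.size:
            self.slots.pop(0)  # evict the oldest batch to make room
        self.slots.append([batch, self.num_passes])

    def get(self):
        # One batch per SGD step. With num_passes=3, a given batch is
        # replayed up to 3 times in total, but consecutive get() calls
        # can return *different* batches, rather than 3 passes in a row
        # over the same one.
        slot = random.choice(self.slots)
        slot[1] -= 1
        if slot[1] == 0:
            self.slots.remove(slot)
        return slot[0]


ring = MinibatchRingSketch(size=4, num_passes=3)
for i in range(4):
    ring.put(f"train_batch_{i}")
print([ring.get() for _ in range(3)])  # not necessarily the same batch 3x
```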