PPO (and DDPPO) has the option `sgd_minibatch_size`, which splits `train_batch_size` into smaller chunks for each SGD iteration. However, APPO doesn't have this option.
So how does APPO do minibatching, and what controls the size of the minibatches?
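For reference, this is the PPO mechanism I mean (a minimal sketch using the old trainer API; the env and values are made up):

```python
from ray.rllib.agents.ppo import PPOTrainer

# PPO collects `train_batch_size` samples per training iteration, then
# splits them into `sgd_minibatch_size` chunks and makes `num_sgd_iter`
# SGD passes over the whole batch.
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "train_batch_size": 4000,
        "sgd_minibatch_size": 128,
        "num_sgd_iter": 30,
    },
)
```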
From IMPALA (which should be the same for APPO):
```python
# == Overview of data flow in IMPALA ==
# 1. Policy evaluation in parallel across `num_workers` actors produces
#    batches of size `rollout_fragment_length * num_envs_per_worker`.
# 2. If enabled, the replay buffer stores and produces batches of size
#    `rollout_fragment_length * num_envs_per_worker`.
# 3. If enabled, the minibatch ring buffer stores and replays batches of
#    size `train_batch_size` up to `num_sgd_iter` times per batch.
# 4. The learner thread executes data parallel SGD across `num_gpus` GPUs
#    on batches of size `train_batch_size`.
```
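Tracing that overview with some assumed config values (the numbers are examples I picked, not defaults):

```python
# Assumed example values, just to trace the four steps above.
num_workers = 4
num_envs_per_worker = 5
rollout_fragment_length = 50
train_batch_size = 500

# Steps 1-2: each of the 4 workers produces (and the replay buffer, if
# enabled, stores) batches of this size:
rollout_batch_size = rollout_fragment_length * num_envs_per_worker  # 250

# Steps 3-4: the minibatch ring buffer and learner thread operate on
# batches of exactly `train_batch_size`:
learner_batch_size = train_batch_size  # 500, i.e. two rollout batches

print(rollout_batch_size, learner_batch_size)
```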
So that means (assuming that the replay buffer and the minibatch ring buffer are both enabled) that `train_batch_size` in PPO roughly corresponds to `rollout_fragment_length * num_envs_per_worker` in APPO, and `sgd_minibatch_size` in PPO roughly corresponds to `train_batch_size` in APPO.
But the role of `minibatch_buffer_size` is still not fully clear to me.
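In config terms, these are the knobs in question (an illustrative APPO setup; the env and specific values are made up):

```python
from ray.rllib.agents.ppo import APPOTrainer

trainer = APPOTrainer(
    env="CartPole-v0",
    config={
        # Step 1: rollout batches of 50 * 5 = 250 samples per worker.
        "num_workers": 4,
        "num_envs_per_worker": 5,
        "rollout_fragment_length": 50,
        # Steps 3-4: the learner thread runs SGD on 500-sample batches.
        "train_batch_size": 500,
        # The two options whose interplay is unclear to me:
        "minibatch_buffer_size": 1,
        "num_sgd_iter": 1,
    },
)
```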
Also, it states:
```python
# number of passes to make over each train batch
"num_sgd_iter": 1,
```
but in practice, if `num_sgd_iter=3`, it samples 3 batches at random (each of size `train_batch_size`) from the minibatch ring buffer, rather than making 3 passes over each batch. Is that correct?
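For concreteness, here is a simplified toy model of a minibatch ring buffer that matches the behavior I think I'm seeing (my own sketch, not RLlib's actual implementation):

```python
import random

class MinibatchRingSketch:
    """Toy model of a minibatch ring buffer: holds up to `size` train
    batches, and each stored batch may be handed out up to `num_passes`
    times before it is dropped."""

    def __init__(self, size, num_passes):
        self.size = size
        self.num_passes = num_passes
        self.slots = []  # each slot is [batch, remaining_passes]

    def put(self, batch):
        if len(self.slots) >= self.size:
            self.slots.pop(0)  # evict the oldest batch to make room
        self.slots.append([batch, self.num_passes])

    def get(self):
        # One batch per SGD step. With num_passes=3, a given batch is
        # replayed up to 3 times in total, but consecutive get() calls
        # can return *different* batches, rather than 3 passes in a row
        # over the same one.
        slot = random.choice(self.slots)
        slot[1] -= 1
        if slot[1] == 0:
            self.slots.remove(slot)
        return slot[0]


ring = MinibatchRingSketch(size=4, num_passes=3)
for i in range(4):
    ring.put(f"train_batch_{i}")
print([ring.get() for _ in range(3)])  # not necessarily the same batch 3x
```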