Hi,
PPO (and DDPPO) has the option sgd_minibatch_size, which splits train_batch_size into smaller chunks for each SGD iteration. However, APPO doesn't have this option, only minibatch_buffer_size.
So how does APPO do minibatching, and what controls the size of the minibatches?
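For reference, these are the config keys I'm comparing (the values below are placeholders I picked for illustration, not defaults):

    ppo_config = {
        "train_batch_size": 4000,     # total samples collected per training iteration
        "sgd_minibatch_size": 128,    # chunk size for each SGD pass
        "num_sgd_iter": 30,           # number of SGD passes over the train batch
    }
    appo_config = {
        "rollout_fragment_length": 50,  # samples per rollout fragment
        "num_envs_per_worker": 1,
        "train_batch_size": 500,        # batch size the learner thread trains on
        "minibatch_buffer_size": 1,     # the option whose role I'm asking about
        "num_sgd_iter": 1,              # passes over each train batch (?)
    }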
Thanks
Edit:
From IMPALA (which should be the same for APPO):
    # == Overview of data flow in IMPALA ==
    # 1. Policy evaluation in parallel across `num_workers` actors produces
    #    batches of size `rollout_fragment_length * num_envs_per_worker`.
    # 2. If enabled, the replay buffer stores and produces batches of size
    #    `rollout_fragment_length * num_envs_per_worker`.
    # 3. If enabled, the minibatch ring buffer stores and replays batches of
    #    size `train_batch_size` up to `num_sgd_iter` times per batch.
    # 4. The learner thread executes data parallel SGD across `num_gpus` GPUs
    #    on batches of size `train_batch_size`.
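To make that data flow concrete, here is a small worked example with numbers I picked arbitrarily (not RLlib defaults):

    rollout_fragment_length = 50
    num_envs_per_worker = 5
    train_batch_size = 500
    num_sgd_iter = 1

    # Steps 1-2: each worker (and, if enabled, the replay buffer) emits
    # sample batches of this size:
    sample_batch_size = rollout_fragment_length * num_envs_per_worker  # 250

    # Steps 3-4: the learner thread gathers sample batches until it has a
    # train batch of `train_batch_size` timesteps, then runs SGD on it
    # (replayed up to `num_sgd_iter` times via the minibatch ring buffer,
    # as I understand it):
    samples_per_train_batch = train_batch_size // sample_batch_size  # 2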
So that means (assuming the replay buffer and the minibatch ring buffer are both enabled) that train_batch_size in PPO roughly corresponds to rollout_fragment_length * num_envs_per_worker in APPO, and sgd_minibatch_size to train_batch_size.
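Or, written out as a mapping (this is just my reading of the comments above, not something from the docs):

    ppo_to_appo = {
        "train_batch_size": "rollout_fragment_length * num_envs_per_worker",
        "sgd_minibatch_size": "train_batch_size",
    }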
But the role of minibatch_buffer_size is still not fully clear to me.
Also, it states:
# number of passes to make over each train batch
"num_sgd_iter": 1
but in practice, if num_sgd_iter=3, it randomly samples 3 batches (each of size train_batch_size) from the minibatch ring buffer instead of making 3 passes over each batch. Is that correct?
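In pseudo-Python, these are the two behaviors I'm trying to tell apart (my own sketch; sgd_step is just a stand-in, not an RLlib function):

    import random

    def sgd_step(batch):
        pass  # stand-in for one data-parallel SGD update on the learner

    # Interpretation A: make `num_sgd_iter` passes over the *same* train batch
    # (what the config comment suggests).
    def passes_over_same_batch(train_batch, num_sgd_iter):
        for _ in range(num_sgd_iter):
            sgd_step(train_batch)

    # Interpretation B: pull `num_sgd_iter` batches at random from the
    # minibatch ring buffer (what I think I'm observing).
    def batches_from_ring_buffer(ring_buffer, num_sgd_iter):
        for _ in range(num_sgd_iter):
            sgd_step(random.choice(ring_buffer))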