Hello everyone, as the title suggests, I’m trying to understand how these two parameters work for off-policy algorithms such as QMIX. I have read a few posts and the docs, but I still have difficulty fully understanding their usage.
From my experience of running QMIX on my custom Gym environment yesterday, I think buffer_size is the number of “iterations” collected by the workers that will be stored in the buffer. For example, if I have 15 workers and they collectively sample 15 episodes, that counts as 1 iteration, and all of those episodes will be stored in the buffer (given that I have set batch_mode to complete_episodes). I inferred this because I saw my RAM utilization flatten out at the iteration count equal to my buffer_size.
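For reference, this is roughly the kind of config I mean (the values are placeholders, not my exact setup); it would be passed to something like tune.run("QMIX", config=config):

```python
# Rough sketch of my setup; the concrete values are placeholders.
config = {
    "env": "MyCustomEnv",                # placeholder name for my custom Gym env
    "num_workers": 15,                   # 15 rollout workers sampling in parallel
    "batch_mode": "complete_episodes",   # store whole episodes in each sample batch
    "buffer_size": 300,                  # replay buffer capacity (what I'm asking about)
    "train_batch_size": 1000,            # training batch size (also asking about this)
}
```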
Now comes train_batch_size, which I believe is the number of steps from each worker’s episode that will be used for training. For example, if the mean episode length is 2300, I set train_batch_size to 1000, and I have 15 workers each contributing one episode per iteration, then the actual training batch would be a concatenated batch of length 15 * 1000.
But what if I set batch_mode to truncate_episodes and set rollout_fragment_length to 500? Will it automatically downsize train_batch_size to be equal to the fragment length?
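In other words, something like this instead (again with placeholder values):

```python
# The variant I'm asking about (placeholder values again).
config_truncated = {
    "num_workers": 15,
    "batch_mode": "truncate_episodes",   # cut rollouts into fixed-length fragments
    "rollout_fragment_length": 500,      # each worker returns 500-step fragments
    "train_batch_size": 1000,            # does this get forced down to 500?
}
```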
The size of the train batch is determined by the train_batch_size config parameter. Train batches are usually sent to the Policy’s learn_on_batch method, which handles loss and gradient calculations, and optimizer stepping.
The training batch will be of size 1000 in your case. It does not matter how large the rollout fragments are or how many rollout workers you have - your batches will always be of size 1000.
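As a toy sketch of that rule (this is not RLlib’s actual implementation, just the idea): fragments are concatenated until train_batch_size timesteps are available, and that is what the learner trains on.

```python
# Toy sketch only, not RLlib's actual code: the learner ends up with
# train_batch_size timesteps regardless of how the fragments were produced.
def build_train_batch(fragments, train_batch_size):
    batch = []
    for fragment in fragments:
        batch.extend(fragment)
        if len(batch) >= train_batch_size:
            break
    return batch[:train_batch_size]

# 15 workers returning 500-step fragments (truncate_episodes) ...
fragments_short = [[("step", t) for t in range(500)] for _ in range(15)]
# ... or 15 workers returning 2300-step complete episodes:
fragments_long = [[("step", t) for t in range(2300)] for _ in range(15)]

print(len(build_train_batch(fragments_short, 1000)))  # -> 1000
print(len(build_train_batch(fragments_long, 1000)))   # -> 1000
```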
Size of the replay buffer in batches (not timesteps!).
For other algorithms I would substitute the word “iterations” with “experiences”, because buffer_size limits the number of individual samples collected by the rollout workers, which can be the same for various numbers of rollout workers, episodes, and so on. For SAC, for example, the documentation describes the replay buffer size in time steps.
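To make the contrast concrete (the numbers here are only illustrative, using the old-style flat config keys):

```python
# Illustrative values only.
qmix_config = {
    "buffer_size": 1000,       # QMIX: replay capacity counted in sample batches
}
sac_config = {
    "buffer_size": 1_000_000,  # SAC: replay capacity counted in individual timesteps
}
```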
I have another question if you don’t mind. My current understanding is that the concatenated train batch, which is a concatenation of all the sampled batches, gets passed to the SGD optimizer and updates the network once. So if train_batch_size is a hard limit on the size of this concatenated batch, what happens if every complete episode is bigger than the training batch size (even though the concatenated batch consists of 15 episode batches)? Would the model effectively be trained on a single episode every time, because train_batch_size is so small relative to the full length of the concatenated batch?
Therefore, ideally you would want to set train_batch_size = sample_batch_length * num_workers.
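For example (made-up numbers, assuming my reasoning is right):

```python
# Made-up numbers; assuming my reasoning above is right.
num_workers = 15
sample_batch_length = 500    # length of each worker's sample batch (rollout fragment)
train_batch_size = num_workers * sample_batch_length
print(train_batch_size)      # 7500 -> every collected step would be used exactly once
```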
Could you please confirm if I have understood this correctly? Thanks in advance.
So if we have rollout_fragment_length and num_workers and set train_batch_size = num_workers * rollout_fragment_length, it should work nicely and make one large batch for training. But if we set train_batch_size to something smaller than this, will it just concatenate and split it up, and then, as the previous post mentioned, could it result in only training on single episodes each time?
For PPO we only run on-policy, so we can’t save data. What happens if we can’t divide all the data up into even piles of train_batch_size? Does it end up with a smaller final pile?
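To make the question concrete (hypothetical numbers):

```python
# Hypothetical numbers for the situation I'm asking about.
num_workers = 4
rollout_fragment_length = 300
collected = num_workers * rollout_fragment_length   # 1200 on-policy steps per sampling round
train_batch_size = 500
full_piles, leftover = divmod(collected, train_batch_size)
print(full_piles, leftover)  # 2 piles of 500, with 200 steps left over -> a smaller final pile?
```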