I am not sure how shuffle_sequences works in PPO, nor which code is responsible for shuffling the mini-batches. I have always set this option to True but never knew exactly what it does.
I looked into every train_batch passed to the loss function, and the train batch appears to be ordered: the unroll IDs come in order as well, e.g. [0 … 0 1 1 … 1 1 2 … 2].
The starting index (in terms of unroll ID) of a mini-batch sampled from the train batch seems to be chosen randomly, but the rows within the mini-batch are always consecutive.
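To make the observation concrete, here is a toy NumPy sketch of the pattern I keep seeing (the unroll_ids array and the slicing are my own illustration, not RLlib code):

```python
import numpy as np

# Hypothetical train batch ordered by unroll ID: 8 unrolls of length 4 -> 32 rows
unroll_ids = np.repeat(np.arange(8), 4)   # [0 0 0 0 1 1 1 1 2 ...]
minibatch_size = 8

# What I observe: a random start offset, then a consecutive slice from there.
start = np.random.randint(0, len(unroll_ids) - minibatch_size + 1)
minibatch = unroll_ids[start:start + minibatch_size]
print(minibatch)  # e.g. [2 2 3 3 3 3 4 4] -- still in order, never permuted
```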
Thanks @mannyv, I forgot to mention that I was using fractional GPUs on my workers, so I suppose MultiGPUTrainOneStep was invoked instead of TrainOneStep. I do see that do_minibatch_sgd is invoked in TrainOneStep but not in MultiGPUTrainOneStep. So I suppose the multi-GPU version does its shuffling by sampling random offset values from a random permutation at each iteration?
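In other words, I'd expect the effective logic to look roughly like the sketch below. This is purely my guess at the mechanism, and permuted_offset_minibatches is a made-up helper, not anything from RLlib:

```python
import numpy as np

def permuted_offset_minibatches(batch_size, minibatch_size, num_sgd_iter, rng):
    """Hypothetical scheme: per SGD iteration, visit consecutive slices of the
    train batch, with the slice offsets drawn from a fresh random permutation."""
    num_minibatches = batch_size // minibatch_size
    for _ in range(num_sgd_iter):
        for i in rng.permutation(num_minibatches):
            start = i * minibatch_size
            yield slice(start, start + minibatch_size)

rng = np.random.default_rng(0)
for sl in permuted_offset_minibatches(batch_size=32, minibatch_size=8,
                                      num_sgd_iter=2, rng=rng):
    print(sl)  # slices stay consecutive internally; only their order is random
```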
I don't use the multi-GPU trainer, so I am not sure in that case. A quick search through the code only revealed that they do "shuffle" the batches, but if you only have one batch, I could not find any other form of shuffling either.
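Just to spell out the distinction I mean, here is a quick illustration (plain NumPy, not RLlib code): shuffling a list of batches only permutes which batch you visit first, which does nothing when the list has one element, whereas a row-level shuffle actually reorders the timesteps:

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# "Shuffling the batches": permute the order of whole batches. With a single
# batch in the list, this is a no-op.
batches = [np.arange(0, 8), np.arange(8, 16)]
random.shuffle(batches)

# Row-level shuffling: permute the timesteps inside one batch. This is the
# kind of shuffle that would actually break up the consecutive ordering.
batch = np.arange(16)
shuffled = batch[rng.permutation(len(batch))]
print(shuffled)
```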