How does `shuffle_sequences` work in PPO?

Hi everyone,

I am not sure how shuffle_sequences works in PPO, or which code is responsible for shuffling the mini-batches. I have always set it to True but never knew exactly how it works.
I inspected every train_batch that is passed to the loss function, and each one appears to be ordered: the unroll ids arrive in sorted order, e.g. [0 … 0 1 1 … 1 1 2 … 2].
The starting index (in terms of unroll id) of a mini-batch sampled from the train batch seems to be chosen randomly, but the rows inside the mini-batch always seem to be consecutive.

@avnishn any thoughts here?

I looked into it further, and I do see this on lines 330 to 332:

    for _ in range(self.num_sgd_iter):
        permutation = np.random.permutation(num_batches)
        for batch_index in range(num_batches):

which does a kind of random “shuffling” of the order in which the batches are visited.
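To make that concrete, here is a minimal sketch of what permuting the batch order would look like. This is not the RLlib source: the function name `permuted_minibatches` and the `minibatch_size` parameter are hypothetical, and I am assuming each permuted index is turned into a slice offset. Under that assumption the order of the mini-batches is random, but the rows inside each one stay consecutive, which would match what I observed.

```python
import numpy as np


def permuted_minibatches(train_batch, minibatch_size, rng):
    """Yield consecutive slices of train_batch, visited in a random order.

    Hypothetical helper for illustration only; not an RLlib function.
    """
    num_batches = len(train_batch) // minibatch_size
    permutation = rng.permutation(num_batches)
    for batch_index in range(num_batches):
        # Assumption: the permuted index only selects where the slice starts.
        offset = permutation[batch_index] * minibatch_size
        yield train_batch[offset:offset + minibatch_size]


# Toy train batch: 4 unrolls of 3 timesteps each, sorted by unroll id.
unroll_ids = np.repeat(np.arange(4), 3)
minibatches = list(
    permuted_minibatches(unroll_ids, minibatch_size=4, rng=np.random.default_rng(0))
)
for mb in minibatches:
    print(mb)
```

Every printed mini-batch is still sorted by unroll id; only the order in which the slices appear changes between runs with different seeds.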

self.shuffle_sequences is not used anywhere outside of __init__ in


PPO does shuffle sequences, but this is currently not configurable: the data is always shuffled. It happens here:

Thanks @mannyv. I forgot to mention that I was using fractional GPUs on my workers, so I suppose MultiGPUTrainOneStep was invoked instead of TrainOneStep. I do see that do_minibatch_sgd is invoked in TrainOneStep but not in MultiGPUTrainOneStep. So I suppose the multi-GPU version does its shuffling by taking a random offset from the random permutation at each iteration instead?

Hi @mickelliu,

I don't use the multi-GPU trainer, so I am not sure about that case. A quick search through the code only revealed that they do “shuffle” the batches, but if you only have one batch, I could not find any other form of shuffling either.
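For comparison, this is roughly what shuffling the individual rows before slicing into mini-batches would look like. It is only a sketch under my own assumptions: `shuffle_rows` is a hypothetical helper, the batch is modeled as a plain dict of equally long NumPy arrays, and none of this is taken from the RLlib code.

```python
import numpy as np


def shuffle_rows(batch, rng):
    """Apply one random row permutation to every column of the batch.

    Hypothetical helper for illustration only; not an RLlib function.
    """
    num_rows = len(next(iter(batch.values())))
    perm = rng.permutation(num_rows)
    # Using the same permutation for all columns keeps each row aligned.
    return {key: column[perm] for key, column in batch.items()}


# Toy batch: 3 unrolls of 4 timesteps each, with one extra column.
batch = {
    "unroll_id": np.repeat(np.arange(3), 4),
    "reward": np.arange(12, dtype=np.float32),
}
shuffled = shuffle_rows(batch, np.random.default_rng(0))
print(shuffled["unroll_id"])
```

With this kind of shuffling the unroll ids inside a mini-batch would generally no longer be consecutive, so a train batch that always looks sorted suggests that only the batch order is being permuted.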

@mannyv Thanks for the info. I think I have a good understanding of it now.