I am not sure how shuffle_sequences works in PPO, nor which code is responsible for shuffling the mini-batches. I have always set this option to True but never knew exactly what it does.
I looked into every train_batch passed to the loss function, and the train batch appears to be ordered: the unroll IDs come in order as well, e.g. [0 … 0 1 1 … 1 1 2 … 2].
The starting index (in terms of unroll ID) of a mini-batch sampled from the train batch seems to be chosen randomly, but the rows within the mini-batch are always consecutive.
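To make the observation concrete, here is a toy NumPy sketch of the pattern I keep seeing (the unroll_ids array and the slicing are my own illustration, not RLlib code):

```python
import numpy as np

# Hypothetical train batch ordered by unroll ID: 8 unrolls of length 4 -> 32 rows
unroll_ids = np.repeat(np.arange(8), 4)   # [0 0 0 0 1 1 1 1 2 ...]
minibatch_size = 8

# What I observe: a random start offset, then a consecutive slice from there.
start = np.random.randint(0, len(unroll_ids) - minibatch_size + 1)
minibatch = unroll_ids[start:start + minibatch_size]
print(minibatch)  # e.g. [2 2 3 3 3 3 4 4] -- still in order, never permuted
```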
Thanks @mannyv, I forgot to mention that I was using fractional GPUs on my workers, so I suppose MultiGPUTrainOneStep was invoked instead of TrainOneStep. I do see that do_minibatch_sgd is invoked in TrainOneStep but not in MultiGPUTrainOneStep. So I suppose the multi-GPU version does its shuffling by sampling random offset values from a random permutation at each iteration?
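In other words, I'd expect the effective logic to look roughly like the sketch below. This is purely my guess at the mechanism, and permuted_offset_minibatches is a made-up helper, not anything from RLlib:

```python
import numpy as np

def permuted_offset_minibatches(batch_size, minibatch_size, num_sgd_iter, rng):
    """Hypothetical scheme: per SGD iteration, visit consecutive slices of the
    train batch, with the slice offsets drawn from a fresh random permutation."""
    num_minibatches = batch_size // minibatch_size
    for _ in range(num_sgd_iter):
        for i in rng.permutation(num_minibatches):
            start = i * minibatch_size
            yield slice(start, start + minibatch_size)

rng = np.random.default_rng(0)
for sl in permuted_offset_minibatches(batch_size=32, minibatch_size=8,
                                      num_sgd_iter=2, rng=rng):
    print(sl)  # slices stay consecutive internally; only their order is random
```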
I don't use the multi-GPU trainer, so I am not sure in that case. A quick search through the code only revealed that they do "shuffle" the batches, but if you only have one batch, I could not find any other form of shuffling either.
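Just to spell out the distinction I mean, here is a quick illustration (plain NumPy, not RLlib code): shuffling a list of batches only permutes which batch you visit first, which does nothing when the list has one element, whereas a row-level shuffle actually reorders the timesteps:

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# "Shuffling the batches": permute the order of whole batches. With a single
# batch in the list, this is a no-op.
batches = [np.arange(0, 8), np.arange(8, 16)]
random.shuffle(batches)

# Row-level shuffling: permute the timesteps inside one batch. This is the
# kind of shuffle that would actually break up the consecutive ordering.
batch = np.arange(16)
shuffled = batch[rng.permutation(len(batch))]
print(shuffled)
```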