Does the PPO train batch only collect the first fragment from each worker?

Hello everyone, although I have had major successes training PPO models with RLlib, I still have difficulty understanding the mechanism for trajectory collection, and particularly how train_batch_size and rollout_fragment_length affect it.

My particular case is that I want to keep fragments of length 50 in a fixed-size train batch, that is, the workers keep filling this train batch with length-50 fragments until the batch is full. However, RLlib always shows a warning and auto-adjusts my fragment length so that train_batch_size is exactly equal to rollout_fragment_length * number_of_workers * number_of_envs_per_worker. So I guess that the behavior “accumulated into an extra-large train batch” described in ray-project/ray GitHub issue #10179 ('rollout_fragment_length' and 'truncate_episodes') won’t happen anymore.

Can I safely assume that the training batch is therefore all filled with the very first fragments collected from all workers, and other fragments after the first one will not be a part of any training batch (besides being used for calculating the tune.episode_reward)?

Also, what I really want is to fill the train batch with fragments from the fewest episodes possible. For example, if the episode length is 1,000 and the fragment length is 100, I want all 10 fragments of that episode to be present in my train batch, instead of having 10 workers each collect only the first fragment of their own episode.


This has a pretty good overview of how sample collection works:


During the execution plan, each worker is asked to produce samples from new rollouts of the current model. Assuming you are using the truncate_episodes batch_mode, each worker rolls out exactly num_envs_per_worker * rollout_fragment_length timesteps per sample call, which is how you get a total sample size of num_workers * num_envs_per_worker * rollout_fragment_length.

RLlib requires that train_batch_size is a multiple of that total sample size.
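To make the arithmetic concrete, here is a minimal sketch in plain Python (no RLlib import). The helper names and the worker/env counts are hypothetical, chosen only for illustration; the code just encodes the product and the divisibility rule described above.

```python
def total_sample_size(num_workers, num_envs_per_worker, rollout_fragment_length):
    """Timesteps returned by one round of sampling across all workers."""
    return num_workers * num_envs_per_worker * rollout_fragment_length


def is_valid_train_batch_size(train_batch_size, num_workers,
                              num_envs_per_worker, rollout_fragment_length):
    """True if the train batch is a whole number of sampling rounds."""
    per_round = total_sample_size(
        num_workers, num_envs_per_worker, rollout_fragment_length
    )
    return train_batch_size % per_round == 0


# Hypothetical setup: 4 workers, 2 envs each, fragments of 50 timesteps.
per_round = total_sample_size(4, 2, 50)   # 4 * 2 * 50 = 400
rounds_needed = 4000 // per_round         # 10 sampling rounds fill a batch of 4000
```

With these numbers a train batch of 4000 is made of 10 sampling rounds, so under this rule each worker would contribute more than just its first fragment.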

With the default settings, RLlib does not terminate your environment artificially. If a worker collects rollout_fragment_length timesteps (t[0:49]) from an episode, it pauses the environment and returns those samples for training. On the next call for new samples, it resumes the environment from where it left off (t[50]).

I don’t know anything about your environment, but keep in mind that if it does not have fixed-length episodes, or if that length is not a multiple of 50, then you will get fragments that mix episodes, which in general is not a problem. For example, if you had an environment with a fixed episode length of 60, then your first sample call would return a sample batch with e0_t[0:49], but your second sample call would return samples from two episodes: the last 10 steps of the first episode and the first 40 steps of the second episode ([e0_t[50:59], e1_t[0:39]]).
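The pause-and-resume behavior in that example can be mimicked with a small simulation. This is only a sketch of the bookkeeping, not RLlib code; collect_fragments is a hypothetical helper and the environment is reduced to an (episode_id, timestep) counter.

```python
def collect_fragments(episode_length, fragment_length, num_fragments):
    """Simulate one worker with one env that is paused between sample calls.

    Each fragment is a list of (episode_id, timestep) pairs. The env is
    never reset early: a new episode starts only when the old one ends.
    """
    fragments = []
    episode_id, t = 0, 0
    for _ in range(num_fragments):
        fragment = []
        while len(fragment) < fragment_length:
            fragment.append((episode_id, t))
            t += 1
            if t == episode_length:  # episode finished, start the next one
                episode_id += 1
                t = 0
        fragments.append(fragment)
    return fragments


first, second = collect_fragments(episode_length=60, fragment_length=50,
                                  num_fragments=2)
print(first[0], first[-1])      # (0, 0) (0, 49)
print(second[0], second[9])     # (0, 50) (0, 59)
print(second[10], second[-1])   # (1, 0) (1, 39)
```

The second fragment starts at step 50 of episode 0 and crosses into episode 1 after 10 steps, matching [e0_t[50:59], e1_t[0:39]].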

If you want to sample ONLY a fixed number of steps from your environment, then you can use the horizon key in the config to have RLlib artificially terminate your environment after that many steps.
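As a rough illustration, a config dict using the keys named in this thread might look like the sketch below. The values are hypothetical, and it assumes an RLlib version whose config accepts these keys under exactly these names.

```python
# Hypothetical values; only the key names come from this thread.
config = {
    "batch_mode": "truncate_episodes",
    "rollout_fragment_length": 100,
    "horizon": 1000,  # episode is cut off after this many steps
}

# With a horizon that is a multiple of the fragment length, every
# fragment stays inside a single episode.
fragments_per_episode = config["horizon"] // config["rollout_fragment_length"]
```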

Hi @mannyv, I really appreciate your answer. As you suggested, I will make sure the horizon is a multiple of the rollout_fragment_length.

I have another question, if you don’t mind. Do you know if we can achieve something like what OpenAI did in their Hide and Seek paper (Section B.5, page 24)? https://arxiv.org/pdf/1909.07528.pdf. To summarize, they did two things:

  1. Their episode length is at least 240 steps, and the fragment length is 160. In each training iteration, the PPO buffer collects fragments with various beginning and end timesteps (some are e0_t[0:159], some are e0_t[160:319], and so on). From your description of how RLlib trajectory collection works, I think that at each iteration RLlib tends to collect fragments with the same beginning and end timesteps from all workers into the train batch?

  2. OpenAI further splits (BPTT truncation) each length-160 fragment into 16 chunks of 10 timesteps, and every SGD minibatch is set to collect 64,000 chunks of 10 timesteps. I don’t think this is a built-in feature in RLlib yet, correct?

Thank you so much for your time.

Hi @mickelliu,

I just noticed your followup question.

I think you would use these settings:

rollout_fragment_length: 160
train_batch_size: 320000
sgd_minibatch_size: 64000
model: {max_seq_len: 10}
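For intuition, the chunking from the paper (a 160-step fragment cut into 16 chunks of 10) can be sketched in plain Python as below. This only illustrates the slicing itself; split_into_chunks is a hypothetical helper, and the sketch makes no claim about how RLlib applies max_seq_len internally.

```python
def split_into_chunks(fragment, chunk_len):
    """Cut a fragment into consecutive, non-overlapping chunks of chunk_len."""
    if len(fragment) % chunk_len != 0:
        raise ValueError("fragment length must be a multiple of chunk_len")
    return [fragment[i:i + chunk_len]
            for i in range(0, len(fragment), chunk_len)]


fragment = list(range(160))            # stand-in for 160 timesteps
chunks = split_into_chunks(fragment, 10)
print(len(chunks))                     # 16
print(chunks[0][0], chunks[0][-1])     # 0 9
print(chunks[-1][0], chunks[-1][-1])   # 150 159
```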

You will want to make sure that

train_batch_size % (num_workers * rollout_fragment_length) == 0
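A quick way to see which worker counts satisfy that condition for the settings above is sketched here. valid_worker_counts is a hypothetical helper, and the sketch assumes one environment per worker (with more, num_envs_per_worker would multiply into the divisor as described earlier in the thread).

```python
def valid_worker_counts(train_batch_size, rollout_fragment_length, max_workers):
    """Worker counts n with train_batch_size % (n * rollout_fragment_length) == 0."""
    return [
        n for n in range(1, max_workers + 1)
        if train_batch_size % (n * rollout_fragment_length) == 0
    ]


# Settings from this thread: 320000 / 160 = 2000 fragments per train batch,
# so the worker count has to divide 2000.
counts = valid_worker_counts(train_batch_size=320000,
                             rollout_fragment_length=160,
                             max_workers=20)
print(counts)  # [1, 2, 4, 5, 8, 10, 16, 20]
```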