PPO algorithm's train buffer only collects the first fragment from each worker?

@mickelliu,

This has a pretty good overview of how sample collection works:

https://docs.ray.io/en/latest/rllib-sample-collection.html

During the execution plan, each worker will be asked to produce samples from new rollouts of the current model. Assuming you are using the truncate_episodes batch_mode, each worker will roll out exactly num_envs_per_worker * rollout_fragment_length timesteps; that is how you get a total sample size of num_workers * num_envs_per_worker * rollout_fragment_length.
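For concreteness, here is a minimal sketch of that arithmetic with hypothetical values for those config keys:

```python
# Hypothetical values for the relevant config keys (truncate_episodes batch_mode).
config = {
    "num_workers": 4,
    "num_envs_per_worker": 2,
    "rollout_fragment_length": 50,
    "batch_mode": "truncate_episodes",
}

# Each worker returns num_envs_per_worker * rollout_fragment_length timesteps per sample call.
per_worker = config["num_envs_per_worker"] * config["rollout_fragment_length"]  # 100

# All workers together contribute num_workers * num_envs_per_worker * rollout_fragment_length.
total_sample_size = config["num_workers"] * per_worker                          # 400
print(per_worker, total_sample_size)
```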

RLlib requires that train_batch_size be a multiple of that total sample size.
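Continuing with the same hypothetical values:

```python
# train_batch_size must be a multiple of the total sample size collected per
# round of sampling (hypothetical values: 4 workers, 2 envs each, fragment length 50).
total_sample_size = 4 * 2 * 50   # = 400 timesteps per sampling round
train_batch_size = 800           # 2 * total_sample_size -> two sampling rounds per train batch
assert train_batch_size % total_sample_size == 0
```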

RLlib with the default settings does not terminate your environment artificially. If a worker collects rollout_fragment_length timesteps (t[0:49]) from an episode, it will pause the environment and return those samples for training. On the next call for new samples, it will resume the environment from where it left off (t[50]).

I don't know anything about your environment, but keep in mind that if it does not have fixed-size episodes, or if that length is not a multiple of 50, then you will get mixed episodes in a fragment, which in general is not a problem. For example, if you had an environment with a fixed episode length of 60, your first sample call would return a SampleBatch with e0_t[0:49], but your second sample call would contain samples from two episodes: the last 10 steps of the first episode and the first 40 steps of the second episode ([e0_t[50:59], e1_t[0:39]]).
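Here is a toy illustration (plain Python, not RLlib internals) of how a fixed 60-step episode gets cut into 50-step fragments across successive sample calls:

```python
episode_len = 60    # hypothetical fixed episode length
fragment_len = 50   # rollout_fragment_length

def timestep_stream():
    """Yield (episode_id, t) labels for an endless stream of back-to-back episodes."""
    episode_id = 0
    while True:
        for t in range(episode_len):
            yield (episode_id, t)
        episode_id += 1

stream = timestep_stream()
for call in range(2):
    fragment = [next(stream) for _ in range(fragment_len)]
    print(f"sample call {call}: {fragment[0]} ... {fragment[-1]}")

# sample call 0: (0, 0) ... (0, 49)   -> e0_t[0:49]
# sample call 1: (0, 50) ... (1, 39)  -> e0_t[50:59] followed by e1_t[0:39]
```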

If you want to sample ONLY a fixed number of steps from each episode of your environment, you can use the horizon key in the config to have RLlib artificially terminate the episode after that many steps.
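For example, assuming the same fragment length as above, a config along these lines would make each fragment hold exactly one artificially terminated episode:

```python
# Sketch: "horizon" tells RLlib to treat the episode as done after that many
# steps, even if the environment itself has not terminated.
config = {
    "horizon": 50,                    # artificially end every episode after 50 steps
    "rollout_fragment_length": 50,    # so each fragment contains exactly one episode
    "batch_mode": "truncate_episodes",
}
```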