PPO algorithm's train buffer only collects the first fragment from each worker?

@mickelliu,

This has a pretty good overview of how sample collection works:

https://docs.ray.io/en/latest/rllib-sample-collection.html

During the execution plan, each worker will be asked to produce samples from new rollouts of the current model. Assuming you are using the truncate_episodes batch_mode, each worker will roll out exactly num_envs_per_worker * rollout_fragment_length timesteps; that is how you get a total sample size of num_workers * num_envs_per_worker * rollout_fragment_length.
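For concreteness, here is a minimal sketch of that arithmetic with hypothetical values for those config keys:

```python
# Hypothetical values for the relevant config keys (truncate_episodes batch_mode).
config = {
    "num_workers": 4,
    "num_envs_per_worker": 2,
    "rollout_fragment_length": 50,
    "batch_mode": "truncate_episodes",
}

# Each worker returns num_envs_per_worker * rollout_fragment_length timesteps per sample call.
per_worker = config["num_envs_per_worker"] * config["rollout_fragment_length"]  # 100

# All workers together contribute num_workers * num_envs_per_worker * rollout_fragment_length.
total_sample_size = config["num_workers"] * per_worker                          # 400
print(per_worker, total_sample_size)
```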

RLlib requires that train_batch_size be a multiple of that total sample size.
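Continuing with the same hypothetical values:

```python
# train_batch_size must be a multiple of the total sample size collected per
# round of sampling (hypothetical values: 4 workers, 2 envs each, fragment length 50).
total_sample_size = 4 * 2 * 50   # = 400 timesteps per sampling round
train_batch_size = 800           # 2 * total_sample_size -> two sampling rounds per train batch
assert train_batch_size % total_sample_size == 0
```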

RLlib with the default settings does not terminate your environment artificially. If a worker collects rollout_fragment_length timesteps (t[0:49]) from an episode, it will pause the environment and return those samples for training. On the next call for new samples, it will resume the environment from where it left off (t[50]).

I don't know anything about your environment, but keep in mind that if it does not have fixed-size episodes, or if that length is not a multiple of 50, then you will get mixed episodes in a fragment, which in general is not a problem. For example, if you had an environment with a fixed episode length of 60, your first sample call would return a SampleBatch with e0_t[0:49], but your second sample call would contain samples from two episodes: the last 10 steps of the first episode and the first 40 steps of the second episode ([e0_t[50:59], e1_t[0:39]]).
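Here is a toy illustration (plain Python, not RLlib internals) of how a fixed 60-step episode gets cut into 50-step fragments across successive sample calls:

```python
episode_len = 60    # hypothetical fixed episode length
fragment_len = 50   # rollout_fragment_length

def timestep_stream():
    """Yield (episode_id, t) labels for an endless stream of back-to-back episodes."""
    episode_id = 0
    while True:
        for t in range(episode_len):
            yield (episode_id, t)
        episode_id += 1

stream = timestep_stream()
for call in range(2):
    fragment = [next(stream) for _ in range(fragment_len)]
    print(f"sample call {call}: {fragment[0]} ... {fragment[-1]}")

# sample call 0: (0, 0) ... (0, 49)   -> e0_t[0:49]
# sample call 1: (0, 50) ... (1, 39)  -> e0_t[50:59] followed by e1_t[0:39]
```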

If you want to sample ONLY a fixed number of steps from each episode of your environment, you can use the horizon key in the config to have RLlib artificially terminate the episode after that many steps.
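For example, assuming the same fragment length as above, a config along these lines would make each fragment hold exactly one artificially terminated episode:

```python
# Sketch: "horizon" tells RLlib to treat the episode as done after that many
# steps, even if the environment itself has not terminated.
config = {
    "horizon": 50,                    # artificially end every episode after 50 steps
    "rollout_fragment_length": 50,    # so each fragment contains exactly one episode
    "batch_mode": "truncate_episodes",
}
```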