How valuable are bootstrapped values for PPO training?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

I noticed that the number of timesteps sampled is not equal to the size of the batch the learner receives. E.g., I sample 1000 steps, yet my batch has a size > 1000; to be precise, it is 1000 + the number of episodes. I aim to have samples == train batch size.

I realized this is due to the AddOneTsToEpisodesAndTruncate connector in the Learner connector pipeline, which duplicates the last observation and action and appends them to the episode. (If I do 4 steps, I have 4 actions but 5 observations, so the final observation has no action of its own.) The reward of the appended timestep is set to 0.
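In other words (a toy sketch of my understanding, not the actual RLlib connector code; the values are made up):

```python
# A finished 2-step episode as collected by the sampler:
observations = ["O1", "O2", "O3"]   # T + 1 observations, incl. the reset obs
actions      = ["A1", "A2"]         # T actions
rewards      = [1.0, 1.0]           # T rewards

# What AddOneTsToEpisodesAndTruncate does, as far as I can tell:
observations.append(observations[-1])   # duplicate the last observation
actions.append(actions[-1])             # duplicate the last action
rewards.append(0.0)                     # dummy reward of 0

# The episode now yields one more trainable row than it has real steps,
# which is why the train batch grows by one sample per episode.
```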

I.e., a trajectory (O1, A1, R1 | O2, A2, R2 | O3) is extended to (O1, A1, R1 | O2, A2, R2 | O3, A2, 0), and further down the pipeline the GAE is calculated from it.
In the loss this sample is masked out; its VALUE_TARGET is 0, while the respective advantage is non-zero.
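To illustrate how I understand the bootstrap value to enter the GAE computation (a toy numpy sketch, assuming a truncated episode so that V(O3) is used for bootstrapping; this is not RLlib's actual implementation):

```python
import numpy as np

# The extended 2-step episode from above, one row per timestep:
rewards = np.array([1.0, 1.0, 0.0])   # appended reward is 0
values  = np.array([0.5, 0.6, 0.7])   # critic outputs V(O1), V(O2), V(O3)
mask    = np.array([1.0, 1.0, 0.0])   # appended row is masked out of the loss

gamma, lam = 0.99, 0.95
advantages = np.zeros_like(rewards)
last_adv = 0.0
# Backward sweep over the real steps only; at t == 1, values[t + 1] is
# V(O3) -- the only place the appended row contributes.
for t in reversed(range(len(rewards) - 1)):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    last_adv = delta + gamma * lam * last_adv
    advantages[t] = last_adv

value_targets = mask * (advantages + values)   # 0 at the appended row
```

So V(O3) feeds the targets of the real steps, while the appended row's own target is zeroed out by the mask.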

Especially since the loss is masked, I do wonder: does the O3 sample have any value when it reaches my PPOLearner?

In my setup I want sample size == train batch size so that I know every sample is used equally (in expectation). But in practice I see the situation described above, where the train batch size is larger than the number of sampled timesteps.

For experiments I sometimes work with a very small sample size, and I suspect that the current implementation poisons my experiments with unused or less valuable samples during training, or is at least less efficient, since one more iteration has to be done during training to include the appended samples.

Hi Daraan! I believe that the extra timestep should not be poisoning your experiments with unused samples during training, and it hypothetically should not be affecting the loss (although I think it might affect efficiency). This is discussed a bit in the docs (Working with offline data — Ray 2.46.0), and specifically here in the code: ray/rllib/connectors/learner/add_one_ts_to_episodes_and_truncate.py at master · ray-project/ray · GitHub. So I don't think it is counted in the experiment.

Thank you very much for your reply. I verified that, as expected, the loss does not change. It therefore just feels inefficient to have these samples there. I wonder, wouldn't it make sense for PPO to use a different connector logic here that does not have this problem and computes the GAE in a different way?
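For instance (purely a hypothetical sketch of what I mean, not a proposal for RLlib's actual connector API), the bootstrap value could come from one extra critic forward pass on the final observation instead of a dummy row in the train batch:

```python
import numpy as np

def gae_with_separate_bootstrap(rewards, values, bootstrap_value,
                                gamma=0.99, lam=0.95):
    """GAE over the original T timesteps. `bootstrap_value` is V at the
    final observation (e.g. V(O3) above), obtained from one extra critic
    forward pass instead of from an appended dummy timestep."""
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)  # V(O1..OT) + bootstrap
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```

The train batch would then stay exactly T rows long, so sample size == train batch size and no loss masking of an artificial row would be needed.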