How valuable are bootstrapped values for PPO training?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

I noticed that the number of timesteps sampled is not equal to the size of the batch the learner receives. E.g., I sample 1000 steps, yet my batch has a size > 1000; to be precise, it is 1000 + the number of episodes. I aim to have samples == train batch size.

I realized this is due to the AddOneTsToEpisodesAndTruncate connector in the Learner connector pipeline, which duplicates the last observation and action and appends them to the episode. (If I do 4 steps, I have 4 actions but 5 observations, so the final observation has no action of its own.) The reward of the appended timestep is set to 0.
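In other words (a toy sketch of my understanding, not the actual RLlib connector code; the values are made up):

```python
# A finished 2-step episode as collected by the sampler:
observations = ["O1", "O2", "O3"]   # T + 1 observations, incl. the reset obs
actions      = ["A1", "A2"]         # T actions
rewards      = [1.0, 1.0]           # T rewards

# What AddOneTsToEpisodesAndTruncate does, as far as I can tell:
observations.append(observations[-1])   # duplicate the last observation
actions.append(actions[-1])             # duplicate the last action
rewards.append(0.0)                     # dummy reward of 0

# The episode now yields one more trainable row than it has real steps,
# which is why the train batch grows by one sample per episode.
```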

I.e., a trajectory (O1, A1, R1 | O2, A2, R2 | O3) is extended to (O1, A1, R1 | O2, A2, R2 | O3, A2, 0), and further down the pipeline the GAE is calculated from it.
In the loss this sample is masked out; its VALUE_TARGET is 0, while the respective advantage is non-zero.
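To illustrate how I understand the bootstrap value to enter the GAE computation (a toy numpy sketch, assuming a truncated episode so that V(O3) is used for bootstrapping; this is not RLlib's actual implementation):

```python
import numpy as np

# The extended 2-step episode from above, one row per timestep:
rewards = np.array([1.0, 1.0, 0.0])   # appended reward is 0
values  = np.array([0.5, 0.6, 0.7])   # critic outputs V(O1), V(O2), V(O3)
mask    = np.array([1.0, 1.0, 0.0])   # appended row is masked out of the loss

gamma, lam = 0.99, 0.95
advantages = np.zeros_like(rewards)
last_adv = 0.0
# Backward sweep over the real steps only; at t == 1, values[t + 1] is
# V(O3) -- the only place the appended row contributes.
for t in reversed(range(len(rewards) - 1)):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    last_adv = delta + gamma * lam * last_adv
    advantages[t] = last_adv

value_targets = mask * (advantages + values)   # 0 at the appended row
```

So V(O3) feeds the targets of the real steps, while the appended row's own target is zeroed out by the mask.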

Especially since the loss is masked, I do wonder: does the O3 sample have any value when it reaches my PPOLearner?

In my setup I want sample size == train batch size so that I know every sample is used equally (in expectation). But in practice I see the situation described above, where the train batch size is larger than the number of sampled timesteps.

For experiments I sometimes work with a very small sample size, and I suspect that the current implementation poisons my experiments with unused or less valuable samples during training, or is at least less efficient, since one more iteration has to be done during training to include the appended samples.

Hi Daraan! I believe that the extra timestep should not be poisoning your experiments with unused samples during training, and it hypothetically should not be affecting the loss (although I think it might affect efficiency). This is discussed a bit in the docs (Working with offline data — Ray 2.46.0), and specifically here in the code: ray/rllib/connectors/learner/add_one_ts_to_episodes_and_truncate.py at master · ray-project/ray · GitHub. So I don't think it is counted in the experiment.

Thank you very much for your reply. I verified that, as expected, the loss does not change. It therefore just feels inefficient to have these samples there. I wonder, wouldn't it make sense for PPO to use a different connector logic here that does not have this problem and computes the GAE in a different way?
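For instance (purely a hypothetical sketch of what I mean, not a proposal for RLlib's actual connector API), the bootstrap value could come from one extra critic forward pass on the final observation instead of a dummy row in the train batch:

```python
import numpy as np

def gae_with_separate_bootstrap(rewards, values, bootstrap_value,
                                gamma=0.99, lam=0.95):
    """GAE over the original T timesteps. `bootstrap_value` is V at the
    final observation (e.g. V(O3) above), obtained from one extra critic
    forward pass instead of from an appended dummy timestep."""
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)  # V(O1..OT) + bootstrap
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```

The train batch would then stay exactly T rows long, so sample size == train batch size and no loss masking of an artificial row would be needed.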