1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
I noticed that the number of timesteps sampled is not equal to the size of the batch the learner receives. For example, I sample 1000 steps, but my batch has a size > 1000; to be precise, it is 1000 + the number of episodes in the sample. I aim to have the number of sampled timesteps equal to the train batch size.
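To make the arithmetic concrete, here is a hypothetical illustration of the size mismatch I observe (the episode count is made up for the example):

```python
# Hypothetical numbers chosen for illustration, not measured values.
sampled_timesteps = 1000          # timesteps collected per iteration
num_episodes_in_sample = 8        # episodes that finished or were truncated in the sample

# Each episode receives one extra, artificial timestep (see below), so the
# batch that arrives at the learner grows to:
learner_batch_size = sampled_timesteps + num_episodes_in_sample
assert learner_batch_size == 1008   # > the intended train batch size of 1000
```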
I realized this is due to the `AddOneTsToEpisodesAndTruncate` connector in the Learner connector pipeline, which duplicates the last observation and action and appends them to the episode (if I take 4 steps, I have 4 actions but 5 observations); the reward of the appended timestep is set to 0.
That is, a trajectory (O1, A1, R1 | O2, A2, R2 | O3) is extended to (O1, A1, R1 | O2, A2, R2 | O3, A2, 0), and further down the pipeline GAE is calculated from it.
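As a plain-Python sketch of the extension described above (using simple tuples rather than RLlib's actual episode/connector classes):

```python
# Hypothetical sketch of the extension described above, not RLlib's actual code.
trajectory = [
    ("O1", "A1", "R1"),
    ("O2", "A2", "R2"),
]
last_obs = "O3"  # final observation, initially without an action or reward

# The connector (as described above) turns the dangling last observation into
# a full timestep by duplicating the last action and setting the reward to 0:
extended = trajectory + [(last_obs, trajectory[-1][1], 0.0)]
# -> [('O1', 'A1', 'R1'), ('O2', 'A2', 'R2'), ('O3', 'A2', 0.0)]
```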
In the loss this sample is masked out; its `VALUE_TARGET` is 0, while the corresponding advantage is non-zero.
Especially since the loss is masked, I wonder: does the O3 sample have any value at all once it reaches my `PPOLearner`?
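For reference, a hedged sketch of what masking the appended timestep out of a PPO-style loss would look like; the tensor values and names such as `loss_mask` are assumptions for illustration, not RLlib's actual `PPOLearner` code:

```python
import torch

# Three timesteps; the last row is the appended O3 step.
logp_ratio   = torch.tensor([1.02, 0.98, 1.00])
advantages   = torch.tensor([0.50, -0.30, 0.12])  # non-zero even for the padded step
value_preds  = torch.tensor([0.40, 0.20, 0.10])
value_target = torch.tensor([0.45, 0.25, 0.00])   # VALUE_TARGET is 0 for the padded step
loss_mask    = torch.tensor([1.0, 1.0, 0.0])      # padded step masked out

num_valid = loss_mask.sum()
# Simplified (unclipped) surrogate and value terms, just to show the masking:
policy_loss = -(loss_mask * logp_ratio * advantages).sum() / num_valid
value_loss  = (loss_mask * (value_preds - value_target) ** 2).sum() / num_valid
# The padded O3 row contributes nothing to either term; it only occupies
# space in the train batch.
```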
In my setup I want sample size == train batch size, so that I know every sample is used equally (in expectation). But in practice I see the situation described above, where the batch size is larger than the number of sampled timesteps.
For experiments I sometimes work with a very small sample size, and I suspect that the current implementation poisons my experiments with unused or less valuable samples during training, or is at least less efficient, since one more iteration has to be done during training to include the appended samples.