1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
I’ve been working on a project that entailed going into episode trajectories and moving some things around to better serve a specialized shared critic. I noticed that each episode had an extra timestep at the end, and identified the AddOneTsToEpisodesAndTruncate connector attached in the PPOLearner class as the reason why.
That file says the extra timestep is necessary for value-function (VF) bootstrapping. But I looked at compute_value_targets, which handles that calculation, and it seems the mask derived from continues already takes care of it: on the step where an episode terminates, the ‘next value’ is ignored entirely, and an extra zero is appended to the value list so that this also works for the very last step.
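To make that concrete, here is a minimal sketch of how I understand the calculation for a single episode. This is not RLlib's actual compute_value_targets code; it assumes a standard GAE-style recursion, and the function name, argument names, and default gamma/lambda values are my own placeholders.

```python
import numpy as np


def value_targets_sketch(rewards, values, terminateds,
                         bootstrap_value=0.0, gamma=0.99, lambda_=0.95):
    """Hypothetical GAE-style value targets for ONE episode (illustration only)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # continues[t] is 0.0 on a terminal step and 1.0 otherwise.
    continues = 1.0 - np.asarray(terminateds, dtype=np.float64)
    # "Next value" for each step; one extra entry is appended for the last step.
    next_values = np.append(values[1:], bootstrap_value)

    targets = np.zeros_like(rewards)
    advantage = 0.0  # running GAE advantage, accumulated backwards in time
    for t in reversed(range(len(rewards))):
        # The continues mask zeroes out the next value on a terminal step.
        delta = rewards[t] + gamma * continues[t] * next_values[t] - values[t]
        advantage = delta + gamma * lambda_ * continues[t] * advantage
        targets[t] = advantage + values[t]
    return targets


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
terminateds = [0, 0, 1]  # the episode terminates on its last step

targets_zero = value_targets_sketch(rewards, values, terminateds, bootstrap_value=0.0)
targets_other = value_targets_sketch(rewards, values, terminateds, bootstrap_value=123.0)
```

In this sketch, with the last step marked as terminated, the appended value has no effect on the targets: targets_zero and targets_other come out identical. That is the behavior I am describing above.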
Am I missing something here? Was this connector used to make an earlier implementation of PPO work?
I should note that my implementation removes the connector, and the trajectory calculations seem to work just fine without it.