Medium: It contributes significant difficulty to completing my task, but I can work around it.
Why is it that for rollout_fragment_length=1, using complete episode rollouts takes significantly more episodes for TD3 to learn Pendulum than truncated episode rollouts?
The only difference I found was that for truncated episodes, each timestep had a different unroll id while for complete episodes, all timesteps of the same episode had the same unroll id, but that doesn’t seem like it would influence training.
When you select complete_episodes, rollout_fragment_length specifies a lower bound on the number of timesteps to sample, whereas in truncate_episodes mode it is the exact number of timesteps.
complete_episodes will always roll out an entire episode and return that many timesteps.
So if rollout_fragment_length is 1 and an episode is 100 steps long before it terminates, then you will get 100 timesteps from that sample, not 1.
On the other hand, if rollout_fragment_length is 200 and you have 5 episodes of lengths [10, 10, 20, 30, 200], then you will get a total of 270 timesteps from that sample. The reason is that after the 4th episode you have only collected 70 timesteps, which is less than rollout_fragment_length, so a 5th episode is rolled out in full.
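A minimal sketch of that accumulation logic (this is not RLlib's actual sampler code; `run_one_episode` is a hypothetical helper that rolls out one full episode and returns its timesteps):

```python
def collect_complete_episodes(run_one_episode, rollout_fragment_length):
    """Keep sampling whole episodes until at least rollout_fragment_length
    timesteps have been collected; the final episode is never cut short."""
    batch = []
    while len(batch) < rollout_fragment_length:
        batch.extend(run_one_episode())
    return batch

# Toy check of the example above: episode lengths [10, 10, 20, 30, 200]
# with rollout_fragment_length=200 give 10+10+20+30+200 = 270 timesteps.
lengths = iter([10, 10, 20, 30, 200])
batch = collect_complete_episodes(lambda: [0] * next(lengths), 200)
assert len(batch) == 270
```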
The most likely cause of what you are observing is how many times the policy is updated in the two configurations. With a rollout_fragment_length of 1 in truncate mode, TD3 would be updated after every step of the environment(s), whereas in complete_episodes mode it would only be updated each time an episode terminates. If your episodes are long, that can lead to many fewer updates for complete_episodes versus truncate_episodes.
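For reference, a rough sketch of the two configurations being compared, written as an RLlib-style config dict (exact import paths, env name, and defaults depend on your Ray version; treat this as illustrative rather than a drop-in script):

```python
truncate_config = {
    "env": "Pendulum-v1",
    "batch_mode": "truncate_episodes",  # sample exactly 1 timestep per rollout
    "rollout_fragment_length": 1,       # -> roughly one update per env step
}

complete_config = {
    "env": "Pendulum-v1",
    "batch_mode": "complete_episodes",  # always return whole episodes
    "rollout_fragment_length": 1,       # lower bound only; a 200-step episode
                                        # still yields 200 timesteps per rollout
}
# With complete_episodes, updates happen only when an episode terminates, so
# long Pendulum episodes mean far fewer updates per environment step.
```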