I am working on generative adversarial imitation learning with RLlib. I would like to understand whether it is possible to use the rollout_fragment_length, horizon, and soft_horizon parameters to run the following experiment for stabilizing training. Imagine that I use PPO with GAE and that I want to build a training batch composed of several episode trajectories, where each trajectory has at most T steps. Each episode trajectory must be divided into pieces of length h such that bootstrapping is applied at the end of each piece.
- As far as I understand, horizon forces the rollout worker to reset when the horizon is reached, so it should be set to T, because I don't necessarily want to stop the rollout after the first h steps of an episode.
- rollout_fragment_length should be set to h so that bootstrapping is applied regularly every h steps within an episode, and not only at the end of the episode after T steps.
- I have to set train_batch_size > k*T to fit several (about k) trajectories in the training batch.
I think I can set soft_horizon to False because I want a hard reset after T steps.
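To make this concrete, here is a minimal sketch of the config dict I have in mind. The values T=200, h=50, k=4 are placeholders just for illustration, and I am assuming the classic RLlib config keys (horizon, soft_horizon, rollout_fragment_length, batch_mode, train_batch_size):

```python
# Placeholder values for illustration only.
T = 200   # max episode length: hard horizon
h = 50    # fragment length after which bootstrapping should happen
k = 4     # rough number of full trajectories I want per training batch

config = {
    "horizon": T,                       # reset the env after T steps
    "soft_horizon": False,              # actually reset, don't just bootstrap at T
    "rollout_fragment_length": h,       # truncate sample fragments every h steps
    "batch_mode": "truncate_episodes",  # bootstrap the value at fragment ends
    "train_batch_size": k * T,          # room for roughly k full trajectories
}

# Sanity checks on the relationships described above.
assert config["horizon"] % config["rollout_fragment_length"] == 0
assert config["train_batch_size"] >= k * config["horizon"]
```

The intent is that with batch_mode="truncate_episodes", each h-step fragment ends with a bootstrapped value estimate unless the episode actually terminated there.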
Does this setting have the effect I describe?