Horizon curriculum in generative adversarial imitation learning

Hi all,
I am working on generative adversarial imitation learning with RLlib. I would like to understand whether it is possible to use the rollout_fragment_length, horizon and soft_horizon parameters to run the following experiment for stabilizing training. Imagine that I use PPO with GAE and that I want to build a training batch composed of several episode trajectories, where each trajectory has at most T steps. Each episode trajectory must be divided into pieces of length h such that bootstrapping is applied at the end of each piece.

  • As far as I understand, horizon forces the rollout worker to reset when the horizon is reached, so it should be set to T because I don't necessarily want to stop the rollout after the first h steps of an episode.
  • rollout_fragment_length should be set to h so that bootstrapping is applied regularly every h steps within an episode, and not only at the end of the episode after T steps.
  • I have to set train_batch_size > k*T to have several trajectories in the training batch.
    I think that I can set soft_horizon to False because I want to reset after T steps.
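In config terms, the setup I have in mind would look roughly like this (T, h and k are placeholder values I chose for illustration):

```python
T = 200   # max episode length (horizon)
h = 50    # fragment length after which bootstrapping is applied
k = 4     # target number of episodes per train batch

config = {
    "horizon": T,                    # hard-reset the env after T steps
    "rollout_fragment_length": h,    # truncate + bootstrap every h steps
    "soft_horizon": False,           # actually reset at the horizon
    "train_batch_size": k * T,       # room for roughly k full episodes
}
```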

Does this setting have the effect that I describe?

As I understand it, this seems to be correct. So:

# reset env after T steps
config["horizon"] = T
# collect rollouts of length h from each worker
config["rollout_fragment_length"] = h

The combined experiences from rollouts of length h are concatenated into a train batch of the configured "train_batch_size", which is used for SGD.
In SGD, PPO does "num_sgd_iter" iterations over the training batch, dividing it into minibatches of size "sgd_minibatch_size" (shuffling the experiences by default).
Correct, you can keep config["soft_horizon"] = False if you do want to reset your env after T steps.
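To make the batch arithmetic concrete, here is a toy calculation with placeholder numbers (illustrative values, not recommendations):

```python
# Illustrative numbers only; pick your own values.
h = 50                     # rollout_fragment_length
train_batch_size = 800     # fragments are concatenated until this size is reached
sgd_minibatch_size = 128
num_sgd_iter = 10

fragments_per_batch = train_batch_size // h   # 16 fragments of length h
# Each SGD epoch shuffles the 800 timesteps and walks over them in
# minibatches of 128; this is repeated num_sgd_iter times.
minibatches_per_epoch = train_batch_size // sgd_minibatch_size
```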

I think the config params are quite well documented.

Did you test if this works and does what you want?


Thanks for the answer. To be more specific, my concern is about how GAE is estimated in this context. As far as I understand, if we set rollout_fragment_length = h and horizon = T with k*h <= T, and a train batch size equal to T, then (assuming I have one worker) my episode will end after T steps, assuming the agent is not eliminated by the simulator earlier. In my training batch I will have contiguous trajectory pieces of length h, which means that GAE will be estimated separately for each chunk of h steps, repeated k times up to the end of the episode.

So I assume that the gradient will be averaged over k trajectory pieces of length h, and not over a full trajectory of length T as in the original formulation (where GAE would be estimated over the whole T steps). Right?
I didn't find a way to check that GAE is estimated over h steps and not T steps.
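One way to see the difference is to compute GAE by hand on a toy episode, once on an h-step fragment (with a bootstrap value at the truncation point) and once over the full T steps. This is a self-contained sketch of the standard GAE recursion, not RLlib's internal code, and the numbers are made up:

```python
import numpy as np

def gae(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout fragment.

    `bootstrap_value` is the critic's estimate V(s) for the state that
    follows the last step of the fragment (0.0 if the episode ended there).
    """
    values_ext = np.append(values, bootstrap_value)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae_acc = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        gae_acc = deltas[t] + gamma * lam * gae_acc
        advantages[t] = gae_acc
    return advantages

# One episode of T = 4 steps, split into k = 2 fragments of h = 2.
rewards = np.array([1.0, 1.0, 1.0, 1.0])
values  = np.array([0.5, 0.5, 0.5, 0.5])

# Truncated at a fragment boundary: bootstrap with V of the next state.
adv_chunk1 = gae(rewards[:2], values[:2], bootstrap_value=0.5)
# Full trajectory ending in a terminal state: no bootstrap.
adv_full = gae(rewards, values, bootstrap_value=0.0)
```

The advantage of the first step differs between the two cases: the h-step estimate truncates the discounted sum after h TD residuals and replaces the tail with the critic's bootstrap value, while the full-trajectory estimate accumulates residuals over all T steps.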

Hi @Koeberle_Yann,
If I am reading this correctly, you are saying that you want the episode to end whenever the environment is done at step T. If that is the case then you can leave horizon at its default value of None. When it is None, the environment controls when an episode ends. You would set horizon only if you want to artificially end an episode after a maximum number of timesteps, which might be sooner than the environment would naturally end, or if the environment never ends on its own.

Hi @mannyv. Well, I may want to end my episode at a specific T so that, early in training when the policy is not very realistic and covariate shift is high, I don't drift too far from my expert's support. Two horizon curricula are possible: either I run the policy for up to T steps with rollout_fragment_length = T and increase T in my curriculum, or I let the policy run for at most T steps, set h < T, and increase h in my curriculum. In the second case, I estimate GAE not on the full trajectory of length T but on chunks of length h. I am not sure whether that last sentence is true, and I did not find a way to check it.
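For what it's worth, the second curriculum could be sketched as a schedule over rollout_fragment_length while the episode cap T stays fixed. This is purely illustrative: fragment_schedule, h0 and the growth factor are hypothetical names and values I made up, not an RLlib API.

```python
T = 512  # fixed episode cap (horizon) across all curriculum phases

def fragment_schedule(phase, h0=32, growth=2, cap=T):
    """Hypothetical schedule: fragment length h0 * growth**phase, capped at T."""
    return min(h0 * growth ** phase, cap)

# Build one config per curriculum phase; only the bootstrap interval changes.
configs = []
for phase in range(4):
    h = fragment_schedule(phase)
    configs.append({
        "horizon": T,                    # episode cap stays fixed
        "rollout_fragment_length": h,    # GAE bootstrap interval grows
        "soft_horizon": False,
    })
```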