Accessing info dicts in postprocessing callback


I recently updated from RLlib 0.8.6 to RLlib 1.0.1 and also converted from TF to Torch. Previously, I was using the info dict from my custom environment to pass along values that I used in the postprocessing callback. However, now I don’t see the info key when I check the keys on the original batch (using original_batches[agent_id]) in my on_postprocess_trajectory callback. I only see:

(pid=17353) dict_keys(['obs', 'actions', 'rewards', 'dones', 'eps_id', 'agent_index', 'new_obs', 'action_dist_inputs', 'action_logp', 'vf_preds', 'unroll_id', 'advantages', 'value_targets'])

I tried changing the view_requirements function to include SampleBatch.INFOS, but this did not work. How can I get the info dicts to show up in this callback?


Thanks for asking this here @cwerner !
There is indeed a bug in 1.0.1, described (and fixed) here:
The workaround for this version is to set _use_trajectory_view_api to False in your Trainer’s config.
This has been fixed in subsequent versions (master and upcoming 1.2.x).
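For reference, the 1.0.1 workaround is a single config flag (`_use_trajectory_view_api` is the flag mentioned above; the other entries shown here are just placeholders for your own settings):

```python
# Workaround for RLlib 1.0.1 only: disable the trajectory view API so that
# env infos are carried through to on_postprocess_trajectory.
config = {
    "framework": "torch",  # placeholder; keep your own trainer settings
    "_use_trajectory_view_api": False,
}
```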

Hi @sven1977,

Any chance you could show an example of how to add a field from the info dict to the trajectory view api, rather than disabling it? I’m actually facing the same use case.


@roireshef, let me try to get this to work w/o disabling it. …


Thanks, looking forward to it. I’ll be following this thread.

If I understood correctly, the trajectory view API reduces the memory footprint by avoiding streaming all the information through the entire pipeline. Please correct me if I’m wrong; I assumed it’s generally better to use it than not to, since it should probably make things run faster.

It’d be nice to be able to customize it to include only those additional fields we want access to. I’m not sure how easy it is to do so, but I just couldn’t find any documentation on where to start. If you have any example of how to add a field to it, that should be enough for me.


It’s actually quite hacky to do this on top of 1.0.1 (several files need to be changed because of bugs).
The recommended way is to use the current master (or upcoming 1.2.x), where this has been fixed.
You will get env infos automatically in your loss or postprocessing function (if these functions need this field, i.e. access it in a test pass).

Documentation is in-flight (doc PR is in review).

Yes, the speedup on Atari for PPO was ~20%. For more “learn-heavy” algos (lots of updates vs action inference) like DQN or SAC, it’s not really faster, but definitely not slower either.

Hi @sven1977, I am using current master with the 1.2.0.dev0 wheel from a few days ago. It’s only with "_use_trajectory_view_api" set to False that the info dict appears in the postprocess_trajectory function.

I’m on commit 407a3523f367a1e2f124b4bdfb1aef3a2d4340a7 from today.

In the initial call to the loss function (during Policy setup), you should see an all-0s train_batch being passed into the loss function (including all possible SampleBatch columns).
Then, if you access some column in your loss function, RLlib will detect this and provide that column in all subsequent calls.
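A minimal, self-contained simulation of that detection mechanism (a simplification for illustration only, not RLlib’s actual implementation; the class and function names here are made up):

```python
# Simplified simulation of the "test pass": the all-0s dummy batch records
# which columns the loss function touches, and only those columns are then
# generated in subsequent real passes.
class TrackingDict(dict):
    """Dict that remembers which keys were accessed."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accessed_keys = set()

    def __getitem__(self, key):
        self.accessed_keys.add(key)
        return super().__getitem__(key)

def my_loss(train_batch):
    # Reading "infos" here is what flags the column as required.
    _ = train_batch["infos"]
    return sum(train_batch["rewards"])

# Initial all-0s dummy batch, as passed during Policy setup.
dummy = TrackingDict({"obs": [0, 0], "rewards": [0, 0], "infos": [0, 0]})
my_loss(dummy)
print(sorted(dummy.accessed_keys))  # ['infos', 'rewards']
```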

So I just tried this:

  • Set a breakpoint in PPOTorchPolicy’s loss function.
  • Run the rllib/agents/ppo/tests/ test case with ray.init(local_mode=True).
  • For the initial test call to the loss, I see "infos" in train_batch properly initialized with 0s.
  • Then, if I access this column in the loss function to tell RLlib that "infos" is needed (e.g. by printing train_batch["infos"]), I do see this column in all subsequent loss calls as well.

Ah yes, I do see another error in torch now that has to do with the attempted conversion of the TrackingDict (from {} to a torch tensor, which fails).
Will fix this right now. …

I was wondering how QMIX works (it does access the "infos" column on each train step), but it completely overrides learn_on_batch, so there is no tracking dict!

PR with a fix:

I am noticing a similar issue with the current master. Sven, can you elaborate on how exactly the auto detection of required fields is intended to work? In the postprocessing function, I am currently extracting rewards from the info dict and those rewards are then processed to advantages. If the info dict is populated with 0 in the first pass, I cannot compute these advantages, and if I don’t compute the advantages, the loss function will not be able to do anything meaningful. Do I need a separate handling of the initial pass to the postprocessing/loss function that detects if the info dict is populated with 0 or with an actual dict?
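One way to handle this (a sketch, assuming the initial test pass fills "infos" with 0s rather than real dicts; the "bonus" key and the function name are purely illustrative) is to check the element type and skip the real computation during the dummy pass:

```python
import numpy as np

def postprocess(batch):
    infos = batch["infos"]
    # During the initial test pass, "infos" holds 0s, not actual env dicts,
    # so there is nothing meaningful to extract yet.
    if len(infos) == 0 or not isinstance(infos[0], dict):
        return batch  # dummy pass: leave the batch unchanged
    # Real pass: pull the extra per-step rewards out of the env infos.
    bonus = np.array([info.get("bonus", 0.0) for info in infos])
    batch["rewards"] = np.asarray(batch["rewards"], dtype=np.float64) + bonus
    return batch

# Dummy pass: zeros instead of dicts -> batch is returned untouched.
dummy = {"infos": np.zeros(3), "rewards": np.zeros(3)}
print(postprocess(dummy)["rewards"])  # [0. 0. 0.]

# Real pass: actual info dicts from the env.
real = {"infos": [{"bonus": 1.0}, {}, {"bonus": 0.5}],
        "rewards": np.array([0.1, 0.2, 0.3])}
print(postprocess(real)["rewards"])  # [1.1 0.2 0.8]
```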