Disassemble info dict keys and add them as new entries of SampleBatch

Hey folks,

The 4th component returned from the env is the info dict. Let’s assume for simplicity that the env returns an info dict of the following form: {"a": ..., "b": ...}. In the postprocess_fn of my policy I would like to use the data from this info dict for some calculations. Via sample_batch[SampleBatch.INFOS]["a"] or sample_batch[SampleBatch.INFOS]["b"] one can access the data, but during the build process of the policy RLlib feeds a “dummy” batch of zero arrays to the postprocess_fn, and there sample_batch[SampleBatch.INFOS] is an array and not a dict.

Thus, can I disassemble the info dict entries and then store each as a new entry in the SampleBatch (before the postprocessing function is called)?
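
For illustration, here is a minimal sketch of the kind of postprocess_fn I have in mind (the keys "a" and "b" are just the placeholders from above):

from ray.rllib.policy.sample_batch import SampleBatch

def postprocess_fn(policy, sample_batch, other_agent_batches=None, episode=None):
    infos = sample_batch[SampleBatch.INFOS]
    # Works on real rollouts, where each entry is the env's info dict,
    # but fails during the dummy pass, where "infos" is a zero array.
    a_values = [info["a"] for info in infos]
    b_values = [info["b"] for info in infos]
    # ... custom calculations with a_values / b_values ...
    return sample_batch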

Hi @klausk55,

Could you just check the type of info and ignore it if it is not a dict?

infos = sample_batch[SampleBatch.INFOS]
# During the dummy pass, "infos" is a zero-filled array, not dicts.
if len(infos) == 0 or not isinstance(infos[0], dict):
    return sample_batch
...

Probably I could do it this way or find a similar workaround. But is there a way to add new keys to the SampleBatch whose values are taken from the info dict? Maybe there is a callback function which could do this (e.g. on_episode_step)?

@klausk55,

You could do this in on_episode_step or on_postprocess_trajectory, but it will not solve the issue you mentioned, because those functions are not called during the initialization process. Whichever way you go, you will need a special case to handle the initialization phase.
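
For example, a rough sketch of the callback variant (untested; the isinstance guard is the kind of special case I mean):

import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks

class InfoToColumnsCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        infos = postprocessed_batch["infos"]
        # Special case: skip batches that hold no real info dicts.
        if len(infos) == 0 or not isinstance(infos[0], dict):
            return
        # Disassemble the info dicts into new SampleBatch columns.
        for key in infos[0]:
            postprocessed_batch[key] = np.array([info[key] for info in infos])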

Another way that might work, though I cannot look into it today, is to set a ViewRequirement on a key in the info dictionary. I doubt it is currently set up to work like that, but it would be worth a check.
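
If it did work, I would expect it to look roughly like this inside the Policy’s constructor (pure speculation on my part; ViewRequirement and its data_col/shift arguments exist, but whether per-key views into "infos" are supported is exactly what would need checking):

from ray.rllib.policy.view_requirement import ViewRequirement

# Inside your Policy's __init__ (speculative): request an extra
# column "a" that RLlib would hopefully fill from the infos data.
self.view_requirements["a"] = ViewRequirement(data_col="infos", shift=0)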


@klausk55 I found another topic here that deals with something similar; here is @sven1977’s answer:

In the initial call to the loss function (during Policy setup), you should see an all-0s train_batch being passed into the loss function (including all possible SampleBatch columns).
Then, if you access some column in your loss function, RLlib will detect this and provide that column in all subsequent calls.
So I just tried this:

  • set a breakpoint in PPOTorchPolicy’s loss function.
  • run the rllib/agents/ppo/tests/test_ppo.py::test_ppo_compilation_and_lr_schedule test case with ray.init(local_mode=True)
  • For the initial test call to the loss, I see “infos” in train_batch properly initialized with 0s.
  • Then, if I access this column in the loss function to tell RLlib that “infos” are needed (e.g. by printing train_batch[“infos”]), I do see this column also in all subsequent loss calls.
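
In code, that “access” can be as small as touching the column once in a wrapped loss (a sketch, assuming ray’s function-based PPOTorchPolicy; ppo_surrogate_loss is PPO’s standard torch loss):

from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.agents.ppo.ppo_torch_policy import ppo_surrogate_loss

def loss_with_infos(policy, model, dist_class, train_batch):
    # Touching the column once tells RLlib that "infos" is required,
    # so it will be included in all subsequent train batches.
    _ = train_batch[SampleBatch.INFOS]
    return ppo_surrogate_loss(policy, model, dist_class, train_batch)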

Following @sven1977’s answer here, the same should also hold for the postprocess_trajectory() function in your policy:

You will get env infos automatically in your loss or postprocessing function (if these functions need this field, i.e. access it in a test pass).

@mannyv essentially the idea with a ViewRequirement is really good and I like it! But, as you said, it currently isn’t set up to work like that, at least that’s my impression after looking into it today :see_no_evil:

ViewRequirement has an index argument, and I’m not sure how it really works. I did some “dummy tests” to see the effect of this argument, but it seemed to me that nothing changed?!

I would say that info dict values whose structure or content may vary from timestep to timestep probably make it difficult or even impossible to set a ViewRequirement on a key of the infos dict.

Since on_postprocess_trajectory isn’t affected by the initialization process of RLlib (thanks for this hint, @mannyv), I guess one way to go is to outsource my calculations regarding advantages and value targets to this callback function. There, accessing the "actual_t" values from the dicts in postprocessed_batch["infos"] should be fine, and I can recalculate advantages and value targets.
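
Roughly what I plan to try (a sketch, untested; compute_advantages is RLlib’s standard GAE helper, "actual_t" is a key my env writes into the info dict, and last_r=0.0 assumes the batch ends with a complete episode):

import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.postprocessing import compute_advantages

class MyCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        infos = postprocessed_batch["infos"]
        if len(infos) == 0 or not isinstance(infos[0], dict):
            return  # dummy batch during initialization
        actual_t = np.array([info["actual_t"] for info in infos])
        # ... adjust rewards/discounting based on actual_t here ...
        policy = policies[policy_id]
        batch = compute_advantages(
            postprocessed_batch, 0.0,
            gamma=policy.config["gamma"],
            lambda_=policy.config["lambda"],
            use_gae=policy.config["use_gae"])
        # Copy back in case a new batch object is returned.
        postprocessed_batch["advantages"] = batch["advantages"]
        postprocessed_batch["value_targets"] = batch["value_targets"]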