Extracting and storing per step agent state from RLlib rollouts


I am an ML researcher looking to assess the competency of RL agents. As part of this, when I execute rollouts I want to collect a large amount of information about the agent's internal state as well as the current observation, action, etc. Ideally, I would want to do this in a way that is agnostic to the RL algorithm (e.g. check if a Q-function exists and, if it does, take the Q-value distribution for a given state-action tuple).

I have been looking into custom callbacks, but am not sure whether they will let me accomplish this goal. In short, I want to be able to run rollouts in a distributed setting and have them output a dictionary (indexed by PID and worker-specific episode, perhaps) with episode keys and a sub-dictionary that is indexed by timestep and includes all of this information as key-value pairs. Can you help me with this?
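For concreteness, the structure I have in mind would look roughly like this (all keys and values here are hypothetical placeholders, just to illustrate the shape):

```python
# Hypothetical target structure: {(pid, episode_idx): {timestep: {key: value}}}
rollout_data = {
    (1234, 0): {  # worker PID, worker-local episode index
        0: {"obs": [0.1, -0.2], "action": 1, "value_estimate": 0.73},
        1: {"obs": [0.3, 0.0], "action": 0, "value_estimate": 0.65},
    },
}

# Look up everything recorded at timestep 1 of that episode:
step_info = rollout_data[(1234, 0)][1]
```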

Thanks so much in advance.

Perhaps you can add all this information in the Torch policy template: ray/torch_policy_template.py at master · ray-project/ray · GitHub

stats_fn returns a dictionary of the statistics that you need. You can override the function appropriately to include the stuff you said above.
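As a sketch of the idea, using plain Python stand-ins rather than RLlib's actual `SampleBatch` and `Policy` objects (so the attribute names like `q_values` are assumptions, not RLlib API):

```python
def custom_stats_fn(policy, train_batch):
    """Sketch of an overridden stats_fn: return a flat dict of statistics.

    `train_batch` is a plain dict standing in for RLlib's SampleBatch;
    `policy` is any object that may or may not expose Q-values.
    """
    rewards = train_batch["rewards"]
    stats = {"mean_reward": sum(rewards) / len(rewards)}
    # Algorithm-agnostic probing: only report Q stats if this policy has them.
    if hasattr(policy, "q_values"):
        stats["mean_q"] = sum(policy.q_values) / len(policy.q_values)
    return stats
```

The `hasattr` check is the part that makes this algorithm-agnostic: a PPO-style policy without Q-values simply produces fewer keys, rather than raising an error.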

Hi @Samuel_Showalter,

Welcome to the forum. It would be easier to answer your question if you could provide an expanded list of the types of information you would want to collect. The standard RL information is pretty easy to extract. But others would be much harder.

As a start you could add the following to your config and then figure out what other kind of information you want to augment it with.
`config["output"] = "/path/to/store/data/"`

This will create a set of JSON files in that directory with some information. As a start, I would recommend training for only a couple of steps, because the data can grow quite large quickly.

(accidentally posted this before it was complete)
Hi all,

Thanks so much for the ideas. I found a path forward that seems to be working: extend the default callback class and access the internal state of the models by calling the Episode object to get the policy. Then I augment the on_episode_step function to process the observation and output things like .value_function(), get_q_distributions() (where applicable), etc.
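The pattern I ended up with looks roughly like this. This is a minimal sketch with stand-in objects rather than RLlib's real DefaultCallbacks/Episode/Policy classes, and method names like `get_q_distributions` are algorithm-specific assumptions:

```python
from collections import defaultdict


class RolloutStateRecorder:
    """Sketch of the on_episode_step idea: build a nested dict
    {episode_id: {timestep: {key: value}}} as the rollout proceeds."""

    def __init__(self):
        self.records = defaultdict(dict)

    def on_episode_step(self, episode_id, timestep, obs, action, policy):
        entry = {"obs": obs, "action": action}
        # Algorithm-agnostic: only record what this policy actually exposes.
        if hasattr(policy, "value_function"):
            entry["value_estimate"] = policy.value_function()
        if hasattr(policy, "get_q_distributions"):
            entry["q_dist"] = policy.get_q_distributions(obs)
        self.records[episode_id][timestep] = entry
```

In the real callback you would pull the policy off the episode object inside on_episode_step; the `hasattr` checks are what keep it working across algorithms that do or don't expose Q-distributions.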

Are there policy methods that are standard across all agents? That is, are .value_function() and the action distribution consistent across algorithms?