Query policy from within environment, without logging action?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve noticed that in BaseEnv and ExternalEnv there are a few references to logging off-policy actions, i.e. actions that happened in the environment without the policy being queried for them, but which are still logged to the sample batch for training.

I wonder if the opposite is also possible: Query the policy for an action, but keep it out of the sample batch?

I can think of two ways of doing it; curious if there are any better ideas, though:

  1. Pass a reference to the policy into the env using a callback, and then call policy.compute_single_action() in the env (sketched below, after this list). This would completely bypass anything to do with the sample batch, but the annoying thing is that I might want to do this from an environment that’s wrapped in wrappers that change what the observation looks like, so I’d have to manually duplicate that logic at the point where I query for an action.
  2. Mark the action as “invisible” using an entry in the infos dict, and then use an on_postprocess_trajectory callback to remove those steps from the sample batch. This way I wouldn’t have to worry about any observation processing, but I’d have to make sure I adjust the rewards and advantages in the right way.
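
For reference, here’s roughly what I have in mind for option 1. This is only a sketch, untested: `PolicyQueryMixin`, `set_policy` and `query_policy` are made-up names for illustration (not RLlib APIs), and the callback import path can differ between Ray versions.

```python
# Sketch of option 1 (untested): hand the policy to each sub-env via a
# callback, then query it directly from inside the env.
from ray.rllib.algorithms.callbacks import DefaultCallbacks  # path varies by Ray version


class PassPolicyToEnv(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
        # Give every sub-env a handle to the (default) policy.
        for env in base_env.get_sub_environments():
            env.set_policy(policies["default_policy"])


class PolicyQueryMixin:
    """Hypothetical mixin for my env class."""

    def set_policy(self, policy):
        self._policy = policy

    def query_policy(self, obs):
        # Bypasses RLlib's sample collection entirely, so nothing is logged
        # for training. Caveat: `obs` must already look like what the policy
        # expects, i.e. any wrapper/preprocessing logic has to be redone here.
        action, _, _ = self._policy.compute_single_action(obs)
        return action
```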

There isn’t any easier built-in way to do this, right?

And for my specific use case: if I do it using the callback (2), and if the steps to be deleted are always at the beginning of the episode and carry zero reward, then I could just remove them without worrying about advantages, correct? Because those only ever look forward, not back. Or would the presence of additional steps at the beginning of an episode affect things in other ways? (I’m happy if it works even just with PG, in case this is an issue only in more elaborate algorithms.)

Thank you all!!

Hi @mgerstgrasser,

You are correct that you would need to use a callback here, because the env has no access to the policy.

You should consider the on_episode_step callback. That one has all the pieces you need, and it even provides the episode object, which you can use to store temporary data and custom_metrics.

A few things to keep in mind with this callback:

It will get called on every rollout worker, for each env that worker has (if num_envs_per_worker > 1).

This may have changed, but back when I used this approach the callback was invoked after the actions provided in step were applied. This was fine except for the first step of the environment; I had to use on_episode_start to handle that case and not miss the first observation.

The reason I like this approach is that nothing gets added to the sample batch unless you decide to add it.
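
Roughly what I mean, as a sketch only (based on the old callbacks API; exact signatures, import paths, and whether the reset observation is already available in on_episode_start can vary by Ray version):

```python
# Sketch: run extra inference inside the callbacks and keep the results on
# the episode object (user_data / custom_metrics), so nothing ever enters
# the sample batch.
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class ExtraInferenceCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
        episode.user_data["extra_actions"] = []
        # on_episode_step only fires after an action has been applied, so the
        # first (reset) observation has to be handled here.
        obs = episode.last_observation_for()
        if obs is not None:
            action, _, _ = policies["default_policy"].compute_single_action(obs)
            episode.user_data["extra_actions"].append(action)

    def on_episode_step(self, *, worker, base_env, policies=None, episode, **kwargs):
        obs = episode.last_observation_for()
        action, _, _ = policies["default_policy"].compute_single_action(obs)
        # Stored only on the episode; never added to the train batch.
        episode.user_data["extra_actions"].append(action)

    def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
        # e.g. surface something as a custom metric
        episode.custom_metrics["num_extra_actions"] = len(
            episode.user_data["extra_actions"]
        )
```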

@mannyv Ah, interesting idea - I had seen that on_episode_step wasn’t being called on the first (reset) step, so I dismissed it as an option, but you’re right: in combination with on_episode_start and on_episode_end you could get everything covered.

However, I don’t see how I can modify what’s being sent to the sample batch from these callbacks. I only get the Episode object, and I can’t find a way to use it to delete data. I can add custom metrics, but otherwise the Episode object seems to be pretty much just a read-only view of the last step.

Hi @mgerstgrasser,

Yes, you are right: from the perspective of the sample batch it is essentially read-only. Technically it isn’t, because if you get the last obs or raw obs for an agent, you can change data inside the numpy array, and that change will be reflected in the samples buffer, since it holds a reference to the same object. But if you replace the numpy object itself, that change will not be reflected.

But if you wanted to do extra inference, you could do that and add the results to the info dictionary. The samples buffer is also holding a reference to that dictionary, so if you add or modify a key, that change will be reflected back in the buffer.
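
For example, something along these lines (a sketch, not tested; the "extra_action" key is just an arbitrary name):

```python
# Sketch: because the samples buffer holds a reference to the same info dict,
# mutating it inside the callback shows up in the collected samples.
def on_episode_step(self, *, worker, base_env, policies=None, episode, **kwargs):
    info = episode.last_info_for()  # same dict object the samples buffer holds
    if info is not None:
        obs = episode.last_observation_for()
        action, _, _ = policies["default_policy"].compute_single_action(obs)
        info["extra_action"] = action  # reflected back into the buffer
```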

Another approach:
I was thinking back to a post from a few days ago, when you asked about getting the probabilities of the actions and I suggested you could return all of them by creating a custom policy that adds them in the extra_action_out method. I am wondering if you can do the same thing here.

You would need to add any new keys to the policy’s ViewRequirements.
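
Something like this, as a version-dependent sketch for the torch PG policy (assuming a Categorical action distribution; class names, import paths, and the exact method signature may differ across Ray releases):

```python
# Sketch (old policy API, torch): emit an extra per-step tensor from the
# policy and register a ViewRequirement so it is carried into train batches.
from ray.rllib.algorithms.pg.pg_torch_policy import PGTorchPolicy
from ray.rllib.policy.view_requirement import ViewRequirement


class PGWithActionProbs(PGTorchPolicy):
    def __init__(self, observation_space, action_space, config):
        super().__init__(observation_space, action_space, config)
        # Without this, the extra key may be dropped before training.
        self.view_requirements["action_probs"] = ViewRequirement()

    def extra_action_out(self, input_dict, state_batches, model, action_dist):
        out = super().extra_action_out(input_dict, state_batches, model, action_dist)
        # Assuming a Categorical wrapper around torch.distributions.Categorical.
        out["action_probs"] = action_dist.dist.probs
        return out
```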

I guess this is all to say that you have several approaches you could take, but none of them is really “straightforward.”

Perhaps this is a use case for Connectors, but as far as I know (which is admittedly not much) those are still being developed.

Yep, I figured out several ways in the end:

  1. Use the on_postprocess_trajectory callback - I think for simple algorithms, removing things after postprocessing doesn’t mess anything up. It might be a problem for more complex algorithms, e.g. if they normalise advantages and include the “hidden” steps in that. (A rough sketch follows after this list.)
  2. Hack in my own special pre-postprocess callback and remove things before postprocessing.
  3. Use on_sub_environment_created and/or on_episode_start to pass a reference to the worker and/or policies into the environment, then directly call policy.compute_single_action in the environment. This completely bypasses the entire RLlib sample collection pipeline, so it should definitely do the trick, and it has the added advantage that you can e.g. switch exploration on or off in the call as you like, but you do have to preprocess the obs there yourself.
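
For completeness, here is roughly what (1) looked like for me. This is a sketch rather than my exact code: it assumes the env flags steps via an arbitrary info key (here "hidden"), no RNN state columns in the batch, and the old callbacks API.

```python
# Sketch of (1): drop steps that were flagged in the env via an info key.
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class DropHiddenSteps(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs,
    ):
        infos = postprocessed_batch[SampleBatch.INFOS]
        keep = np.array([not info.get("hidden", False) for info in infos])
        if keep.all():
            return
        # Filter every column in place; the same batch object is what gets
        # appended for training, so the flagged steps disappear.
        for key in list(postprocessed_batch.keys()):
            postprocessed_batch[key] = postprocessed_batch[key][keep]
        postprocessed_batch.count = int(keep.sum())  # hacky, but keeps count in sync
```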

In any case, all of those had the same effect in my setting. None of them is perfect, but it was just for a single experiment, so good enough.

Thank you for your help, as always!