MultiAgentEnv: actions computed on stale observations

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

The problem I’m trying to solve involves agents whose actions affect the environment, so that the observations used to compute the next agent’s action (which has already been computed and stored in the action_dict) are stale. In other words, suppose we are computing the action for agent n. The action taken by agent n changes the non-agent components of the environment, so the action for agent n+1 was computed from stale observations. What is the proper way to get around this?

Here is a simplified env.step() function that I am using:

  def step(self, action_dict):
      obs, rew, terminated, truncated, info = {}, {}, {}, {}, {}
      for agent_id, action in action_dict.items():
          if (not self.terminateds[agent_id]) and (not self.truncateds[agent_id]):
              # Ideally we would recompute this agent's action right here, so that
              # its observation (e.g. the current price) reflects the actions
              # already applied by agents processed earlier in this loop.
              (
                  obs[agent_id],
                  rew[agent_id],
                  terminated[agent_id],
                  truncated[agent_id],
                  info[agent_id],
              ) = self._actual_agents[agent_id].step(action=action)
      return obs, rew, terminated, truncated, info

To reiterate, each agent’s action might modify the environment, and I want each subsequent agent’s observation to reflect those changes when its action is computed. My concern about manually recomputing actions at the commented line in the snippet above is that Ray may have already stored the originally computed actions somewhere for the algorithm, so if I change an action, the rewards that result from taking the new action will be attributed to the outdated action.

Two questions result:

  • In a MultiAgentEnv, what happens to the computed actions between the time they are computed and the time the action_dict is passed to env.step()?
  • How would one go about updating each agent’s observations so that the policies learn correctly and the algorithm associates the returned rewards with the correct input observations and actions? To use PettingZoo’s terminology, I would like to understand how to implement an Agent-Environment Cycle (instead of a Parallel environment) using a pure Ray MultiAgentEnv, i.e. without using Ray’s PettingZoo AECEnv wrapper.

Hi @jacob-thrackle,

It is probably possible, though annoyingly complicated, to do this by storing rewrite information in the info dictionary and then applying the rewrites in the on_postprocess_trajectory callback.
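
Roughly, the idea would be something like the sketch below (untested; the "corrected_reward" key is just a made-up convention that your env would have to write into each agent’s info dict once all actions for a step have actually been resolved):

  from ray.rllib.algorithms.callbacks import DefaultCallbacks
  from ray.rllib.policy.sample_batch import SampleBatch


  class RewriteStaleTransitions(DefaultCallbacks):
      def on_postprocess_trajectory(
          self,
          *,
          worker,
          episode,
          agent_id,
          policy_id,
          policies,
          postprocessed_batch,
          original_batches,
          **kwargs,
      ):
          infos = postprocessed_batch.get(SampleBatch.INFOS, [])
          rewards = postprocessed_batch[SampleBatch.REWARDS]
          for t, step_info in enumerate(infos):
              # Overwrite the reward with the value the env stored once all
              # agents' actions for this timestep had been resolved.
              if isinstance(step_info, dict) and "corrected_reward" in step_info:
                  rewards[t] = step_info["corrected_reward"]

You would register it via e.g. config.callbacks(RewriteStaleTransitions). One reason this gets ugly: advantage postprocessing (e.g. GAE) has already run when this callback fires, so rewritten rewards will not automatically propagate into the advantages, and you would likely have to recompute those as well.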

An alternative that may suit your needs is to have the environment present agents’ observations cyclically, one agent at a time, so that you can resolve each agent’s action in order.
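
This works because RLlib only requests actions for agents whose IDs appear in the obs dict returned by step()/reset(). A rough, untested sketch of the pattern (the shared price state, the spaces, and the episode length are made up for illustration, and exact space handling can vary a bit by Ray version):

  import numpy as np
  import gymnasium as gym
  from ray.rllib.env.multi_agent_env import MultiAgentEnv


  class TurnBasedEnv(MultiAgentEnv):
      def __init__(self, config=None):
          super().__init__()
          self._turn_order = ["agent_0", "agent_1"]
          self._agent_ids = set(self._turn_order)
          self.observation_space = gym.spaces.Box(-np.inf, np.inf, (1,), np.float32)
          self.action_space = gym.spaces.Box(-1.0, 1.0, (1,), np.float32)
          self._price = 0.0
          self._turn = 0
          self._steps = 0

      def reset(self, *, seed=None, options=None):
          self._price, self._turn, self._steps = 0.0, 0, 0
          first = self._turn_order[self._turn]
          # Only the first agent receives an observation, so only it acts next.
          return {first: self._obs()}, {}

      def step(self, action_dict):
          # Exactly one agent acts per call: the one that was observed last step.
          acting_agent, action = next(iter(action_dict.items()))
          self._price += float(action[0])  # this agent's action mutates shared state
          self._steps += 1

          self._turn = (self._turn + 1) % len(self._turn_order)
          next_agent = self._turn_order[self._turn]

          obs = {next_agent: self._obs()}  # fresh, post-action observation
          rewards = {acting_agent: -abs(self._price)}  # toy reward
          terminateds = {"__all__": False}
          truncateds = {"__all__": self._steps >= 100}
          return obs, rewards, terminateds, truncateds, {}

      def _obs(self):
          return np.array([self._price], dtype=np.float32)

Each agent’s action is applied before the next agent is observed, so every policy computes its action from an up-to-date observation, which is essentially PettingZoo’s AEC pattern expressed directly in a MultiAgentEnv.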

Thanks for the response; unfortunately it confirms what I expected: this is going to be ugly.

Can you elaborate a bit on how one would go about the rewriting in on_postprocess_trajectory, and what exactly gets rewritten? I figure I’ll go with the cyclic approach, but it would be nice to have a cleaner way to handle this problem that is more “hidden” and more portable to other users and envs.