Multi-Agent cyclic games with paused agents

Hi, I’m trying to implement an env where there are 3 phases:

  1. Agent 1 goes for M steps
  2. Agent 2 goes for N steps
  3. End of episode, both agents get a reward

From the docs, I think it’s pretty clear that the way to specify which agent moves next is via the keys of the returned obs dict. For example, when the game has just started, only “agent1” is in the observation dict, and when we transition to the second phase, only “agent2” is in the observation dict. However, since the reward is given at the end for both agents, I have to return a reward dict containing both agent 1 and agent 2. How should I work around this?
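For concreteness, here is roughly the shape I have in mind (a minimal sketch with made-up names, using the old obs/rewards/dones/infos step signature; `M`, `N`, and the observations are placeholders):

```python
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class TwoPhaseEnv(MultiAgentEnv):
    """Sketch: agent1 acts for M steps, then agent2 for N, then both get rewards."""

    def __init__(self, config=None):
        super().__init__()
        self.M = 5  # placeholder number of agent1 steps
        self.N = 5  # placeholder number of agent2 steps
        self.t = 0

    def reset(self):
        self.t = 0
        # Only "agent1" is in the obs dict, so only agent1 acts next.
        return {"agent1": self._obs()}

    def step(self, action_dict):
        self.t += 1
        rewards = {aid: 0.0 for aid in action_dict}  # no reward until the end
        if self.t < self.M:
            # Phase 1: keep requesting actions from agent1 only.
            return {"agent1": self._obs()}, rewards, {"__all__": False}, {}
        if self.t < self.M + self.N:
            # Phase 2: switching the obs dict key hands control to agent2.
            return {"agent2": self._obs()}, rewards, {"__all__": False}, {}
        # Phase 3: end of episode -- this is the part I'm unsure about,
        # see the snippet below.
        ...

    def _obs(self):
        return 0.0  # placeholder observation
```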

My thought: if I return a dummy obs dict for agent 1 in the third phase, agent 1 should receive its reward properly. Will this produce any unwanted side effects?
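Concretely, continuing the sketch above, the final branch of `step()` would be something like this (the final rewards are placeholders):

```python
# Phase 3: end of episode. Give agent1 a dummy obs so its reward entry
# has a matching observation, and set "__all__" to True.
final_reward_1 = final_reward_2 = 1.0  # placeholder final rewards
obs = {"agent1": self._obs(), "agent2": self._obs()}
rewards = {"agent1": final_reward_1, "agent2": final_reward_2}
return obs, rewards, {"__all__": True}, {}
```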

Hi @Aceticia,

At the end of the episode, I think all you should have to do is return a dictionary with a reward for each agent.

If you look at this code here, especially lines 774-779, you will see that when the env returns all_done=True, RLlib will create an empty obs for each agent that is not in the final observation.
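To illustrate (a sketch with made-up names, not code from that file): the fill-in behavior means a final step like this should work, with agent1 missing from the obs dict but present in the rewards dict:

```python
# agent1 is deliberately omitted from the final obs dict; per the linked
# code, RLlib inserts an empty obs for it once "__all__" is True.
obs = {"agent2": last_obs_2}            # last_obs_2: placeholder
rewards = {"agent1": r1, "agent2": r2}  # r1, r2: placeholder final rewards
return obs, rewards, {"__all__": True}, {}
```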

Personally, I would do it myself in my env so that the semantics are really clear, but based on this code it should work fine either way.

@sven1977 or @gjoliver can you confirm?


That’s correct @mannyv 🙂 But yes, it’s always better to do this properly in the env. However, to add stability, we fixed this a while ago: before, RLlib would break if the env did not publish these obs at the end. You can also now publish rewards at any point for any agent (even ones that did not step) and RLlib will automatically sum up the recent rewards for these agents. This makes it easier to build turn-based game envs where agent A receives a reward as a result of agent B’s action, even though agent A did not do anything.
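For example (a made-up turn-based step, not code from RLlib): here only agent2 acts, but agent1 is rewarded for the outcome; RLlib buffers agent1’s reward and credits it at agent1’s next observation (or at the end of the episode):

```python
# Only agent2 acted this step, but its move determines agent1's payoff.
# agent1 gets a reward here even though it has no obs and did not step.
obs = {"agent2": next_obs_2}              # next_obs_2: placeholder
rewards = {"agent1": 1.0, "agent2": 0.0}  # reward agent1 without it acting
return obs, rewards, {"__all__": False}, {}
```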