Multi-Agent cyclic games with paused agents

Aceticia · September 26, 2021, 7:58pm

Hi, I’m trying to implement an env where there are 3 phases:

Agent 1 goes for M steps
Agent 2 goes for N steps
End of episode, both agents get a reward

From the docs, I think it’s pretty clear that the way to specify which agent moves next is through the keys of the returned obs dict. For example, when the game has just started, only “agent1” is in the observation dict, and when we transition to the second phase, only “agent2” is in the observation dict. However, since the reward is given to both agents at the end, I have to return a reward dict covering both agent 1 and agent 2, even though agent 1 is no longer in the obs dict at that point. How should I work around this?

My thought: If I return a dummy obs dict in the 3rd phase, agent 1 should receive the reward properly. Will this produce any unwanted side effects?

mannyv · September 26, 2021, 8:55pm

Hi @Aceticia,

At the end of the episode, I think all you have to do is return a dictionary with a reward for each agent.

If you look at this code here, especially lines 774-779, you will see that when the env returns dones["__all__"] = True, it creates an empty obs for each agent that is not in the final observation.

Personally, I would do it myself in my env so that the semantics were really clear but based on this code it should work fine either way.
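
A minimal sketch of the explicit approach, assuming the 2021-era MultiAgentEnv contract in which step() returns (obs, rewards, dones, infos) dicts keyed by agent ID. The class is plain Python so that it runs without Ray; in a real project it would presumably subclass ray.rllib.env.multi_agent_env.MultiAgentEnv and declare observation and action spaces. The names TwoPhaseEnv and rollout, the m/n parameters, the constant observation 0, and the terminal reward of 1.0 are all placeholders, not taken from the thread.

```python
class TwoPhaseEnv:
    """Agent 1 acts for m steps, agent 2 for n steps, then both are rewarded."""

    def __init__(self, m=3, n=2):
        self.m, self.n = m, n
        self.t = 0  # number of env steps taken in the current episode

    def reset(self):
        self.t = 0
        # Only agent1 is in the obs dict, so only agent1 is asked to act.
        return {"agent1": 0}

    def step(self, action_dict):
        self.t += 1
        if self.t < self.m:
            # Phase 1 continues: agent1 acts again.
            return {"agent1": 0}, {"agent1": 0.0}, {"__all__": False}, {}
        if self.t < self.m + self.n:
            # Phase 2: only agent2 is in the obs dict, so agent1 is paused.
            return {"agent2": 0}, {"agent2": 0.0}, {"__all__": False}, {}
        # Phase 3: episode over. Return a final obs and a reward for BOTH
        # agents explicitly instead of relying on the sampler to fill in
        # a fake last observation for agent1.
        obs = {"agent1": 0, "agent2": 0}
        rewards = {"agent1": 1.0, "agent2": 1.0}  # placeholder values
        dones = {"agent1": True, "agent2": True, "__all__": True}
        return obs, rewards, dones, {}


def rollout(env):
    """Drive one episode with dummy actions; return who acted and the last rewards."""
    obs = env.reset()
    turns, rewards, done = [], {}, False
    while not done:
        actions = {agent_id: 0 for agent_id in obs}  # act only for agents in obs
        turns.extend(actions)
        obs, rewards, dones, _ = env.step(actions)
        done = dones["__all__"]
    return turns, rewards
```

With m=3 and n=2, rollout(TwoPhaseEnv()) reports three turns for agent1, two for agent2, and a final reward dict containing both agents.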

@sven1977 or @gjoliver can you confirm?

github.com/ray-project/ray/blob/90d2456ec70270a1f894ec3ef6f3004533859e03/rllib/evaluation/sampler.py#L752-L777

          if dones[env_id]["__all__"] or episode.length >= horizon:
              hit_horizon = (episode.length >= horizon
                             and not dones[env_id]["__all__"])
              all_agents_done = True
              atari_metrics: List[RolloutMetrics] = _fetch_atari_metrics(
                  base_env)
              if atari_metrics is not None:
                  for m in atari_metrics:
                      outputs.append(
                          m._replace(custom_metrics=episode.custom_metrics))
              else:
                  outputs.append(
                      RolloutMetrics(episode.length, episode.total_reward,
                                     dict(episode.agent_rewards),
                                     episode.custom_metrics, {},
                                     episode.hist_data, episode.media))
              # Check whether we have to create a fake-last observation
              # for some agents (the environment is not required to do so if
              # dones[__all__]=True).
              for ag_id in episode.get_agents():

This file has been truncated.

sven1977 · September 27, 2021, 8:36am

That’s correct, @mannyv. But yes, it’s always better to do this properly in the env. To add stability, though, we fixed this a while ago; before that, RLlib would break if the env did not publish these obs at the end. You can also now publish rewards at any point for any agent (even the ones that did not step), and RLlib will automatically sum up the recent rewards for these agents. This makes it easier to build turn-based game envs where agent A receives a reward as a result of agent B’s action, even though agent A did not do anything.
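
To make the summing behaviour concrete, here is a toy model of it. This is not RLlib's sampler code; it is a hypothetical helper (credit_rewards is an invented name) that assumes rewards published for an agent are held and added together until that agent next appears in the obs dict.

```python
from collections import defaultdict


def credit_rewards(steps):
    """Toy model: sum rewards per agent until that agent next gets an obs.

    steps: iterable of (obs_agent_ids, rewards_dict), one entry per env step.
    Returns a list of (agent_id, summed_reward) in delivery order.
    """
    pending = defaultdict(float)
    delivered = []
    for obs_agent_ids, rewards in steps:
        # Rewards may name any agent, including ones that did not act.
        for agent_id, reward in rewards.items():
            pending[agent_id] += reward
        # An agent collects its accumulated reward when it is observed again.
        for agent_id in obs_agent_ids:
            delivered.append((agent_id, pending.pop(agent_id, 0.0)))
    return delivered


# Agent B acts twice while agent A is paused; A is rewarded along the way.
steps = [
    (["B"], {"A": 0.5}),
    (["B"], {"A": 0.25, "B": 1.0}),
    (["A"], {"A": 0.25}),
]
print(credit_rewards(steps))  # [('B', 0.0), ('B', 1.0), ('A', 1.0)]
```

Under this assumption, agent A ends up with a single reward of 1.0 covering everything published while it was paused.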
