Interaction between env and policy in multi agent environment

Blubberblub · November 27, 2021, 10:31am

Hey everyone,
I'm building a multi-agent environment based on the MultiAgentEnv class. For testing purposes I feed it some hardcoded sample action_dicts. Some of my agents are done early, so I was wondering how the policy changes the action_dict after it receives obs, rew, done, inf with some agents having done set to True. Are the agents that are done simply excluded from the next action_dict returned by the policy?

# action dict received by env
action_dict = {"a": 1, "b": 0, "c": 1}

# step return
obs = {"a": 99, "b": 123, "c": 99}
rew = {"a": 0, "b": 1, "c": 0}
done = {"a": True, "b": False, "c": False, "__all__": False}
inf = {}

# next action dict?
action_dict = {"b": 0, "c": 1}

Lars_Simon_Zehnder · November 27, 2021, 3:29pm

Hi @Blubberblub ,

without knowing exactly how the multi-agent case is handled, maybe take a look at the SyncSampler or AsyncSampler, at the line where the policy gets evaluated and returns the actions. The _env_runner() function is what actually performs the sampling. Set a breakpoint at that line and debug your program.

Hope this helps

mannyv · November 27, 2021, 3:39pm

Hi @Blubberblub

Yes, once an agent returns done it will no longer get an entry in the step return dictionaries. You will have to omit that agent's key in subsequent calls to step. The last observation you provide for it, on the step when it was done, will not be passed to the policy for actions.

An agent cannot be done and then come back to life.

At the very end of the episode all agents that are missing on that last time step will get a reward of 0.
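
As a rough illustration of this first approach, here is a toy environment written in plain Python. It is only a sketch: it does not import RLlib, `ToyMultiAgentEnv` and its `lifetimes` argument are made-up names, and a real environment would subclass MultiAgentEnv instead. It only shows that a finished agent appears one last time in the step return and is left out afterwards.

```python
class ToyMultiAgentEnv:
    """Hypothetical stand-in for a MultiAgentEnv subclass (no RLlib import)."""

    def __init__(self, lifetimes):
        # lifetimes: agent_id -> number of steps after which that agent is done
        self.lifetimes = dict(lifetimes)
        self.reset()

    def reset(self):
        self.t = 0
        self.live = set(self.lifetimes)
        return {agent_id: 0 for agent_id in self.live}

    def step(self, action_dict):
        # Only agents that are still alive may act.
        unknown = set(action_dict) - self.live
        assert not unknown, f"actions for finished agents: {unknown}"
        self.t += 1
        obs, rew, done = {}, {}, {}
        for agent_id, action in action_dict.items():
            obs[agent_id] = self.t
            rew[agent_id] = float(action)
            done[agent_id] = self.t >= self.lifetimes[agent_id]
        # Agents that finish on this step still appear in this return,
        # but are dropped from every later one.
        self.live -= {a for a, d in done.items() if d}
        done["__all__"] = not self.live
        return obs, rew, done, {}


env = ToyMultiAgentEnv({"a": 1, "b": 2, "c": 2})
env.reset()
obs, rew, done, info = env.step({"a": 1, "b": 0, "c": 1})
# "a" is done now, so the next action dict only covers "b" and "c".
next_actions = {agent_id: 0 for agent_id in obs if not done[agent_id]}
obs2, rew2, done2, info2 = env.step(next_actions)
```

With these lifetimes, "a" is reported as done on the first step and no longer shows up in anything returned by the second step.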

A lot of environments are designed to always provide every agent on every step. What they do in this case is include a noop action and when an agent is “done” they will provide an action mask indicating that agent may only select the noop action.

Either of these two approaches will work. Which one you choose is a design choice you get to make.
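
The second approach could look roughly like the sketch below. The names `NOOP`, `N_ACTIONS`, `masked_obs` and `allowed_actions` are invented for illustration, as are the `"obs"` and `"action_mask"` keys; the policy's model would still have to respect the mask, which is not shown here.

```python
NOOP = 0       # hypothetical index of the no-op action
N_ACTIONS = 3  # hypothetical size of the discrete action space


def masked_obs(raw_obs, agent_done):
    """Wrap an observation with a mask of the actions the agent may choose."""
    if agent_done:
        # A finished agent may only pick the no-op action.
        mask = [1 if a == NOOP else 0 for a in range(N_ACTIONS)]
    else:
        mask = [1] * N_ACTIONS
    return {"obs": raw_obs, "action_mask": mask}


def allowed_actions(observation):
    """List the action indices that the mask permits."""
    return [a for a, ok in enumerate(observation["action_mask"]) if ok]


# Every agent is present on every step, finished or not.
finished = {"a": True, "b": False, "c": False}
obs = {agent_id: masked_obs(99, is_done) for agent_id, is_done in finished.items()}
```

Here every agent keeps its key in the observation dict, and the finished agent "a" is simply restricted to the no-op.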

mannyv · November 27, 2021, 4:10pm

@Blubberblub,

Dealing with samples is quite complicated and a little convoluted in the code, but here is where it decides whether to pass the obs to the policy for actions:

https://github.com/ray-project/ray/blob/116bda8f05104353a7f95b18e70566a0b598bc2f/rllib/evaluation/sampler.py#L902-L910

          if not agent_done:
              item = PolicyEvalData(
                  env_id, agent_id, filtered_obs, agent_infos, None
                  if last_observation is None else
                  episode.rnn_state_for(agent_id), None
                  if last_observation is None else
                  episode.last_action_for(agent_id), rewards[env_id].get(
                      agent_id, 0.0))
              to_eval[policy_id].append(item)
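
Stripped of the RNN state and episode bookkeeping, the effect of that check can be paraphrased as below. This is a simplified sketch and not the RLlib code: `collect_for_eval` and `policy_for` are made-up names, and treating `__all__` as ending every agent is an assumption on my side.

```python
from collections import defaultdict


def collect_for_eval(obs, dones, policy_for):
    """Group the observations of agents that are not done by policy id."""
    to_eval = defaultdict(list)
    for agent_id, agent_obs in obs.items():
        # Assumption: "__all__" being True ends the episode for every agent.
        agent_done = dones.get("__all__", False) or dones.get(agent_id, False)
        if not agent_done:
            to_eval[policy_for(agent_id)].append((agent_id, agent_obs))
    return dict(to_eval)


obs = {"a": 99, "b": 123, "c": 99}
dones = {"a": True, "b": False, "c": False, "__all__": False}
to_eval = collect_for_eval(obs, dones, lambda agent_id: "shared_policy")
```

With the step return from the original question, only the observations of "b" and "c" end up queued for the policy, so only those two agents get an action in the next action_dict.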

Blubberblub · November 27, 2021, 5:09pm

Thanks a lot for the answers! That helped clarify how things work and provided a lot of additional insight. Thanks especially to @mannyv for giving hints on how to design these systems.
