Interaction between env and policy in a multi-agent environment

Hey everyone,
I’m building a multi-agent environment based on the MultiAgentEnv class. For testing purposes I provide it with some hardcoded sample action_dicts. Some of my agents are done early, so I was wondering how the policy changes the action_dict after it receives obs, rew, done, info with some agents having done set to True. Are the agents that are done simply excluded from the next action_dict returned by the policy?

# action dict received by env
action_dict = {"a": 1, "b": 0, "c": 1}

# step return
obs  = {"a": 99, "b": 123, "c": 99}
rew  = {"a": 0, "b": 1, "c": 0}
done = {"a": True, "b": False, "c": False, "__all__": False}
info = {}

# next action dict?
action_dict = {"b": 0, "c": 1}

Hi @Blubberblub ,

Without knowing exactly how the multi-agent case is handled, maybe take a look into the SyncSampler or AsyncSampler at the line where the policy gets evaluated and returns the actions. The _env_runner() function actually performs the sampling. Set a breakpoint at the line I referred to above and debug your program.

Hope this helps

Hi @Blubberblub

Yes, once an agent is done it will no longer get an entry in the step return dictionaries. You will have to omit that agent’s key in subsequent calls to step. The last observation you provide for it, on the step when it was done, will not be passed to the policy for actions.

An agent cannot be done and then come back to life.
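To illustrate that first approach, here is a minimal sketch in plain Python (no RLlib imports; the variable names and the placeholder action are assumptions, not RLlib internals) of how a done agent drops out of the per-step dictionaries:

```python
# Step t: agent "a" finishes on this step.
obs_t  = {"a": 99, "b": 123, "c": 99}
done_t = {"a": True, "b": False, "c": False, "__all__": False}

# Only agents that are NOT done get observations next step, so the policy
# is only queried for those agents.
live_agents = [aid for aid in obs_t if not done_t.get(aid, False)]

# The next action dict omits "a" entirely (0 is just a placeholder action).
next_action_dict = {aid: 0 for aid in live_agents}
# next_action_dict -> {'b': 0, 'c': 0}; "a" never appears again this episode
```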

At the very end of the episode all agents that are missing on that last time step will get a reward of 0.

A lot of environments are designed to always provide every agent on every step. What they do in this case is include a noop action and when an agent is “done” they will provide an action mask indicating that agent may only select the noop action.
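A hedged sketch of that second design, again in plain Python (NOOP, N_ACTIONS, and the mask layout are illustrative assumptions; the exact observation structure depends on your env and model):

```python
NOOP = 0       # assumed action id for "do nothing"
N_ACTIONS = 3  # assumed size of the discrete action space

def action_mask(agent_done: bool) -> list:
    """Return a per-action mask: 1 = allowed, 0 = forbidden."""
    if agent_done:
        # A done agent may only select the no-op action.
        return [1 if a == NOOP else 0 for a in range(N_ACTIONS)]
    return [1] * N_ACTIONS

# Every agent appears on every step; the mask rides along with the observation.
obs = {
    "a": {"obs": 99,  "action_mask": action_mask(agent_done=True)},
    "b": {"obs": 123, "action_mask": action_mask(agent_done=False)},
}
```

The model then has to read the mask and zero out (or heavily penalize) the logits of forbidden actions before sampling.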

Either of these two approaches will work. Which one you choose is a design choice you get to make.


Dealing with samples is quite complicated and a little convoluted in the code, but here is where it decides to pass the obs to the policy for actions.


Thanks a lot for the answers! That helps clarify how things work and provided a lot of additional insight. Thanks especially to @mannyv for giving hints on how to design these systems.