Asymmetric play multiagent environment

I am trying to implement a multi-agent environment and train it with RLlib’s scalable PPO algorithm. However, the way the environment steps does not match the MultiAgent environment pipeline. I would appreciate some help on how to set up training for this environment. The environment works as follows:

  • There are two agents A and B

  • Both agents are trained with PPO using the same model architecture, but with independent weights.

  • Agent A has to end its episode (step x number of times and generate a rollout buffer) before agent B starts its episode. When agent B’s episode ends, agent A goes again.

  • Agent B’s episode depends on agent A’s final state.

At first glance, A and B could each be defined as a single-agent environment. The issue is that B’s episode depends on A’s episode. Ideally, each worker would be assigned a pair of A and B environments and would collect rollout buffers for A and B asynchronously. Is this possible with RLlib?

I would model this as a multi-agent environment with two agents, A and B, where each agent has its own policy (i.e., weights).
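As a rough sketch of the "one policy per agent" part, the multi-agent section of the trainer config could look something like this. It assumes the Ray 1.x dict-style config, where each policy is a `(policy_cls, obs_space, act_space, config)` tuple and `None` for the class means the trainer's default policy. The names `build_multiagent_config`, `policy_A`, and `policy_B` are made up for illustration, and the spaces are whatever your environment defines.

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    # Map agent "A" -> "policy_A" and agent "B" -> "policy_B".
    # Extra positional/keyword arguments are accepted and ignored, since
    # the exact callback signature differs between RLlib versions.
    return f"policy_{agent_id}"


def build_multiagent_config(obs_space, act_space):
    """Hypothetical helper: two policies with identical specs but separate weights."""
    return {
        "multiagent": {
            "policies": {
                # (policy_cls, obs_space, act_space, per-policy config overrides)
                "policy_A": (None, obs_space, act_space, {}),
                "policy_B": (None, obs_space, act_space, {}),
            },
            "policy_mapping_fn": policy_mapping_fn,
        }
    }
```

Because both entries share the same spaces and model settings but are separate policy IDs, each one gets its own set of weights during training.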

Inside the environment, you’d keep track of which agent is currently active and make sure that the step function only returns the next observation for the agent that acts next.
Observations are a dict of agent ID → observation and, AFAIK, RLlib only requests actions from agents that have an observation. So if your environment only ever includes the observation of the agent that is up next (A or B), only that agent is queried for the next action.

That way, only one agent acts at a time, yet both agents keep separate policies and the environment can depend on both A and B.
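To make the turn-taking concrete, here is a minimal plain-Python sketch of such an environment. It only follows the dict conventions of the Ray 1.x multi-agent API (`reset()` returns an observation dict, `step()` returns observation, reward, done, and info dicts, with `"__all__"` marking the end of the joint episode). In a real setup you would subclass `ray.rllib.env.multi_agent_env.MultiAgentEnv` and define proper spaces. The class name `TurnBasedPairEnv`, the scalar state, and the toy reward are all assumptions for illustration, and I have not run this against RLlib itself.

```python
class TurnBasedPairEnv:
    """Toy env: A acts for `episode_len` steps, then B continues from A's final state."""

    def __init__(self, episode_len=3):
        self.episode_len = episode_len
        self.reset()

    def reset(self):
        self.active = "A"   # agent whose turn it is
        self.state = 0.0    # shared state; B inherits whatever A leaves behind
        self.t = 0          # steps taken in the current agent's phase
        # Only A gets an observation, so only A is asked for an action.
        return {"A": self.state}

    def step(self, action_dict):
        agent = self.active
        # Only the active agent should have been queried for an action.
        assert list(action_dict) == [agent], "unexpected agent acted"

        self.state += float(action_dict[agent])
        self.t += 1

        rewards = {agent: -abs(self.state)}  # toy reward, replace with your own
        dones = {"__all__": False}

        if self.t < self.episode_len:
            # Same agent keeps acting.
            obs = {agent: self.state}
        elif agent == "A":
            # A's phase is over: hand control to B, starting from A's final state.
            self.active = "B"
            self.t = 0
            obs = {"B": self.state}
        else:
            # B's phase is over: the joint episode ends.
            obs = {"B": self.state}
            dones["__all__"] = True

        return obs, rewards, dones, {}
```

With `episode_len=2`, the observation keys over one joint episode are `A`, `A`, `B`, `B`, `B`: A is observed at reset and after its first step, and B's first observation is A's final state.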

I believe you can then run multiple copies of this environment in parallel, where each copy handles its own pair of A and B agents. I’m not sure duplicating the environment always makes sense in terms of scaling, though; see RLlib Training APIs (Ray v1.9.1).


Thank you for your prompt and detailed response, Stefan! This worked perfectly, and there was no need to duplicate the environment.