I am trying to implement a multi-agent environment and train it with RLlib's scalable PPO algorithm. However, the stepping of my environment does not follow RLlib's usual MultiAgentEnv pattern, where any subset of agents can act in the same step. I would appreciate some assistance on how to set up training for this environment. It works as follows (a sketch of what I have in mind follows the list):
- There are two agents, A and B.
- Both agents are trained with PPO using the same model architecture but with independent weights.
- Agent A has to finish its episode (stepping x times and generating a rollout buffer) before agent B starts its episode. When agent B's episode ends, agent A goes again.
- Agent B's episode starts from agent A's final state.
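To make the intended stepping concrete, here is a minimal sketch of how I imagine this as a single RLlib `MultiAgentEnv`, where only the currently active agent appears in the step dicts. `TurnBasedEnv`, the spaces, `episode_len`, and the dummy `_state()` are placeholders I made up, not my real environment:

```python
# Minimal sketch, not my real environment: A runs a full sub-episode,
# then B starts from A's final state; reset() starts A again.
import gymnasium as gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TurnBasedEnv(MultiAgentEnv):
    def __init__(self, config=None):
        super().__init__()
        self._agent_ids = {"agent_A", "agent_B"}
        # Placeholder spaces; the real env would define its own.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (4,), np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.episode_len = 10  # placeholder for the "x steps" per agent

    def reset(self, *, seed=None, options=None):
        self.active, self.t = "agent_A", 0
        self.a_final_state = None
        # Only the active agent is in the dicts, so RLlib queries only its policy.
        return {self.active: self._state()}, {}

    def step(self, action_dict):
        self.t += 1
        obs = {}
        rewards = {self.active: 0.0}  # placeholder reward
        terminateds, truncateds = {"__all__": False}, {"__all__": False}

        if self.active == "agent_A" and self.t >= self.episode_len:
            # Close A's trajectory with its final observation ...
            self.a_final_state = self._state()
            obs["agent_A"] = self.a_final_state
            terminateds["agent_A"] = True
            # ... and hand that same state to B as its first observation.
            self.active, self.t = "agent_B", 0
            obs["agent_B"] = self.a_final_state
        elif self.active == "agent_B" and self.t >= self.episode_len:
            # B done -> whole episode ends; the next reset() starts A again,
            # which gives the A -> B -> A -> ... cycle.
            obs["agent_B"] = self._state()
            terminateds["agent_B"] = True
            terminateds["__all__"] = True
        else:
            obs[self.active] = self._state()

        return obs, rewards, terminateds, truncateds, {}

    def _state(self):
        # Stand-in for the real environment state/observation.
        return np.random.uniform(-1.0, 1.0, size=(4,)).astype(np.float32)
```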
Initially it looks like A and B could each be defined as a single-agent environment. The issue is that B's episode depends on A's episode. Ideally, each worker would be assigned a pair of A and B environments and would collect rollout buffers for A and B asynchronously. Is this possible with RLlib?
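If it is, I imagine the training setup would look roughly like the sketch below, where two separate policies give the shared architecture independent weights, and each rollout worker holds its own copy of the paired environment. The policy ids and worker count are my assumptions, and the builder methods may differ by Ray version (e.g., `.rollouts()` vs. `.env_runners()`); `TurnBasedEnv` refers to the sketch above:

```python
# Hedged sketch of a two-policy PPO setup; policy ids, worker count, and
# builder-method names are assumptions that may vary with the Ray version.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(TurnBasedEnv)  # the paired A/B env sketched above
    .multi_agent(
        # Two distinct policies -> same model architecture, independent weights.
        policies={"policy_A", "policy_B"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: (
            "policy_A" if agent_id == "agent_A" else "policy_B"
        ),
    )
    # Each rollout worker gets its own copy of the paired environment,
    # so one worker effectively serves one A/B pair.
    .rollouts(num_rollout_workers=2)
)

algo = config.build()
for _ in range(5):
    print(algo.train())
```

Since only one agent ever appears in a given step dict, my understanding is that each policy would only be queried and trained on its own turns, which is the behavior I am after.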