Vectorized multi-agent setup

It seems like the multi-agent architecture in RLlib expects MultiAgentDicts for the observations, dones, infos, and rewards.

Are there plans to support a vectorized version of this, such that instead of Dict[agent_id, np.ndarray] we simply have np.ndarrays whose first dimension is assumed to be the agent dimension? Of course, one constraint we can impose is that all agents share the same underlying policy.
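For concreteness, here is a minimal sketch of the two conventions side by side (the agent names and values are made up for illustration):

```python
import numpy as np

# Today's RLlib multi-agent convention: one dict entry per agent ID.
obs_dict = {
    "agent_0": np.array([0.1, 0.2]),
    "agent_1": np.array([0.3, 0.4]),
    "agent_2": np.array([0.5, 0.6]),
}

# Proposed vectorized convention: a single array whose first axis is the
# agent axis. This only works because all agents share the same obs space
# (and, by assumption, the same policy).
obs_batch = np.stack(list(obs_dict.values()))  # shape (3, 2)
```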

Hey @richard, thanks for sharing this idea. No, we have not thought about a setup like this, where the agent dimension is “just another dim”, like batch or time. I’m guessing this would be very useful for large numbers of agents sharing the same obs/action spaces or - better - policies.

Yes, exactly. I’m trying to spec out whether there are any gotchas in implementing this: do you happen to have an idea of how involved it would be?

I think this could be quite easy, actually.
You would have to let RLlib know via some config flag that the env returns agent-batches instead of per-agent dicts keyed by agent ID. Each item’s index in the returned np.array would then correspond to that agent’s ID:

The env would do (obs space=Discrete(2)):
obs = np.array([0, 1, 1, 0])
return obs, rew, ...

And RLlib would interpret this as:

{0: 0, 1: 1, 2: 1, 3: 0}, where the keys are the agents' implicit IDs.
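That implicit mapping is just an enumeration of the array’s first axis; a sketch of the idea (illustrative, not actual RLlib code):

```python
import numpy as np

def to_implicit_agent_dict(obs_batch):
    """Interpret the first axis of an agent-batched array as implicit
    integer agent IDs, producing the dict form RLlib uses today."""
    return {agent_id: int(ob) for agent_id, ob in enumerate(obs_batch)}

# obs space = Discrete(2), four agents:
obs = np.array([0, 1, 1, 0])
mapping = to_implicit_agent_dict(obs)  # {0: 0, 1: 1, 2: 1, 3: 0}
```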

We would - I think - only have to change the _env_runner generator in rllib/evaluation/ to interpret raw observations from the env as implicit agent-wise batches; that’s all. Everything else would stay the same (batched forward pass to calculate actions). Also, we would NOT(!) have to re-write the produced actions into a dict anymore before sending them to the env; we could leave the batched action computations as is, since the env would probably want the actions as np.arrays as well.
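Until such a flag exists, the dict-to-array translation can also live in a thin wrapper at the env boundary. Below is a rough sketch of that idea; the class and the dummy env are hypothetical and not part of RLlib:

```python
import numpy as np

class AgentBatchedEnvAdapter:
    """Wrap an env that speaks agent-batched arrays so it exposes the
    per-agent-dict interface current multi-agent setups expect.
    Purely illustrative; names are made up."""

    def __init__(self, batched_env, num_agents):
        self.env = batched_env
        self.num_agents = num_agents

    def reset(self):
        obs_batch = self.env.reset()  # shape (num_agents, ...)
        return {i: obs_batch[i] for i in range(self.num_agents)}

    def step(self, action_dict):
        # Re-batch the per-agent action dict into one array for the env ...
        actions = np.stack([action_dict[i] for i in range(self.num_agents)])
        obs, rew, done, info = self.env.step(actions)
        # ... and fan the batched results back out into per-agent dicts.
        obs_d = {i: obs[i] for i in range(self.num_agents)}
        rew_d = {i: float(rew[i]) for i in range(self.num_agents)}
        done_d = {i: bool(done[i]) for i in range(self.num_agents)}
        done_d["__all__"] = all(done_d[i] for i in range(self.num_agents))
        return obs_d, rew_d, done_d, {i: {} for i in range(self.num_agents)}

# Tiny demo env speaking the batched convention (4 agents, Discrete(2) obs).
class DummyBatchedEnv:
    def reset(self):
        return np.array([0, 1, 1, 0])

    def step(self, actions):
        return (np.array([0, 1, 1, 0]), np.ones(4),
                np.zeros(4, dtype=bool), {})

env = AgentBatchedEnvAdapter(DummyBatchedEnv(), num_agents=4)
first_obs = env.reset()
```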