I believe you just need to assign a different policy to each of the agents:
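# Each entry is (policy_class, obs_space, action_space, config);
# None for policy_class means RLlib uses the algorithm's default policy.
policies = {
    'policy_1': (None, obs_space_1, action_space_1, {}),
    'policy_2': (None, obs_space_2, action_space_2, {}),
    ...
}

# Map each agent id to the name of the policy that controls it.
def policy_mapping_fn(agent_id):
    if agent_id == 'agent_1':
        return 'policy_1'
    elif agent_id == 'agent_2':
        return 'policy_2'
    ...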
This allows the different agents in your simulation to be controlled by different policies, and each agent's rollout fragments are used to train only the policy it is mapped to. Here is a full example using Abmarl.
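For reference, here is a minimal sketch of how the two pieces might be wired into a dict-style RLlib config. The spaces are hypothetical string stand-ins (in practice they would be your real observation and action spaces), the agent ids are assumed, and the `"multiagent"` key reflects the older dict-based config; newer RLlib releases use a config builder object instead, so check the version you are on.

```python
# Hypothetical stand-ins for the real spaces (normally gym.spaces objects).
obs_space_1, action_space_1 = "obs_space_1", "action_space_1"
obs_space_2, action_space_2 = "obs_space_2", "action_space_2"

# One policy spec per agent type: (policy_class, obs_space, action_space, config).
policies = {
    "policy_1": (None, obs_space_1, action_space_1, {}),
    "policy_2": (None, obs_space_2, action_space_2, {}),
}

# Assumed agent ids; a lookup table is easier to extend than an if/elif chain.
AGENT_TO_POLICY = {"agent_1": "policy_1", "agent_2": "policy_2"}


def policy_mapping_fn(agent_id, *args, **kwargs):
    # Extra arguments are accepted and ignored, since some RLlib versions
    # pass more than just the agent id to this function.
    return AGENT_TO_POLICY[agent_id]


# Dict-style multi-agent section of the trainer config (older RLlib API).
config = {
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
    },
}

# Sanity check: every agent must map to a policy that was actually declared.
for agent_id in AGENT_TO_POLICY:
    assert policy_mapping_fn(agent_id) in policies
```

You would then pass this config, together with your environment and any other settings, to whichever trainer you are using. The sanity check at the end is worth keeping, because a typo in an agent id or policy name otherwise only shows up once rollouts start.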
I am also a MARL researcher. I have been using RLlib for the past three years, and it has made my work significantly easier. It is designed to handle multi-agent simulations, so most of what you want to do is already available. Please feel free to reach out directly if you have any MARL questions.