MultiAgentEnv reward and terminated / truncated

Suppose I have a custom cooperative MultiAgentEnv where agents appear and disappear during the episode, but all are collectively trying to achieve some goal, which is rewarded at the very end of the episode (or midway through as -1000 if all agents die before the time limit).

I have 4 related questions:

  1. Is it essential to set _agent_ids on the environment in __init__? I don’t know the total number of agents that might appear in an episode up front (it’s different each time, as they spawn), but they all use the same policy, so I presume I don’t need to - I can just map them all to the same policy in the config of the RL algorithm I choose. Is that correct? In what circumstance would you ever need to set _agent_ids?
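For context, this is the kind of setup I have in mind - a minimal sketch (the policy id "shared_policy" and the agent id naming are my own placeholders) where the mapping function simply ignores the agent id, so the full set of agent ids never needs to be known in advance:

```python
# Sketch: every dynamically spawned agent maps to one shared policy.
def policy_mapping_fn(agent_id, episode=None, **kwargs):
    # "unit_0", "unit_17", ... all map to the same policy id.
    return "shared_policy"

# I'd then wire this up in the algorithm config, roughly:
# config.multi_agent(
#     policies={"shared_policy"},
#     policy_mapping_fn=policy_mapping_fn,
# )
```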

  2. An individual agent might die (i.e. terminated = True) before the episode ends, but then only receive its reward at episode end, depending on the success or failure of the team as a whole. Am I ok to still set the overall reward for that agent at episode end, long after it has died (i.e. at a later timestep than the one where I set terminated=True for it)? If so, what is the purpose of setting terminated=True for an individual agent - how does it affect the RL algorithm’s learning process?

  3. Do I need to keep sending {unit_x: {terminated: True}} from the step function at every timestep after the agent has died, or is once enough, at the timestep on which it dies?
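To make question 3 concrete, here is what I currently do around an agent death (agent ids and observation values are made up for illustration) - terminated is reported once, then the dead agent simply disappears from all subsequent dicts:

```python
# Timestep t: "unit_2" dies, so I flag it terminated exactly once.
obs_t = {"unit_0": [0.1], "unit_1": [0.2], "unit_2": [0.3]}
terminateds_t = {"unit_2": True, "__all__": False}

# Timestep t+1: "unit_2" is omitted from every returned dict.
obs_t1 = {"unit_0": [0.4], "unit_1": [0.5]}
terminateds_t1 = {"__all__": False}
```

Is that the intended pattern, or must terminated=True be repeated?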

  4. The docs here suggest that the tuple returned by step needs to have obs, reward, terminated and truncated entries for every ‘ready agent’. However, I think I’m right in saying that the keys of these dicts need not be the same - you should, for example, return observations for agents that need to act next move (i.e. ready agents), but the rewards can be for a totally different set of agents (i.e. agents that need not be ‘ready’, as described here). Is this correct? If so, are the docs incorrect?
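Combining questions 2 and 4, this is the final-step return I’d like to be allowed to produce (again a made-up sketch): observations only for agents still acting, while the team reward also goes to "unit_3", which terminated several timesteps earlier, so the reward keys are deliberately not a subset of the obs keys:

```python
# Hypothetical final step of a successful episode.
obs = {"unit_0": [0.0], "unit_1": [0.0]}            # only still-acting agents
rewards = {"unit_0": 100.0, "unit_1": 100.0,
           "unit_3": 100.0}                          # includes long-dead unit_3
terminateds = {"unit_0": True, "unit_1": True, "__all__": True}
truncateds = {"__all__": False}
infos = {}
# return obs, rewards, terminateds, truncateds, infos
```

Is a return shaped like this legal?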

Thanks in advance for your help!