I have a question about the MultiAgentEnv: in the latest version it returns terminated and truncated dictionaries. How should a two-agent environment handle the case where one agent is terminated and the other is truncated but not yet terminated (e.g. it reached the time limit before being done)? Is the following correct in this case?
terminateds={"agent1": True, "agent2": True, "__all__": True}
truncateds={"agent1": False, "agent2": True, "__all__": False}
@Fady_B good question. Following the long discussion that led to the new return values in gymnasium here: at best, truncated=True comes together with terminated=True, because even though you stop due to a time limit being reached (for example), this is still a form of termination with no loss of information (that I can think of) up to this point in time.
Ok, but this is not the case for the gymnasium single-agent case. There, a truncated agent that has not yet reached a terminal state would have terminated=False and truncated=True, which should allow the value function approximation to still use a bootstrapped value as usual for the final state (since truncated=True indicates this final state is not a terminal state of the underlying MDP).
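As a concrete single-agent illustration (just a minimal sketch using gymnasium's TimeLimit wrapper, not RLlib code):

import gymnasium as gym

# 10-step time limit on top of CartPole: if the limit is hit before the pole
# falls, terminated stays False and truncated becomes True, so the final
# state may still be bootstrapped.
env = gym.wrappers.TimeLimit(gym.make("CartPole-v1"), max_episode_steps=10)
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(terminated, truncated)  # e.g. False True if the time limit cut the episode off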
But I was wondering how RLlib deals with this in the multi-agent case. So just to confirm: for the case stated in the original question,
terminateds={"agent1": True, "agent2": True, "__all__": True}
truncateds={"agent1": False, "agent2": True, "__all__": False}
is correct, and this would mean "agent2" would use a bootstrapped value for its final state, whereas "agent1" would not?
@Fady_B, good question! I had to look it up in the code. How RLlib treats this is: agents can terminate and/or truncate individually, and so can the env itself (that is the __all__ key). Meaning individual agents could still be neither terminated nor truncated, but if the env terminates, the episode is considered completed and each agent's terminated is set to the value of __all__ (the same holds true for each agent's truncated). In this case the environment also does not need to provide agents with a last observation; instead RLlib will use a fake last observation (sampled from the observation space). In case of batch_mode="complete_episodes" a MultiAgentSampleBatch is generated (containing the results from this episode).
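For completeness, a config sketch of the batch_mode part ("my_two_agent_env" is just a placeholder env id that would need to be registered first, e.g. via ray.tune.register_env):

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: "my_two_agent_env" is a hypothetical, pre-registered env id.
config = (
    PPOConfig()
    .environment(env="my_two_agent_env")
    .rollouts(batch_mode="complete_episodes")
)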
Now in regard to your question about bootstrapping values: RLlib uses postprocessing for each single agent, i.e. a single-agent batch is passed to the postprocessing function, and RLlib considers therein only the terminateds, not the truncateds. Therefore, if your agent has terminated, no bootstrapping is used and the bootstrap value for the last state is instead set to 0.0, whereas in case of terminated=False the value function is used for bootstrapping. This is at least the case for PG/PPO.
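In pseudocode, the rule boils down to roughly this (a simplified sketch, not RLlib's actual postprocessing code):

def last_bootstrap_value(terminated, last_obs, value_fn):
    # Only `terminated` decides whether the final state gets bootstrapped.
    if terminated:
        # True terminal state of the MDP: no future rewards to account for.
        return 0.0
    # Episode merely cut off (e.g. truncated): use the value function's
    # estimate of the last observation as the bootstrap value.
    return value_fn(last_obs)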
Ok thanks. So if I understood correctly, if agent1 has terminated and agent2 has not terminated but has been truncated due to a time limit, then the correct return from the environment would be
terminateds={"agent1": True, "agent2": False, "__all__": False}
truncateds={"agent1": True, "agent2": True, "__all__": True}
and internally RLlib would end the episode and use a bootstrapped value only for agent2?
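Or, spelled out as a quick check (applying the "only the terminateds matter" rule from your previous post):

terminateds = {"agent1": True, "agent2": False, "__all__": False}
truncateds = {"agent1": True, "agent2": True, "__all__": True}

for agent in ("agent1", "agent2"):
    print(agent, "-> bootstrap final state:", not terminateds[agent])
# agent1 -> bootstrap final state: False
# agent2 -> bootstrap final state: True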
@Fady_B this looks good to me. Probably it does not make a difference, but you could also set the "truncated" of agent1 to False, as it did not hit the time limit.
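I.e., something like this (sketch) should work just as well:

terminateds = {"agent1": True, "agent2": False, "__all__": False}
truncateds = {"agent1": False, "agent2": True, "__all__": True}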