It seems likely that this is a known limitation of turn-based multi-agent environments in Ray RLlib: because only one agent acts per environment step, the value returned by get_infos(-1) inside on_episode_step may not yet reflect the latest info dict for every agent. This behavior is consistent with how RLlib distinguishes environment steps from agent steps, and similar issues have been reported by other users, especially in turn-based or partially observed settings. Since the info is correctly available in on_episode_end, the data does get synchronized eventually, just not always by the time on_episode_step fires for the last environment step.
My understanding is that this is not explicitly documented as a bug, but it is a limitation of the current implementation, particularly for turn-based environments where not all agents act on every step. There are open discussions and related issues about info and observation synchronization in such cases, and reading the final infos in on_episode_end is the commonly used workaround. If this behavior is blocking or confusing, it may be worth opening a GitHub issue to clarify or request a fix, as the current design may not fully support all turn-based multi-agent use cases out of the box. Would you like more detail on the underlying cause or possible workarounds?
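A minimal sketch of the on_episode_end workaround, with the usual caveats: the callback method names and the episode.get_infos(-1) call follow the RLlib APIs discussed above, but the _FakeEpisode class below is a hypothetical stand-in used only so the snippet runs without a full RLlib setup — it is not an RLlib class.

```python
class LastInfoCallback:
    """Collects each agent's final info dict at episode end.

    In turn-based settings, calling episode.get_infos(-1) from
    on_episode_step can miss the latest info for agents that did not
    act on that env step, so we defer the read to on_episode_end,
    where the episode data is fully synchronized.
    """

    def on_episode_step(self, *, episode, **kwargs):
        # Intentionally not reading infos here (see limitation above).
        pass

    def on_episode_end(self, *, episode, **kwargs):
        # get_infos(-1) returns the last info dict per agent.
        self.last_infos = episode.get_infos(-1)


# Hypothetical stand-in for MultiAgentEpisode, for demonstration only:
# it simply returns a stored per-agent info mapping.
class _FakeEpisode:
    def __init__(self, infos):
        self._infos = infos

    def get_infos(self, indices):
        return self._infos  # real RLlib indexes into per-agent buffers


cb = LastInfoCallback()
cb.on_episode_end(episode=_FakeEpisode({"agent_0": {"score": 3}}))
print(cb.last_infos["agent_0"]["score"])  # prints 3
```

In a real run you would pass the callback class to your AlgorithmConfig's callbacks setting and let RLlib invoke these hooks with the actual MultiAgentEpisode.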
Sources:
- Observation and info out of sync
- MultiAgentEpisode.get_infos() implementation
- MultiAgentEpisode.get_infos() tests