MultiAgent setup involving multiple agents.
Yes, hierarchical RL using a single MultiAgentEpisodeReplayBuffer is possible. The buffer is designed to store episodes from multi-agent environments, including hierarchical setups, by keeping track of experiences for each agent and policy separately, and can handle complex agent-policy mappings as in hierarchical RL scenarios. See MultiAgentEpisodeReplayBuffer source and official docs.
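As a rough sketch of the agent-to-policy mapping mentioned above: the agent IDs (`manager`, `worker_1`, `worker_2`) and policy IDs (`manager_policy`, `worker_policy`) are hypothetical, and the `(agent_id, episode, **kwargs)` signature is assumed to match what RLlib accepts for `policy_mapping_fn`. Treat it as an illustration, not a definitive configuration.

```python
def policy_mapping_fn(agent_id, episode=None, **kwargs):
    """Map each agent ID to the ID of the policy that controls it."""
    if agent_id == "manager":
        # Hypothetical policy ID for the high-level agent.
        return "manager_policy"
    if agent_id.startswith("worker_"):
        # All workers share one hypothetical low-level policy.
        return "worker_policy"
    raise ValueError(f"Unknown agent ID: {agent_id!r}")


# Resolve the mapping for the three agents used in this thread.
mapping = {a: policy_mapping_fn(a) for a in ["manager", "worker_1", "worker_2"]}
print(mapping)
```

A function like this would typically be passed to the algorithm config through `config.multi_agent(policies=..., policy_mapping_fn=policy_mapping_fn)`; whether the workers should share a policy is a design choice that depends on your setup.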
Would you like more detail on how to configure or use this buffer in a hierarchical RL setup?
Sources:
Hint: Mention @Herald in the post for followups.
You can implement this by designing your environment’s step() method to return observations only for the agents expected to act next. For example, at t1, return observations for both the Manager and the Workers; at t2–t4, only for the Workers; at t5, again for the Manager and the Workers; and so on. This pattern is supported by RLlib’s multi-agent API and is compatible with a single MultiAgentEpisodeReplayBuffer as used in MASAC. See this discussion and test_multi_agent_episode.py for reference.
Here’s a minimal dummy environment step logic:
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class HierarchicalEnv(MultiAgentEnv):
    def __init__(self):
        super().__init__()
        self.t = 0
        # 0-based step indices for t1, t5, t9. Index 0 is covered by reset().
        self.manager_steps = {0, 4, 8}
        # 0-based step indices for t1 through t13.
        self.worker_steps = set(range(13))
        self.agents = ["manager", "worker_1", "worker_2"]

    def reset(self):
        self.t = 0
        # All agents receive an initial observation at t1.
        return {"manager": 0, "worker_1": 0, "worker_2": 0}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        self.t += 1
        # The manager is observed (and so acts next) only at its scheduled steps.
        if self.t in self.manager_steps:
            obs["manager"] = self.t
            rew["manager"] = 0
            done["manager"] = False
            info["manager"] = {}
        # Workers are observed at every step.
        if self.t in self.worker_steps:
            for wid in ["worker_1", "worker_2"]:
                obs[wid] = self.t
                rew[wid] = 0
                done[wid] = False
                info[wid] = {}
        # End the episode once t13 (index 12) is reached.
        done["__all__"] = self.t >= 12
        return obs, rew, done, info
This structure ensures only the relevant agents act at each timestep, matching your hierarchical RL timing.
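To make the timing concrete, the schedule above can be printed without RLlib at all. This is a minimal sketch: the helper `agents_to_observe` and the constants are hypothetical names introduced here, and the step indices simply mirror `manager_steps` and `worker_steps` from the dummy environment.

```python
MANAGER_STEPS = {0, 4, 8}   # 0-based indices for t1, t5, t9
NUM_STEPS = 13              # t1..t13, stored 0-based as 0..12
WORKERS = ["worker_1", "worker_2"]


def agents_to_observe(t):
    """Return the agents that should receive an observation at 0-based step t."""
    agents = list(WORKERS)  # workers are observed at every step
    if t in MANAGER_STEPS:
        agents.insert(0, "manager")  # the manager joins only at its scheduled steps
    return agents


schedule = {t: agents_to_observe(t) for t in range(NUM_STEPS)}
for t, agents in schedule.items():
    print(f"t{t + 1}: {agents}")
```

Printing the schedule this way is a cheap check that the environment's observation dict will contain the agents you expect before you debug anything inside RLlib.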
Certainly! To use terminated and truncated (as required by RLlib’s Gymnasium-style API), return both per-agent dicts and always include the "__all__" key. Here’s the modified dummy environment step:
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class HierarchicalEnv(MultiAgentEnv):
    def __init__(self):
        super().__init__()
        self.t = 0
        # 0-based step indices for t1, t5, t9. Index 0 is covered by reset().
        self.manager_steps = {0, 4, 8}
        # 0-based step indices for t1 through t13.
        self.worker_steps = set(range(13))
        self.agents = ["manager", "worker_1", "worker_2"]

    def reset(self, *, seed=None, options=None):
        self.t = 0
        # Return (observations, infos); all agents are observed at t1.
        return {"manager": 0, "worker_1": 0, "worker_2": 0}, {}

    def step(self, action_dict):
        obs, rew, terminated, truncated, info = {}, {}, {}, {}, {}
        self.t += 1
        # The manager is observed (and so acts next) only at its scheduled steps.
        if self.t in self.manager_steps:
            obs["manager"] = self.t
            rew["manager"] = 0
            terminated["manager"] = False
            truncated["manager"] = False
            info["manager"] = {}
        # Workers are observed at every step.
        if self.t in self.worker_steps:
            for wid in ["worker_1", "worker_2"]:
                obs[wid] = self.t
                rew[wid] = 0
                terminated[wid] = False
                truncated[wid] = False
                info[wid] = {}
        # End the episode once t13 (index 12) is reached.
        terminated["__all__"] = self.t >= 12
        truncated["__all__"] = False
        return obs, rew, terminated, truncated, info
This follows the required API: always return both terminated and truncated dicts with "__all__" keys, as shown in test_multi_agent_env.py.
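A small sanity check can catch a step() that still returns the old 4-tuple or forgets the "__all__" key. The helper below, `check_step_output`, is a hypothetical utility written for this thread, not part of RLlib; it only verifies the two format rules stated above.

```python
def check_step_output(result):
    """Raise ValueError if a step() result breaks the 5-tuple format rules."""
    if len(result) != 5:
        raise ValueError(f"expected 5 return values, got {len(result)}")
    obs, rew, terminated, truncated, info = result
    # Both flag dicts must carry the episode-level "__all__" entry.
    for name, flags in (("terminated", terminated), ("truncated", truncated)):
        if "__all__" not in flags:
            raise ValueError(f"{name} dict is missing the '__all__' key")
    return True


# A well-formed result for a step where only worker_1 is observed.
good = ({"worker_1": 1}, {"worker_1": 0.0}, {"__all__": False}, {"__all__": False}, {})
# An old-style 4-tuple with a single "done" dict.
old_style = ({"worker_1": 1}, {"worker_1": 0.0}, {"__all__": False}, {})

print(check_step_output(good))
try:
    check_step_output(old_style)
except ValueError as err:
    print(f"rejected: {err}")
```

You could call a check like this on every `env.step(...)` result in a quick manual rollout before handing the environment to RLlib.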
This error means that when calling concat_episode, the current episode’s timestep (self.t) does not match the next episode’s starting timestep (other.t_started). This usually happens if episodes are not split or continued at the correct boundaries, or if your environment does not provide agent actions/observations in the expected sequence. See single_agent_episode.py source and related discussion.
Would you like more detail on how to debug or fix this in your environment?
This error typically occurs when the episode chunks for an agent are not aligned correctly—meaning the end timestep of one chunk (self.t) does not match the start timestep of the next chunk (other.t_started). This can happen if your environment skips agent actions/observations at certain timesteps or if the episode splitting logic is inconsistent. According to this Ray issue, you must ensure that for each agent, the episode chunks are concatenated only when self.t == other.t_started.
To fix this, check that your environment always provides agent data in a way that the timesteps line up across episode boundaries. If your agents act at different frequencies (as in hierarchical RL), make sure that the episode splitting and concatenation logic accounts for “hanging” actions or observations, and that the MultiAgentEpisode and SingleAgentEpisode objects are constructed with consistent t and t_started values for each agent. See the robust test and fix approach in PR #59895.
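The alignment rule can be illustrated with a toy model. `ToyChunk` below is a hypothetical stand-in written for this explanation, not RLlib's SingleAgentEpisode; it only reproduces the condition that two chunks may be joined when `self.t == other.t_started`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    """Toy stand-in for one agent's episode chunk (not RLlib's class)."""
    t_started: int
    rewards: list = field(default_factory=list)

    @property
    def t(self):
        # Current timestep = starting timestep + number of recorded steps.
        return self.t_started + len(self.rewards)

    def concat(self, other):
        # Chunks may only be joined when they are contiguous in time.
        if self.t != other.t_started:
            raise ValueError(
                f"cannot concat: self.t={self.t} != other.t_started={other.t_started}"
            )
        self.rewards.extend(other.rewards)


first = ToyChunk(t_started=0, rewards=[0.0, 0.0, 0.0])
aligned = ToyChunk(t_started=3, rewards=[1.0])      # starts where `first` ends
misaligned = ToyChunk(t_started=5, rewards=[1.0])   # leaves a gap

first.concat(aligned)
print(first.t)
try:
    first.concat(misaligned)
except ValueError as err:
    print(err)
```

In this toy version a gap between chunks triggers the same kind of mismatch the error message describes, which is why logging each agent's `t` and `t_started` around the failing call is usually a useful first debugging step.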
You do not need to manually accumulate Worker rewards; RLlib automatically accumulates rewards for each agent between their actions, including in hierarchical RL. Any rewards given to an agent when it is not acting are stored and summed until the agent acts again, at which point the total is attributed to the last action. See multi-agent reward handling discussion and multi-agent episode code.
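As a rough illustration of the accumulation described above, here is a toy model. `ToyRewardAccumulator` is a hypothetical class written for this thread and does not reproduce RLlib's internal code; it only shows rewards being summed while an agent is not observed and credited once it is observed again.

```python
from collections import defaultdict


class ToyRewardAccumulator:
    """Toy illustration of summing rewards between an agent's observations."""

    def __init__(self):
        self.pending = defaultdict(float)   # rewards waiting for the agent's next observation
        self.credited = defaultdict(list)   # one summed reward per completed action

    def step(self, observed_agents, rewards):
        # Every reward is first added to the receiving agent's pending total.
        for agent, r in rewards.items():
            self.pending[agent] += r
        # When an agent is observed again, its pending total is credited
        # to its previous action and the pending entry is cleared.
        for agent in observed_agents:
            self.credited[agent].append(self.pending.pop(agent, 0.0))


acc = ToyRewardAccumulator()
# The manager receives rewards at every step but is only observed on the third.
acc.step(["worker_1"], {"worker_1": 1.0, "manager": 0.5})
acc.step(["worker_1"], {"worker_1": 1.0, "manager": 0.5})
acc.step(["manager", "worker_1"], {"worker_1": 1.0, "manager": 0.5})
print(dict(acc.credited))
```

Here the worker is credited one reward per step, while the manager's three rewards are summed into a single value attached to its previous action.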
Would you like more detail on how this works internally?