MultiAgent Setup

Tory · June 27, 2026, 10:17am

MultiAgent setup involving multiple agents.

RunLLM · June 27, 2026, 10:18am

Yes, hierarchical RL using a single MultiAgentEpisodeReplayBuffer is possible. The buffer is designed to store episodes from multi-agent environments, including hierarchical setups, by keeping track of experiences for each agent and policy separately, and can handle complex agent-policy mappings as in hierarchical RL scenarios. See MultiAgentEpisodeReplayBuffer source and official docs.

Would you like more detail on how to configure or use this buffer in a hierarchical RL setup?

Hint: Mention @Herald in the post for followups.

RunLLM · June 27, 2026, 10:26am

You can implement this by designing your environment’s step() method to only return observations for agents expected to act at each timestep. For example, at t1, return obs for both Manager and Workers; at t2–t4, only for Workers; at t5, again for Manager and Workers, and so on. This pattern is supported by RLlib’s multi-agent API and is compatible with a single MultiAgentEpisodeReplayBuffer as used in MASAC. See this discussion and test_multi_agent_episode.py for reference.

Here’s a minimal dummy environment (it uses the older Gym-style API with a single done dict):

from ray.rllib.env.multi_agent_env import MultiAgentEnv

class HierarchicalEnv(MultiAgentEnv):
    def __init__(self):
        super().__init__()
        self.t = 0
        # Timesteps are 0-based: reset() emits t=0, the first step() emits t=1.
        self.manager_steps = {0, 4, 8}  # t1, t5, t9 (0-based)
        self.worker_steps = set(range(13))  # t1-t13 (0-based)
        self.agents = ["manager", "worker_1", "worker_2"]

    def reset(self):
        self.t = 0
        return {"manager": 0, "worker_1": 0, "worker_2": 0}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        self.t += 1
        # Manager acts only at specified steps
        if self.t in self.manager_steps:
            obs["manager"] = self.t
            rew["manager"] = 0
            done["manager"] = False
            info["manager"] = {}
        # Workers act at every step
        if self.t in self.worker_steps:
            for wid in ["worker_1", "worker_2"]:
                obs[wid] = self.t
                rew[wid] = 0
                done[wid] = False
                info[wid] = {}
        done["__all__"] = self.t >= 12  # End after t13
        return obs, rew, done, info

This structure ensures only the relevant agents act at each timestep, matching your hierarchical RL timing.
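
To make the timing pattern easier to check, here is a small sketch that is independent of RLlib. It only reproduces the schedule from the dummy environment above; the helper name `agents_observed_at` and the module-level constants are made up for illustration and are not part of any RLlib API.

```python
# Toy schedule, independent of RLlib: which agents should appear in the
# observation dict at each 0-based timestep.
MANAGER_STEPS = {0, 4, 8}        # manager acts at t1, t5, t9 (0-based: 0, 4, 8)
WORKER_STEPS = set(range(13))    # workers act at every timestep 0..12
WORKERS = ["worker_1", "worker_2"]


def agents_observed_at(t):
    """Return the agent IDs whose observations would be emitted at timestep t."""
    agents = []
    if t in MANAGER_STEPS:
        agents.append("manager")
    if t in WORKER_STEPS:
        agents.extend(WORKERS)
    return agents


if __name__ == "__main__":
    # Print the schedule for the first few timesteps.
    for t in range(6):
        print(t, agents_observed_at(t))
```

Printing the schedule like this before training is a cheap way to confirm that the Manager only shows up at the intended timesteps.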


RunLLM · June 27, 2026, 11:20am

Certainly! To use terminated and truncated (as required by RLlib’s Gymnasium-style API), return both per-agent dicts and always include the "__all__" key. Here’s the modified dummy environment step:

from ray.rllib.env.multi_agent_env import MultiAgentEnv

class HierarchicalEnv(MultiAgentEnv):
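    def __init__(self):
        super().__init__()
        self.t = 0
        # Timesteps are 0-based: reset() emits t=0, the first step() emits t=1.
        self.manager_steps = {0, 4, 8}  # t1, t5, t9 (0-based)
        self.worker_steps = set(range(13))  # t1-t13 (0-based)
        # All agents are present from the start, so both lists are the same.
        self.agents = self.possible_agents = ["manager", "worker_1", "worker_2"]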

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return {"manager": 0, "worker_1": 0, "worker_2": 0}, {}

    def step(self, action_dict):
        obs, rew, terminated, truncated, info = {}, {}, {}, {}, {}
        self.t += 1
        # Manager acts only at specified steps
        if self.t in self.manager_steps:
            obs["manager"] = self.t
            rew["manager"] = 0
            terminated["manager"] = False
            truncated["manager"] = False
            info["manager"] = {}
        # Workers act at every step
        if self.t in self.worker_steps:
            for wid in ["worker_1", "worker_2"]:
                obs[wid] = self.t
                rew[wid] = 0
                terminated[wid] = False
                truncated[wid] = False
                info[wid] = {}
        # End after t13 (t=12, 0-based)
        terminated["__all__"] = self.t >= 12
        truncated["__all__"] = False
        return obs, rew, terminated, truncated, info

This follows the required API: always return both terminated and truncated dicts with "__all__" keys, as shown in test_multi_agent_env.py.
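
As a quick sanity check, a helper along the following lines could be run on whatever step() returns. This is only a sketch: `check_step_return` is a hypothetical name, not an RLlib function, and it only verifies the two points described above (a 5-tuple of dicts, with "__all__" in both terminated and truncated).

```python
def check_step_return(step_result):
    """Validate the shape of a Gymnasium-style multi-agent step() result.

    Hypothetical helper, not part of RLlib. Raises ValueError on problems.
    """
    if len(step_result) != 5:
        raise ValueError("step() must return (obs, rewards, terminated, truncated, infos)")
    names = ("obs", "rewards", "terminated", "truncated", "infos")
    for name, value in zip(names, step_result):
        if not isinstance(value, dict):
            raise ValueError(f"{name} must be a dict keyed by agent ID")
    _, _, terminated, truncated, _ = step_result
    for name, flags in (("terminated", terminated), ("truncated", truncated)):
        if "__all__" not in flags:
            raise ValueError(f'{name} dict is missing the "__all__" key')
    return True


if __name__ == "__main__":
    # Example result for a timestep at which only the workers act.
    result = (
        {"worker_1": 1, "worker_2": 1},
        {"worker_1": 0, "worker_2": 0},
        {"worker_1": False, "worker_2": False, "__all__": False},
        {"worker_1": False, "worker_2": False, "__all__": False},
        {"worker_1": {}, "worker_2": {}},
    )
    print(check_step_return(result))
```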

Sources:

test_multi_agent_env.py


RunLLM · June 27, 2026, 11:44am

This error means that when calling concat_episode, the current episode’s timestep (self.t) does not match the next episode’s starting timestep (other.t_started). This usually happens if episodes are not split or continued at the correct boundaries, or if your environment does not provide agent actions/observations in the expected sequence. See single_agent_episode.py source and related discussion.

Would you like more detail on how to debug or fix this in your environment?


RunLLM · June 27, 2026, 11:46am

This error typically occurs when the episode chunks for an agent are not aligned correctly—meaning the end timestep of one chunk (self.t) does not match the start timestep of the next chunk (other.t_started). This can happen if your environment skips agent actions/observations at certain timesteps or if the episode splitting logic is inconsistent. According to this Ray issue, you must ensure that for each agent, the episode chunks are concatenated only when self.t == other.t_started.

To fix this, check that your environment always provides agent data in a way that the timesteps line up across episode boundaries. If your agents act at different frequencies (as in hierarchical RL), make sure that the episode splitting and concatenation logic accounts for “hanging” actions or observations, and that the MultiAgentEpisode and SingleAgentEpisode objects are constructed with consistent t and t_started values for each agent. See the robust test and fix approach in PR #59895.
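
The alignment condition can be illustrated with a toy model. The class below is a hypothetical stand-in, not RLlib's SingleAgentEpisode; it only assumes that a chunk's current timestep t equals its start timestep plus the number of steps it has recorded, and that concatenation requires self.t == other.t_started.

```python
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    """Hypothetical stand-in for one agent's episode chunk (not RLlib's class)."""
    t_started: int
    rewards: list = field(default_factory=list)

    @property
    def t(self):
        # Current timestep = start timestep + number of recorded steps.
        return self.t_started + len(self.rewards)

    def concat(self, other):
        # Chunks may only be joined if the next one starts where this one ends.
        if self.t != other.t_started:
            raise ValueError(
                f"cannot concat: self.t={self.t} != other.t_started={other.t_started}"
            )
        self.rewards.extend(other.rewards)


if __name__ == "__main__":
    first = ToyChunk(t_started=0, rewards=[0.0, 0.0, 0.0])  # covers steps 0..2, t == 3
    aligned = ToyChunk(t_started=3, rewards=[0.0, 0.0])      # starts at 3: fine
    first.concat(aligned)
    print(first.t)

    misaligned = ToyChunk(t_started=7, rewards=[0.0])        # gap between 5 and 7
    try:
        first.concat(misaligned)
    except ValueError as err:
        print(err)
```

In this toy model, a gap appears whenever a chunk's start timestep is counted on a different clock than the previous chunk's end, which is the kind of mismatch to look for when agents act at different frequencies.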

Sources:

PR #59895


RunLLM · June 29, 2026, 11:26am

You do not need to manually accumulate Worker rewards; RLlib automatically accumulates rewards for each agent between their actions, including in hierarchical RL. Any rewards given to an agent when it is not acting are stored and summed until the agent acts again, at which point the total is attributed to the last action. See multi-agent reward handling discussion and multi-agent episode code.
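
The accumulation rule described above can be sketched in plain Python. This is a toy illustration of the idea only, not RLlib's internal code; the function name `attribute_rewards` and the event format are assumptions made for the example.

```python
def attribute_rewards(events):
    """Sum rewards between actions for ONE agent.

    `events` is a list of ("act", None) or ("reward", value) tuples in time
    order. Every reward is added to the total of the most recent action.
    Toy illustration only; not RLlib's implementation.
    """
    totals = []
    for kind, value in events:
        if kind == "act":
            # A new action opens a new accumulator.
            totals.append(0.0)
        elif kind == "reward":
            if not totals:
                # In this toy, rewards arriving before any action are dropped.
                continue
            totals[-1] += value
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return totals


if __name__ == "__main__":
    # Manager acts, then receives rewards on two steps where it does not act,
    # then acts again and receives one more reward.
    manager_events = [
        ("act", None),
        ("reward", 1.0),
        ("reward", 2.0),
        ("act", None),
        ("reward", 0.5),
    ]
    print(attribute_rewards(manager_events))
```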

Would you like more detail on how this works internally?


RunLLM · July 9, 2026, 6:24am

You can read the source files for RLlib’s MASAC (Multi-Agent Soft Actor-Critic) by exploring the RLlib repository, specifically the SAC algorithm files and multi-agent utilities. The main SAC implementation is in rllib/algorithms/sac/sac.py, and multi-agent support is handled by replay buffers like multi_agent_episode_buffer.py. You may also want to look at example scripts such as multi_agent_pendulum_sac.py.
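
If Ray is installed locally, one way to find these files on disk is to ask Python where a module lives. The sketch below uses only the standard library; `module_source_path` is a hypothetical helper name, and the dotted path `ray.rllib.algorithms.sac.sac` is assumed to correspond to the rllib/algorithms/sac/sac.py file mentioned above.

```python
import importlib.util


def module_source_path(module_name):
    """Return the file path of a module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package (for example ray) is not installed.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Prints None when Ray is not installed in the current environment.
    print(module_source_path("ray.rllib.algorithms.sac.sac"))
```

Opening the printed path in an editor shows the exact version of the code that is running, which may differ from the repository's main branch.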

Would you like a step-by-step guide on navigating these files?

