Multi-Agent PPO rollout stalls + “duplicate agent at same env-t” assertion under Connector-v2 (Ray 2.49.2)

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.49.2
  • Python version: 3.10.12
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: On-Premise Cluster
  • Other libs/tools (if relevant): Gymnasium

3. What happened vs. what you expected:

  • Expected: Rollouts proceed with the shared LSTM policy even when Connector-v2 delivers actions for a given env-t in chunks; the env can wait for the full action dict (or fill in no-ops) without breaking episode bookkeeping.
  • Actual: Waiting for missing actions by returning empty dicts starves/stalls the rollout; re-emitting observations while waiting raises AssertionError: agent_id not in self.env_t_to_agent_t. Only the "always advance one env step and synthesize no-ops" pattern is stable.

TL;DR

I’m building a 6-agent cooperative shuttle env with a shared LSTM policy. Under Connector-v2, actions often arrive in chunks per env-time. If my env waits for missing actions (by returning empty dicts), rollouts can starve and stall. If I re-emit obs while waiting, I hit:

AssertionError: agent_id not in self.env_t_to_agent_t

The only stable workaround so far is to always advance one env step per step() call and synthesize no-ops for missing actions, never returning empties mid-episode. I’d love guidance on the recommended synchronization pattern for MA envs with Connector-v2 (and any preferred config flags) so we avoid both starvation and duplicate-at-t assertions.

What I’m trying to build

  • 6 shuttles on a grid, shared single LSTM policy (PPO).

  • Per-agent 116-dim observation; 3 discrete actions.

  • Public env API uses string keys ('0'..'5'); internal dicts use int keys (0..5); conversions happen only at the boundary (see the sketch right after this list).

  • Goal: strict, deterministic step contract that plays nicely with Connector-v2’s partial action delivery.
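
To make the ID handling concrete, here is what the boundary conversion in the list above amounts to. The helper names are purely illustrative (not an RLlib API); the MRE below simply inlines the same dict comprehensions:

# boundary_ids_sketch.py -- illustrative string<->int agent-ID boundary helpers
# (hypothetical names, not an RLlib API; the MRE env below inlines the same logic)
from typing import Dict

import numpy as np


def actions_to_internal(actions_ext: Dict[str, int]) -> Dict[int, int]:
    """External string-keyed actions -> internal int-keyed actions."""
    return {int(aid): int(a) for aid, a in actions_ext.items()}


def obs_to_external(obs_int: Dict[int, np.ndarray]) -> Dict[str, np.ndarray]:
    """Internal int-keyed observations -> external string-keyed observations."""
    return {str(aid): o for aid, o in obs_int.items()}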

Minimal Repro Env (stands alone)

This env demonstrates:

  • Boundary ID conversions (string public, int private).

  • Two modes in step():

    • WAIT_FOR_ALL = False → no-op fill (advances every call) → stable.

    • WAIT_FOR_ALL = True → return empties while waiting → tends to stall / trigger the assertion depending on runner settings.

Run exactly as-is with the PPO config below. Flip WAIT_FOR_ALL to see the difference.

# shuttle_env_mre.py
import numpy as np
from typing import Dict, Any
from ray.rllib.env.multi_agent_env import MultiAgentEnv
import gymnasium as gym
from gymnasium import spaces

NUM_AGENTS = 6
OBS_DIM = 116
NO_OP = 0  # synthetic no-op action we use for missing actions

# Toggle this to see the two behaviors:
# - False: always advance, synthesize no-ops (works)
# - True:  wait by returning empties (can stall / trigger assertion under Connector-v2)
WAIT_FOR_ALL = False


class ShuttleEnvMRE(MultiAgentEnv):
    def __init__(self, env_config: Dict[str, Any] | None = None):
        super().__init__()
        self.num_agents = NUM_AGENTS
        self.agent_ids_int = list(range(self.num_agents))           # internal ints
        self.agent_ids_str = [str(i) for i in self.agent_ids_int]   # external strings
        # New API stack MultiAgentEnv bookkeeping: fixed agent set for the whole episode.
        self.agents = list(self.agent_ids_str)
        self.possible_agents = list(self.agent_ids_str)

        # Single-agent leaf spaces (per agent)
        self.single_observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32
        )
        self.single_action_space = spaces.Discrete(3)

        # Public MultiAgentEnv spaces are often declared as Dicts keyed by strings
        # (RLlib does not strictly require these attributes, but it's informative)
        self.observation_space = gym.spaces.Dict({
            aid: self.single_observation_space for aid in self.agent_ids_str
        })
        self.action_space = gym.spaces.Dict({
            aid: self.single_action_space for aid in self.agent_ids_str
        })

        # Dummy internal state
        self.t = 0
        self._last_obs = None

    def reset(self, *, seed=None, options=None):
        self.t = 0
        obs = {
            aid_str: self._make_obs(int(aid_str))
            for aid_str in self.agent_ids_str
        }
        self._last_obs = obs
        infos = {aid_str: {} for aid_str in self.agent_ids_str}
        return obs, infos

    def step(self, action_dict: Dict[str, int]):
        # action_dict keys come in as strings externally
        provided = set(action_dict.keys())
        expected = set(self.agent_ids_str)
        missing = list(expected - provided)

        if WAIT_FOR_ALL and missing:
            # "Wait" branch: return empties and do NOT advance time/state.
            # This is the pattern that can lead to starvation or, if you try to
            # re-emit obs, the duplicate-at-t assertion.
            return {}, {}, {"__all__": False}, {"__all__": False}, {}

        # Fill missing actions with a synthetic no-op and convert all to ints
        actions_int = {}
        for aid_str in self.agent_ids_str:
            a = action_dict.get(aid_str, NO_OP)
            actions_int[int(aid_str)] = int(a)  # boundary conversion

        # --- Advance exactly one env step ---
        self.t += 1
        # (Real env would update per-agent internal states based on actions here)

        # Build next obs/reward/terminations
        obs = {aid_str: self._make_obs(int(aid_str)) for aid_str in self.agent_ids_str}
        rewards = {aid_str: 0.0 for aid_str in self.agent_ids_str}

        # End after a short horizon to keep episodes tiny in the repro.
        # Per-agent flags mirror "__all__" so every agent gets a final observation
        # when the episode ends.
        done = (self.t >= 16)
        term = {aid_str: done for aid_str in self.agent_ids_str}
        trunc = {aid_str: False for aid_str in self.agent_ids_str}
        term["__all__"] = done
        trunc["__all__"] = False

        infos = {aid_str: {} for aid_str in self.agent_ids_str}
        self._last_obs = obs
        return obs, rewards, term, trunc, infos

    def _make_obs(self, aid_int: int) -> np.ndarray:
        # Deterministic per-agent obs; real env would fill meaningful features
        rng = np.random.default_rng(seed=aid_int + self.t * 12345)
        return rng.standard_normal(OBS_DIM, dtype=np.float32)
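
For quick sanity checks outside RLlib, here is a minimal smoke test of the two step() contracts; it assumes only the env file above and nothing else:

# smoke_test_mre.py -- manual check of the partial-action behavior (no RLlib involved)
from shuttle_env_mre import ShuttleEnvMRE

env = ShuttleEnvMRE()
obs, infos = env.reset()
assert sorted(obs.keys()) == [str(i) for i in range(6)]

# Simulate Connector-v2-style partial delivery: only three of six agents act this call.
partial_actions = {"0": 1, "2": 2, "4": 0}
obs, rewards, term, trunc, infos = env.step(partial_actions)

# With WAIT_FOR_ALL = False the env advances (t == 1) and returns full dicts;
# with WAIT_FOR_ALL = True it stays at t == 0 and returns empties.
print("t =", env.t, "| obs keys:", sorted(obs.keys()), "| __all__:", term.get("__all__"))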

PPO Config (Connector-v2 default stack)

This uses a shared policy for all 6 agents, keeps things single-runner with tiny fragments to make timing effects obvious, and enables LSTM in the model config (you can disable LSTM to see it’s not the culprit).

# train_mre.py
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig
from shuttle_env_mre import ShuttleEnvMRE

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)

    def policy_mapping_fn(agent_id: str, *args, **kwargs):
        return "shared_policy"

    # One env instance just to read off the per-agent leaf spaces for the shared policy.
    spaces_env = ShuttleEnvMRE(None)

    config = (
        PPOConfig()
        .environment(env=ShuttleEnvMRE, env_config={})
        .framework("torch")
        # Make delivery/chunking effects visible; complete episodes are helpful
        # while debugging multi-agent timing:
        .env_runners(
            num_env_runners=1,  # single runner for determinism in repro
            rollout_fragment_length=1,
            batch_mode="complete_episodes",
        )
        .resources(num_gpus=0)
        .training(
            gamma=0.99,
            train_batch_size=256,
            minibatch_size=64,  # new API stack name (formerly sgd_minibatch_size)
        )
        # New API stack: LSTM options live on the RLModule's model config
        # (flip use_lstm to False to confirm the issue isn't LSTM-specific).
        .rl_module(
            model_config=DefaultModelConfig(
                use_lstm=True,
                lstm_cell_size=64,
                lstm_use_prev_action=True,
                lstm_use_prev_reward=True,
            ),
        )
        .multi_agent(
            policies={
                "shared_policy": (
                    None,               # use RLlib's default module for the repro
                    spaces_env.single_observation_space,
                    spaces_env.single_action_space,
                    {},
                )
            },
            policy_mapping_fn=policy_mapping_fn,
            policies_to_train=["shared_policy"],
        )
    )

    algo = config.build()
    for _ in range(10):
        result = algo.train()
        # New API stack reports episode metrics under the "env_runners" sub-dict.
        env_runner_results = result.get("env_runners", {})
        print(
            f"iter={result['training_iteration']} "
            f"episode_return_mean={env_runner_results.get('episode_return_mean')} "
            f"num_episodes={env_runner_results.get('num_episodes')}"
        )

How to run:

python train_mre.py
  1. With WAIT_FOR_ALL = False (default): env always advances with no-op fill; training proceeds.
  2. With WAIT_FOR_ALL = True: env returns empties while waiting → I observe stalls / “learner returns empty actions” patterns; if I try to re-emit obs to unblock, I can trigger:
AssertionError: agent_id not in self.env_t_to_agent_t

(duplicate agent emitted at the same env-time index)

Background (short)

  • Earlier, I also hit the classic space- and ID-mismatch issues:

    • Declaring single-agent spaces vs. per-agent Dict spaces (TypeError: 'Box' object is not iterable).

    • String vs. int agent-ID mismatch (KeyError: '0') — fixed by strict boundary conversions: strings outside, ints inside.

  • Those are resolved; the core remaining issue is multi-agent step synchronization with partial action delivery under Connector-v2.

Questions for RLlib devs

  1. Is the “no-op fill” pattern the recommended contract for MA envs under Connector-v2 when actions are delivered in chunks at a given env-t?

    • i.e., never return empties mid-episode; always synthesize actions for missing agents and advance exactly once per step() call.
  2. If “no-op fill” is recommended, is there a canonical way to declare “no-op” at the algorithm/model level so we don’t hard-code it in the env (e.g., via connectors or a per-policy default action)?

  3. Are there config flags that coax the runner to deliver complete action dicts per env-t (for debugging or production), avoiding chunking entirely?

    • e.g., specific env_runners() / sampler settings beyond rollout_fragment_length=1, or a “classic delivery” switch?
  4. Any pitfalls with Dict spaces vs single-agent leaf spaces I should avoid when the policy is shared across all agents (Connector-v2 expectations)?

Notes you might ask about (already addressed)

  • Agent IDs: Externals '0'..'5', internals 0..5; convert only at the boundary; internal dicts never store string keys.

  • LSTM I/O: Using the default Torch model here; in my project, a custom RLModule returns actions: Long[B], logits, value, and a state_out list — conforms to Connector-v2 (see the output-dict sketch after this list).

  • Terminations: terminateds/truncateds include both per-agent flags and "__all__"; agent set is fixed for the episode (no dynamic joins/exits).
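
To make the LSTM I/O bullet concrete, this is roughly the output-dict shape the custom RLModule produces per forward pass. Only the Columns constants come from RLlib; build_forward_out, the zero tensors, and the shapes are placeholders of mine, and the state_out argument just mirrors whatever structure state_in had:

# rlmodule_output_sketch.py -- shape of a Connector-v2-compatible forward output
# (hedged sketch: `build_forward_out` and the zero tensors are placeholders,
#  only the `Columns` keys come from RLlib)
import torch
from ray.rllib.core.columns import Columns


def build_forward_out(batch_size: int, num_actions: int, state_out):
    """Illustrative per-forward output for a discrete-action LSTM module."""
    return {
        Columns.ACTIONS: torch.zeros(batch_size, dtype=torch.long),        # sampled actions, Long[B]
        Columns.ACTION_DIST_INPUTS: torch.zeros(batch_size, num_actions),  # logits
        Columns.VF_PREDS: torch.zeros(batch_size),                         # value estimates
        Columns.STATE_OUT: state_out,  # same nested structure as the incoming state_in
    }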

It may be that the “no-op fill” pattern—where your environment always advances one step per call and synthesizes actions for missing agents—is the recommended approach for multi-agent environments under Connector-v2, especially when actions may arrive in chunks per environment time step. Returning empty dicts mid-episode (i.e., waiting for all actions before advancing) risks rollout starvation or assertion errors, as you observed. This aligns with the behavior described in Ray’s multi-agent docs and test environments, where environments are expected to advance deterministically and handle missing actions gracefully, often by providing a default or no-op action for agents that did not act in the current step. There does not appear to be a built-in config flag that forces the runner to always deliver complete action dicts per env-t; the environment is expected to handle partial action delivery robustly by design. See the discussion and code in different_spaces_for_agents.py and the multi-agent environment docs.

Though I’m not entirely certain, it might be that there is no canonical way to declare a “no-op” at the algorithm/model level in RLlib; the environment typically defines what a no-op means for its agents. RLlib expects the environment to provide a valid action for every agent at each step, and the policy/model does not have a built-in mechanism for specifying a default action for missing agents. Regarding Dict spaces, as long as your policy mapping and observation/action space definitions are consistent and you handle agent ID conversions at the boundary, you should be fine—this is reflected in the multi-agent API examples. If you want more detail or code examples, let me know!


@RunLLM

Thank you for your response!

We’ve confirmed that our environment and Connector V2 setup are fine — the rollout stalls occur only when using an LSTM policy. The issue seems to stem from how Connector V2 delivers partial actions in a multi-agent setup: not all agents act every step. Since an LSTM expects continuous state_in → state_out transitions, skipped actions cause some agents’ hidden states to fall out of sync. When those agents later receive new actions, their observations get re-emitted with stale states, creating duplicate (agent_id, t) entries and triggering assertion errors like “Agent acted and then got truncated but did not receive a final observation.” In short, this is a multi-agent + LSTM state synchronization problem under Connector V2’s partial-action semantics, and we’re currently testing explicit per-agent state bookkeeping to handle it (see the sketch below). It also seemed worth asking whether you or your team have experience with situations like this and could help us configure this setup correctly.
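
For context, this is roughly the per-agent bookkeeping we are experimenting with; it is our own sketch, not an RLlib facility, and names like PerAgentStateBook are invented:

# per_agent_state_sketch.py -- what we mean by "explicit per-agent state bookkeeping"
# (our own experiment, not an RLlib facility; all names are invented for illustration)
from typing import Any, Dict, Iterable


class PerAgentStateBook:
    """Caches the last recurrent state per agent and only updates it when that
    agent actually acted at the current env step, so skipped steps don't desync it."""

    def __init__(self, agent_ids: Iterable[str]):
        # None means "use the module's fresh initial state next time this agent acts".
        self._states: Dict[str, Any] = {aid: None for aid in agent_ids}

    def state_for(self, agent_id: str) -> Any:
        """State to feed in the next time this agent acts."""
        return self._states[agent_id]

    def update(self, acted_state_out: Dict[str, Any]) -> None:
        """Record state_out only for agents that acted this step; others keep theirs."""
        for agent_id, state_out in acted_state_out.items():
            self._states[agent_id] = state_out

    def reset(self) -> None:
        """Drop all cached states at episode boundaries."""
        for agent_id in self._states:
            self._states[agent_id] = None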

It seems likely that this is a known challenge: in multi-agent RLlib setups with LSTM policies, Connector V2’s partial action delivery can cause LSTM hidden states to become unsynchronized if not every agent acts at every step. This can result in stale or mismatched state_in/state_out transitions, leading to assertion errors or duplicate (agent_id, t) entries, as you described. The recommended approach appears to be explicit per-agent LSTM state tracking within your environment or rollout logic, ensuring that each agent’s hidden state is only updated when it actually acts, and otherwise preserved for the next time it is scheduled to act. This pattern is discussed in this Ray forum thread, where users confirm that per-agent state management is necessary for correct LSTM operation in multi-agent, partially-observing environments.

My understanding is that there is no built-in RLlib mechanism to automatically synchronize LSTM states for agents that skip steps under Connector V2; manual bookkeeping is required. If you want a step-by-step example or more detail on how to implement this, let me know!
