1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.49.2
- Python version: 3.10.12
- OS: Ubuntu 22.04
- Cloud/Infrastructure: On-Premise Cluster
- Other libs/tools (if relevant): Gymnasium
3. What happened vs. what you expected:
- Expected:
- Actual:
TL;DR
I’m building a 6-agent cooperative shuttle env with a shared LSTM policy. Under Connector-v2, actions often arrive in chunks per env-time. If my env waits for missing actions (by returning empty dicts), rollouts can starve and stall. If I re-emit obs while waiting, I hit:
AssertionError: agent_id not in self.env_t_to_agent_t
The only stable workaround so far is to always advance one env step per step() call and synthesize no-ops for missing actions, never returning empties mid-episode. I’d love guidance on the recommended synchronization pattern for MA envs with Connector-v2 (and any preferred config flags) so we avoid both starvation and duplicate-at-t assertions.
What I’m trying to build
- 6 shuttles on a grid, shared single LSTM policy (PPO).
- Per-agent 116-dim observation; 3 discrete actions.
- Public env API uses string keys ('0'..'5'); internal dicts use int keys (0..5). I do boundary conversions only (see the key-conversion sketch below).
- Goal: strict, deterministic step contract that plays nicely with Connector-v2's partial action delivery.
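For clarity, the boundary conversion I mean is just a pair of tiny helpers at the public API edge (a minimal sketch; the MRE below inlines the same idea instead of using named helpers):

```python
# Hypothetical helpers illustrating the "strings outside, ints inside" boundary.
# The MRE below performs the same conversion inline in step().
from typing import Dict, TypeVar

V = TypeVar("V")

def to_internal(d: Dict[str, V]) -> Dict[int, V]:
    """Convert RLlib-facing string agent IDs ('0'..'5') to internal int keys."""
    return {int(aid): v for aid, v in d.items()}

def to_external(d: Dict[int, V]) -> Dict[str, V]:
    """Convert internal int agent IDs (0..5) back to string keys for RLlib."""
    return {str(aid): v for aid, v in d.items()}
```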
Minimal Repro Env (stands alone)
This env demonstrates:
- Boundary ID conversions (string public, int private).
- Two modes in step():
  - WAIT_FOR_ALL = False ⇒ no-op fill (advances every call) → stable.
  - WAIT_FOR_ALL = True ⇒ return empties while waiting → tends to stall / trigger the assertion, depending on runner settings.
- Run exactly as-is with the PPO config below. Flip WAIT_FOR_ALL to see the difference.
```python
# shuttle_env_mre.py
import numpy as np
from typing import Any, Dict

import gymnasium as gym
from gymnasium import spaces
from ray.rllib.env.multi_agent_env import MultiAgentEnv

NUM_AGENTS = 6
OBS_DIM = 116
NO_OP = 0  # synthetic no-op action we use for missing actions

# Toggle this to see the two behaviors:
# - False: always advance, synthesize no-ops (works)
# - True: wait by returning empties (can stall / trigger assertion under Connector-v2)
WAIT_FOR_ALL = False


class ShuttleEnvMRE(MultiAgentEnv):
    def __init__(self, env_config: Dict[str, Any] | None = None):
        super().__init__()
        self.num_agents = NUM_AGENTS
        self.agent_ids_int = list(range(self.num_agents))          # internal ints
        self.agent_ids_str = [str(i) for i in self.agent_ids_int]  # external strings

        # Single-agent leaf spaces (per agent)
        self.single_observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32
        )
        self.single_action_space = spaces.Discrete(3)

        # Public MultiAgentEnv spaces are often declared as Dicts keyed by strings
        # (RLlib does not strictly require these attributes, but it's informative).
        self.observation_space = gym.spaces.Dict({
            aid: self.single_observation_space for aid in self.agent_ids_str
        })
        self.action_space = gym.spaces.Dict({
            aid: self.single_action_space for aid in self.agent_ids_str
        })

        # Dummy internal state
        self.t = 0
        self._last_obs = None

    def reset(self, *, seed=None, options=None):
        self.t = 0
        obs = {
            aid_str: self._make_obs(int(aid_str))
            for aid_str in self.agent_ids_str
        }
        self._last_obs = obs
        infos = {aid_str: {} for aid_str in self.agent_ids_str}
        return obs, infos

    def step(self, action_dict: Dict[str, int]):
        # action_dict keys come in as strings externally.
        provided = set(action_dict.keys())
        expected = set(self.agent_ids_str)
        missing = list(expected - provided)

        if WAIT_FOR_ALL and missing:
            # "Wait" branch: return empties and do NOT advance time/state.
            # This is the pattern that can lead to starvation or, if you try to
            # re-emit obs, the duplicate-at-t assertion.
            return {}, {}, {"__all__": False}, {"__all__": False}, {}

        # Fill missing actions with a synthetic no-op and convert all to ints.
        actions_int = {}
        for aid_str in self.agent_ids_str:
            a = action_dict.get(aid_str, NO_OP)
            actions_int[int(aid_str)] = int(a)  # boundary conversion

        # --- Advance exactly one env step ---
        self.t += 1
        # (Real env would update per-agent internal states based on actions here.)

        # Build next obs/rewards/terminations.
        obs = {aid_str: self._make_obs(int(aid_str)) for aid_str in self.agent_ids_str}
        rewards = {aid_str: 0.0 for aid_str in self.agent_ids_str}
        term = {aid_str: False for aid_str in self.agent_ids_str}
        trunc = {aid_str: False for aid_str in self.agent_ids_str}

        # End after a short horizon to keep episodes tiny in the repro.
        done = (self.t >= 16)
        term["__all__"] = done
        trunc["__all__"] = False

        infos = {aid_str: {} for aid_str in self.agent_ids_str}
        self._last_obs = obs
        return obs, rewards, term, trunc, infos

    def _make_obs(self, aid_int: int) -> np.ndarray:
        # Deterministic per-agent obs; a real env would fill meaningful features.
        rng = np.random.default_rng(seed=aid_int + self.t * 12345)
        return rng.standard_normal(OBS_DIM, dtype=np.float32)
```
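To rule out RLlib entirely, a plain random-action loop is enough to exercise the step contract (a small driver I use for sanity checks; the file name sanity_check.py is just illustrative and not part of the repro files above):

```python
# sanity_check.py (illustrative only): drive ShuttleEnvMRE without RLlib to
# confirm the env honors the 5-tuple MultiAgentEnv contract.
from shuttle_env_mre import ShuttleEnvMRE

env = ShuttleEnvMRE({})
obs, infos = env.reset(seed=0)
assert set(obs) == {str(i) for i in range(6)}

done = False
while not done:
    # Simulate "partial delivery" by sending actions for only half the agents;
    # with WAIT_FOR_ALL = False the env no-op-fills the rest and still advances.
    actions = {aid: env.single_action_space.sample() for aid in list(obs)[:3]}
    obs, rewards, term, trunc, infos = env.step(actions)
    done = term["__all__"] or trunc["__all__"]

print("episode finished at t =", env.t)
```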
PPO Config (Connector-v2 default stack)
This uses a shared policy for all 6 agents, keeps things single-runner with tiny fragments to make timing effects obvious, and enables LSTM in the model config (you can disable LSTM to see it’s not the culprit).
```python
# train_mre.py
import ray
from ray.rllib.algorithms.ppo import PPOConfig

from shuttle_env_mre import ShuttleEnvMRE, NUM_AGENTS

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)

    def policy_mapping_fn(agent_id: str, *args, **kwargs):
        return "shared_policy"

    config = (
        PPOConfig()
        .environment(env=ShuttleEnvMRE, env_config={})
        .framework("torch")
        # Make delivery/chunking effects visible; "complete_episodes" is
        # helpful while debugging MA timing:
        .env_runners(
            num_env_runners=1,  # single runner for determinism in the repro
            rollout_fragment_length=1,
            batch_mode="complete_episodes",
        )
        .resources(num_gpus=0)
        .training(
            gamma=0.99,
            train_batch_size=256,
            sgd_minibatch_size=64,
            model={
                "use_lstm": True,  # flip to False to confirm it's not LSTM-specific
                "lstm_cell_size": 64,
                "lstm_use_prev_action": True,
                "lstm_use_prev_reward": True,
            },
        )
        .multi_agent(
            policies={
                "shared_policy": (
                    None,  # use RLlib's default Torch model for the repro
                    ShuttleEnvMRE(None).single_observation_space,
                    ShuttleEnvMRE(None).single_action_space,
                    {},
                )
            },
            policy_mapping_fn=policy_mapping_fn,
            policies_to_train=["shared_policy"],
        )
    )

    algo = config.build()
    for _ in range(10):
        result = algo.train()
        print(
            f"iter={result['training_iteration']} "
            f"ep_reward_mean={result.get('episode_reward_mean')} "
            f"eps={result.get('episodes_this_iter')}"
        )
```
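One caveat with the print loop above: on the new API stack the flat episode_reward_mean key may come back as None; in my runs the return metric shows up under the nested env-runner results instead. Treat the key names below as my assumption about recent RLlib result layouts, not a documented contract:

```python
# Assumed result layout on the new API stack: episode returns may live under
# result["env_runners"]["episode_return_mean"]. Falls back to the old flat key.
def episode_return_mean(result: dict):
    nested = result.get("env_runners", {}) or {}
    return nested.get("episode_return_mean", result.get("episode_reward_mean"))

# Usage inside the training loop: print(episode_return_mean(result))
```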
How to run:
python train_mre.py
- With WAIT_FOR_ALL = False (default): env always advances with no-op fill; training proceeds.
- With WAIT_FOR_ALL = True: env returns empties while waiting → I observe stalls / "learner returns empty actions" patterns; if I try to re-emit obs to unblock, I can trigger:
  AssertionError: agent_id not in self.env_t_to_agent_t
  (duplicate agent emitted at the same env-time index; a sketch of that re-emit variant follows below)
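For completeness, this is roughly what the "re-emit obs while waiting" variant looks like. It is not part of shuttle_env_mre.py; it is a hypothetical third mode I tried earlier, and it is the pattern that produces the duplicate-at-t assertion for me:

```python
# Hypothetical variant of the env (NOT in shuttle_env_mre.py): while actions
# are missing, re-send the cached obs for those agents and do not advance
# self.t. Under Connector-v2 this can register the same agent twice at one
# env-time index and trip:
#   AssertionError: agent_id not in self.env_t_to_agent_t
from typing import Dict

from shuttle_env_mre import ShuttleEnvMRE, WAIT_FOR_ALL


class ShuttleEnvReEmit(ShuttleEnvMRE):
    def step(self, action_dict: Dict[str, int]):
        missing = [aid for aid in self.agent_ids_str if aid not in action_dict]
        if WAIT_FOR_ALL and missing:
            # Re-emit the last obs only for the still-missing agents; time stands still.
            obs = {aid: self._last_obs[aid] for aid in missing}
            rewards = {aid: 0.0 for aid in missing}
            term = {aid: False for aid in missing}
            trunc = {aid: False for aid in missing}
            term["__all__"] = False
            trunc["__all__"] = False
            infos = {aid: {} for aid in missing}
            return obs, rewards, term, trunc, infos
        # Otherwise fall back to the MRE's no-op-fill behavior.
        return super().step(action_dict)
```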
Background (short)
- Earlier, I also hit the classic space-mismatch issues:
  - Declaring single-agent leaf spaces vs. per-agent Dict spaces (TypeError: 'Box' object is not iterable); see the snippet below.
  - String vs. int agent-ID mismatch (KeyError: '0'), fixed by strict boundary conversions: strings outside, ints inside.
- Those are resolved; the core remaining issue is multi-agent step synchronization with partial action delivery under Connector-v2.
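For anyone hitting the same TypeError, the shape of the fix was simply to declare per-agent Dict spaces keyed by the external agent IDs rather than handing RLlib a single leaf space (a minimal sketch mirroring what ShuttleEnvMRE.__init__ above does):

```python
# Minimal sketch of the space declarations that resolved the TypeError for me
# (same pattern as in ShuttleEnvMRE.__init__; values here are illustrative).
import numpy as np
from gymnasium import spaces

OBS_DIM = 116
AGENT_IDS = [str(i) for i in range(6)]

leaf_obs = spaces.Box(low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32)
leaf_act = spaces.Discrete(3)

# Per-agent Dict spaces keyed by the *external* string agent IDs.
observation_space = spaces.Dict({aid: leaf_obs for aid in AGENT_IDS})
action_space = spaces.Dict({aid: leaf_act for aid in AGENT_IDS})
```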
Questions for RLlib devs
- Is the "no-op fill" pattern the recommended contract for MA envs under Connector-v2 when actions are delivered in chunks at a given env-t?
  - i.e., never return empties mid-episode; always synthesize actions for missing agents and advance exactly once per step() call.
- If "no-op fill" is recommended, is there a canonical way to declare the "no-op" at the algorithm/model level so we don't hard-code it in the env (e.g., via connectors or a per-policy default action)?
- Are there config flags that coax the runner into delivering complete action dicts per env-t (for debugging or production), avoiding chunking entirely?
  - e.g., specific env_runners() / sampler settings beyond rollout_fragment_length=1, or a "classic delivery" switch?
- Any pitfalls with Dict spaces vs. single-agent leaf spaces I should avoid when the policy is shared across all agents (Connector-v2 expectations)?
Notes you might ask about (already addressed)
- Agent IDs: externally '0'..'5', internally 0..5; conversion happens only at the boundary; internal dicts never store string keys.
- LSTM I/O: using the default Torch model here; in my project, a custom RLModule returns actions: Long[B], logits, value, and a state_out list, conforming to Connector-v2 (rough shape sketched below).
- Terminations: terminateds/truncateds include both per-agent flags and "__all__"; the agent set is fixed for the episode (no dynamic joins/exits).
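To make the second note concrete, the forward output of that custom RLModule has roughly this shape. This is an illustrative torch sketch only; the key names are my assumption about the new-stack column conventions, and the actual module is not part of this repro:

```python
# Illustrative only: rough shape of the forward output my custom RLModule
# produces for a batch of B timesteps (B=4 here). Key names are assumed, not
# copied from the real module.
import torch

B, NUM_ACTIONS, CELL = 4, 3, 64

forward_out = {
    "actions": torch.randint(0, NUM_ACTIONS, (B,), dtype=torch.long),  # Long[B]
    "action_dist_inputs": torch.randn(B, NUM_ACTIONS),                 # logits
    "vf_preds": torch.randn(B),                                        # value
    "state_out": [torch.zeros(B, CELL), torch.zeros(B, CELL)],         # LSTM [h, c]
}
```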