Potential bug in trajectory view API for multiagent envs

Does the trajectory view API support multiagent environments? I’m currently hitting the following internal error from trainer.train():

  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 327, in gen_rollouts
    yield self.sample()
  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 661, in sample
    batches = [self.input_reader.next()]
  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 94, in next
    batches = [self.get_data()]
  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 223, in get_data
    item = next(self.rollout_provider)
  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 669, in _env_runner
  File "/home/hex/anaconda3/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1446, in _process_policy_eval_results
    env_id: int = eval_data[i].env_id
IndexError: list index out of range

As you can see below, I have one environment with multiple agents. The for loop in _process_policy_eval_results seems to assume one action per environment.

(Pdb) p actions
[{<class 'forge.blade.io.action.static.Attack'>: {<class 'forge.blade.io.action.static.Style'>: 2, <class 'forge.blade.io.action.static.Target'>: 57}, <class 'forge.blade.io.action.static.Move'>: {<class 'forge.blade.io.action.static.Direction'>: 1}}, {<class 'forge.blade.io.action.static.Attack'>: {<class 'forge.blade.io.action.static.Style'>: 0, <class 'forge.blade.io.action.static.Target'>: 46}, <class 'forge.blade.io.action.static.Move'>: {<class 'forge.blade.io.action.static.Direction'>: 1}}]
(Pdb) p eval_data
[PolicyEvalData(env_id=0, agent_id=198, obs=array([  0.,   0.,  10., ...,   2.,  24., 137.], dtype=float32), info={}, rnn_state=[array([[0., 0.]], dtype=float32), array([[0., 0.]], dtype=float32)], prev_action=None, prev_reward=0.0)]

Thanks for raising this, @jsuarez5341.
Hmm, I’m actually sure it does handle these cases well (it’s on by default now and all our multi-agent tests are passing, even those where the agents don’t step at the same time). Could you provide a repro script that I can debug?

@sven1977 Thanks for the reply. It’s always tough for me to provide repros, since these issues show up in the context of my larger project and it’s difficult to determine which exact combination of RLlib features triggers them. I’ll do some more digging to see if I can at least isolate it. I’ll check back soon and try to get to the bottom of this asap, since it would be really nice to have this fix in 1.2.

@sven1977 Have been working on isolating the bug. Progress thus far:

  • Bug occurs with either count_steps_by option
  • Bug occurs with either parametric or flat actions
  • Bug still occurs with only 1 concurrent agent (still multiagent, but only one at any given time)
  • Bug still occurs with a builtin model
  • Bug occurs with either batch_mode
  • The stored PolicyEvalData appears to contain data for the last agent in the episode

Edit: just confirmed the bug on a fresh ray install. Same bug on 1.1.0 and 1.2.0.dev0

I believe the bug is in _process_observations_w_trajectory_view_api. While eval_results obtained using the trajectory view API appears to be correct, to_eval only returns a single PolicyEvalData named tuple regardless of the number of agents.

Edit: Did some more digging. The bug is triggered by an env reset when there is already data in the sample_collector. to_eval contains the new observations from resetting the environment, but the sample collector appears to hold other leftover (possibly stale) data. Since the loop in _process_policy_eval_results iterates over the actions in eval_results but indexes into to_eval, the length mismatch produces the IndexError shown in the traceback.
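The failure mode can be illustrated without RLlib. The sketch below is a simplified stand-in, not RLlib's actual code: EvalItem and pair_actions_with_eval_data are hypothetical names, and the action dicts are made up. It only shows how iterating over one list while indexing into a shorter one raises the same IndexError as in the traceback.

```python
from collections import namedtuple

# Hypothetical stand-in for RLlib's PolicyEvalData; only the fields used here.
EvalItem = namedtuple("EvalItem", ["env_id", "agent_id"])


def pair_actions_with_eval_data(actions, eval_data):
    """Pair each action with the env/agent it was computed for.

    Iterates over `actions` but indexes into `eval_data`, so it only works
    when both lists have the same length.
    """
    paired = []
    for i, action in enumerate(actions):
        item = eval_data[i]  # IndexError if eval_data is shorter than actions
        paired.append((item.env_id, item.agent_id, action))
    return paired


# Two actions came back from the policy, but only one eval entry was queued,
# mirroring the pdb output earlier in the thread.
actions = [{"move": 1}, {"move": 1}]
eval_data = [EvalItem(env_id=0, agent_id=198)]

try:
    pair_actions_with_eval_data(actions, eval_data)
    mismatch_detected = False
except IndexError:
    mismatch_detected = True

print("length mismatch raised IndexError:", mismatch_detected)
```

With equally long lists the same function returns one (env_id, agent_id, action) tuple per action, so the error depends only on the two structures disagreeing about how many agents are active.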

@sven1977 Do you know which data structure is correct? Is to_eval missing data from sample_collector or does sample_collector contain stale data?
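One way to narrow this down is to compare which (env_id, agent_id) pairs each structure holds at the moment of the failure. The helper below is a hypothetical debugging sketch: diff_agent_keys is not an RLlib function, the example pairs are invented, and extracting the keys from to_eval and the sample collector is left out because it depends on RLlib internals.

```python
def diff_agent_keys(to_eval_keys, collector_keys):
    """Return the keys held by only one of the two structures.

    Both arguments are iterables of (env_id, agent_id) pairs. The result is
    a tuple (only_in_collector, only_in_to_eval) of sorted lists.
    """
    to_eval_set = set(to_eval_keys)
    collector_set = set(collector_keys)
    only_in_collector = sorted(collector_set - to_eval_set)
    only_in_to_eval = sorted(to_eval_set - collector_set)
    return only_in_collector, only_in_to_eval


# Invented example data: the collector still tracks an agent that to_eval
# no longer lists after the reset.
to_eval_keys = [(0, 198)]
collector_keys = [(0, 197), (0, 198)]

only_in_collector, only_in_to_eval = diff_agent_keys(to_eval_keys, collector_keys)
print("only in collector:", only_in_collector)
print("only in to_eval:", only_in_to_eval)
```

If the pairs found only in the collector belong to agents from the previous episode, that would point to stale collector data rather than to_eval dropping entries.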

Found it!

"all our multi-agent tests are passing, even those where the agents don’t step at the same time"

RLlib doesn’t have any tests for variable-agent envs, where agents are periodically added and removed mid-episode. This is the default setting for artificial-life inspired work. Repro below:

import gym
import random
import unittest

import ray

from ray.tune.registry import register_env
from ray.rllib.agents.pg import PGTrainer
from ray.rllib.examples.env.multi_agent import BasicMultiAgent

from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.rllib.examples.env.mock_env import MockEnv

class BasicMultiAgent(MultiAgentEnv):
    """Env of N independent agents, each of which exits after 25 steps."""

    def __init__(self, num):
        self.agents = {}
        self.agentID = 0
        self.dones = set()
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(2)
        self.resetted = False

    def spawn(self):
        agentID = self.agentID
        self.agents[agentID] = MockEnv(25)
        self.agentID += 1
        return agentID

    def reset(self):
        self.agents = {}
        # Start each episode with one agent so the first step has an action.
        self.spawn()
        self.resetted = True
        self.dones = set()
        obs = {}
        for i, a in self.agents.items():
            obs[i] = a.reset()

        return obs

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        # Apply the actions of the agents that acted this step.
        for i, action in action_dict.items():
            obs[i], rew[i], done[i], info[i] = self.agents[i].step(action)
            if done[i]:
                self.dones.add(i)

        # Occasionally spawn a new agent mid-episode (reuses the last
        # `action` from the loop above for its first step).
        if random.random() > 0.75:
            i = self.spawn()
            obs[i], rew[i], done[i], info[i] = self.agents[i].step(action)
            if done[i]:
                self.dones.add(i)

        # Occasionally remove an existing agent mid-episode.
        if len(self.agents) > 1 and random.random() > 0.25:
            keys = list(self.agents.keys())
            key = random.choice(keys)
            done[key] = True
            del self.agents[key]

        done["__all__"] = len(self.dones) == len(self.agents)
        return obs, rew, done, info

class TestMultiAgentEnv(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        ray.init()

    @classmethod
    def tearDownClass(cls) -> None:
        ray.shutdown()

    def test_train_multi_agent_cartpole_single_policy(self):
        n = 10
        # The registered name is arbitrary; it only has to match `env` below.
        register_env("multi_agent_cartpole",
                     lambda _: BasicMultiAgent({'num_agents': 10}))
        pg = PGTrainer(
            env="multi_agent_cartpole",
            config={
                "num_workers": 0,
                "framework": "torch",
            })
        for i in range(50):
            result = pg.train()
            print("Iteration {}, reward {}, timesteps {}".format(
                i, result["episode_reward_mean"], result["timesteps_total"]))
            if result["episode_reward_mean"] >= 50 * n:
                return
        raise Exception("failed to improve reward")

if __name__ == '__main__':
    unittest.main()

Thanks for digging into this. This bug was fixed in this PR. I also added a test case that covers this scenario using your provided Env.
