Multi-Agent PPO Evaluation Error

1. Severity of the issue: (select one)
[x] High: Completely blocks me.

2. Environment:

  • Ray version: 2.43 (although I was able to replicate it with 2.49.1 as well)
  • Python version: 3.9.21
  • OS: macOS Sequoia 15.6.1

3. What happened vs. what you expected:

  • Expected:

I expected to be able to train and evaluate a multi-agent PPO algorithm as I have in older versions. Training works, but there appears to be a problem with flattening the observations for multi-agent PPO during evaluation. I can get this code to run by removing evaluation from the config or by turning it into a single-agent problem, but I haven’t been able to get multi-agent PPO to evaluate. I don’t know whether there’s a config setting I’m missing or whether there’s an underlying difference in how the flattening is performed in evaluation versus training.

  • Actual:

The following error occurs during evaluation of the multi-agent PPO algorithm:

2025-09-17 14:09:30,665	ERROR tune_controller.py:1331 -- Trial task failed for trial PPO_MultiEnv_836b2_00000
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/lib/python3.9/site-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/lib/python3.9/site-packages/ray/_private/worker.py", line 968, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::PPO.train() (pid=366, ip=127.0.0.1, actor_id=83f289b039d6612d6278bfb501000000, repr=PPO(env=<class 'ray.rllib.env.multi_agent_env.MultiEnv'>; env-runners=2; learners=0; multi-agent=True))
  File "/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1044, in step
    eval_results = self._run_one_evaluation(parallel_train_future=None)
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 3531, in _run_one_evaluation
    eval_results = self.evaluate(
  File "lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1269, in evaluate
    ) = self._evaluate_on_local_env_runner(self.eval_env_runner)
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1413, in _evaluate_on_local_env_runner
    episodes = env_runner.sample(
  File "/lib/python3.9/site-packages/ray/rllib/env/multi_agent_env_runner.py", line 227, in sample
    samples = self._sample(
  File "/lib/python3.9/site-packages/ray/rllib/env/multi_agent_env_runner.py", line 487, in _sample
    self._cached_to_module = self._env_to_module(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/env_to_module/env_to_module_pipeline.py", line 38, in __call__
    ret = super().__call__(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/connector_pipeline_v2.py", line 123, in __call__
    batch = connector(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/env_to_module/flatten_observations.py", line 185, in __call__
    flattened_obs = flatten_inputs_to_1d_tensor(
  File "/lib/python3.9/site-packages/ray/rllib/utils/numpy.py", line 293, in flatten_inputs_to_1d_tensor
    out.append(one_hot(input_, depth=space.n).astype(np.float32))
  File "/lib/python3.9/site-packages/ray/rllib/utils/numpy.py", line 526, in one_hot
    out[tuple(indices)] = on_value
IndexError: arrays used as indices must be of integer (or boolean) type

Here is the minimal code to replicate the error:

from ray import train, tune
from ray.rllib.policy.policy import PolicySpec
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.connectors.env_to_module import FlattenObservations
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPoleWithDictObservationSpace

env = MultiAgentCartPoleWithDictObservationSpace

config = (
    PPOConfig()
    .environment(
        env = env, 
        env_config = {"num_agents": 3}
    )
    .env_runners(
        env_to_module_connector = lambda env, spaces, device: FlattenObservations(multi_agent = True),
    )
    .multi_agent(
        policies = {
            "even_cart": PolicySpec(policy_class = None, observation_space = None, action_space = None, config = {"gamma": 0.9}),
            "odd_cart": PolicySpec(policy_class = None, observation_space = None, action_space = None, config = {"gamma": 0.1})
        },
        policy_mapping_fn = lambda agent_id, episode, **kwargs: "even_cart" if agent_id % 2 == 0 else "odd_cart"
    )
    .evaluation(
        evaluation_interval = 5 # this is the problem
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space = config,
    run_config = train.RunConfig(
        stop = {"time_total_s": 60.0},
    ),
)

results = tuner.fit()

I didn’t copy and paste the env_to_module_connector correctly; to replicate the error it should be

env_to_module_connector = lambda env: FlattenObservations(multi_agent = True)

(this doesn’t solve anything, it just reproduces the problem I describe above)
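(Side note, and just a sketch on my part: since I’m not sure exactly which arguments each Ray version passes to this factory, a catch-all lambda sidesteps the signature mismatch entirely; it doesn’t change the reported error either way.)

# Accepts whatever arguments the running Ray version passes to the factory.
env_to_module_connector = lambda *args, **kwargs: FlattenObservations(multi_agent = True)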

Not entirely sure what’s going wrong here, but I’m using MAPPO with evaluation without hitting this error, and, because I’m using action masking, my base observation space is a dictionary just like yours. It could be that I have a custom eval function and that’s what makes the difference. You could try stubbing out your evaluation with a custom function that runs a few episodes and reports the results, and see if that does it. The alternative is that the flattening itself is the problem, which you could test by moving the flattening into a custom encoder or an environment wrapper.
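For the wrapper route, here’s a rough sketch of what I mean (untested against your setup; it assumes the wrapped env exposes per-agent observation_spaces / action_spaces dicts like the newer MultiAgentEnv API does, and FlattenObsWrapper is just a name I made up). With something like this you wouldn’t need the FlattenObservations connector at all:

import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class FlattenObsWrapper(MultiAgentEnv):
    """Flattens each agent's (possibly Dict) observation before RLlib sees it."""

    def __init__(self, env):
        super().__init__()
        self.env = env
        self.possible_agents = env.possible_agents
        self.agents = env.agents
        self.action_spaces = env.action_spaces
        # Flatten each per-agent observation space into a 1D Box.
        self.observation_spaces = {
            aid: gym.spaces.flatten_space(space)
            for aid, space in env.observation_spaces.items()
        }

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed, options=options)
        return self._flatten(obs), infos

    def step(self, action_dict):
        obs, rewards, terminateds, truncateds, infos = self.env.step(action_dict)
        return self._flatten(obs), rewards, terminateds, truncateds, infos

    def _flatten(self, obs):
        # Flatten each agent's observation against its original (unflattened) space.
        return {
            aid: gym.spaces.flatten(self.env.observation_spaces[aid], o)
            for aid, o in obs.items()
        }

You’d register it with tune.register_env("flat_ma_cartpole", lambda cfg: FlattenObsWrapper(MultiAgentCartPoleWithDictObservationSpace(cfg))) and point the config’s env at that name.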

In any case, my code (not in its best shape right now, pending an update) is here.

Yeah, it seems like the error is some edge case: when I first tried to replicate it with the plain MultiAgentCartPole environment, I couldn’t, but then I saw MultiAgentCartPoleWithDictObservationSpace used in this example (ray/rllib/examples/connectors/flatten_observations_dict_space.py in the ray-project/ray repo) and was able to replicate the error. I definitely don’t think there’s a broad issue with all multi-agent evaluation loops. I haven’t had time to do a full debug yet, but as best I can tell the error originates at the if statement on line 436 of multi_agent_env_runner.py (ray/rllib/env/multi_agent_env_runner.py at commit 2e278ffd39401a98876b4c47d37c4d2beda75d0e in ray-project/ray). If I comment that out, everything runs without errors, but I don’t know what the broader implications of that would be, so I’m trying to figure that out as well as what makes the code reach that branch in the first place.

I did some more digging, and it turns out this isn’t specifically an evaluation problem: it shows up any time batch_mode is set to “complete_episodes” for multi-agent problems. It just happens that batch_mode = “complete_episodes” is the default for evaluation. The issue appears to be in the _sample method of multi_agent_env_runner.py, and there is already an issue filed for it: [RLlib] bug: env_to_module pipeline is run twice (on done) when "early-out" · Issue #53053 · ray-project/ray.
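Concretely, that means you can see it without evaluation in the mix at all: with the repro config above, drop the .evaluation(...) block and add this instead, and the same IndexError shows up during ordinary training sampling:

config = config.env_runners(
    batch_mode="complete_episodes",
)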

Yeah, I posted a pair of issues on the behavior of complete_episodes too. It’s currently implemented by running a full sample with a size of one episode and then repeating the sample method until the buffer is full, which feels inefficient and does cause a number of things to break, especially with respect to reporting metrics.

I’m sure there’s a reason it’s implemented the way it is, but I think there has to be a more robust solution. For now, I jury-rigged a workaround by removing complete_episodes and adding a callback that strips incomplete episodes from the front and back of a training batch. It costs one episode per learner per batch, but otherwise works great:

from ray.rllib.callbacks.callbacks import RLlibCallback
from ray.rllib.utils.annotations import override


# The class name is just what I called my callback; any RLlibCallback subclass works.
class StripIncompleteEpisodes(RLlibCallback):
    @override(RLlibCallback)
    def on_sample_end(
        self,
        *,
        env_runner=None,
        metrics_logger=None,
        samples,
        # TODO (sven): Deprecate these args.
        worker=None,
        **kwargs,
    ) -> None:
        # Only strip training samples; leave evaluation samples alone.
        if not env_runner.config.in_evaluation:
            # The first episode chunk may be the continuation of an episode
            # started in a previous sample() call -> drop it.
            if samples[0].env_t_started > 0:
                del samples[0]
            # The last episode may still be running (not terminated/truncated) -> drop it.
            if not samples[-1].is_done:
                del samples[-1]
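For completeness, the wiring is just the standard callbacks hook (using whatever you name the RLlibCallback subclass above), with batch_mode left at its default “truncate_episodes”:

config = config.callbacks(StripIncompleteEpisodes)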