1. Severity of the issue: (select one)
[x] High: Completely blocks me.
2. Environment:
- Ray version: 2.43 (also reproduced with 2.49.1)
- Python version: 3.9.21
- OS: macOS Sequoia 15.6.1
3. What happened vs. what you expected:
- Expected:
I expected to train and evaluate a multi-agent PPO algorithm as I have in older versions. Training works, but flattening the observations fails during evaluation. I can get the code below to run by removing evaluation from the config or by turning it into a single-agent problem (see the workaround sketch after the repro code), but I haven't been able to get multi-agent PPO through evaluation. I don't know whether I'm missing a config setting or whether there's an underlying problem with how the flattening is performed in evaluation versus training.
- Actual:
The following error occurs during evaluation of the multi-agent PPO algorithm:
2025-09-17 14:09:30,665 ERROR tune_controller.py:1331 -- Trial task failed for trial PPO_MultiEnv_836b2_00000
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/lib/python3.9/site-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/lib/python3.9/site-packages/ray/_private/worker.py", line 968, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::PPO.train() (pid=366, ip=127.0.0.1, actor_id=83f289b039d6612d6278bfb501000000, repr=PPO(env=<class 'ray.rllib.env.multi_agent_env.MultiEnv'>; env-runners=2; learners=0; multi-agent=True))
  File "/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1044, in step
    eval_results = self._run_one_evaluation(parallel_train_future=None)
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 3531, in _run_one_evaluation
    eval_results = self.evaluate(
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1269, in evaluate
    ) = self._evaluate_on_local_env_runner(self.eval_env_runner)
  File "/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 1413, in _evaluate_on_local_env_runner
    episodes = env_runner.sample(
  File "/lib/python3.9/site-packages/ray/rllib/env/multi_agent_env_runner.py", line 227, in sample
    samples = self._sample(
  File "/lib/python3.9/site-packages/ray/rllib/env/multi_agent_env_runner.py", line 487, in _sample
    self._cached_to_module = self._env_to_module(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/env_to_module/env_to_module_pipeline.py", line 38, in __call__
    ret = super().__call__(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/connector_pipeline_v2.py", line 123, in __call__
    batch = connector(
  File "/lib/python3.9/site-packages/ray/rllib/connectors/env_to_module/flatten_observations.py", line 185, in __call__
    flattened_obs = flatten_inputs_to_1d_tensor(
  File "/lib/python3.9/site-packages/ray/rllib/utils/numpy.py", line 293, in flatten_inputs_to_1d_tensor
    out.append(one_hot(input_, depth=space.n).astype(np.float32))
  File "/lib/python3.9/site-packages/ray/rllib/utils/numpy.py", line 526, in one_hot
    out[tuple(indices)] = on_value
IndexError: arrays used as indices must be of integer (or boolean) type
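For context, the final numpy error is generic: fancy indexing with a float-valued array always fails with this message. A minimal standalone sketch (plain numpy, not RLlib code; the shapes and values are illustrative assumptions) that reproduces it:

import numpy as np

# one_hot-style write: set an on-value at one column index per row.
out = np.zeros((2, 5), dtype=np.float32)
rows = np.arange(2)
cols = np.array([1.0, 3.0])  # float dtype, like a Discrete obs stored as float32

out[(rows, cols)] = 1.0
# IndexError: arrays used as indices must be of integer (or boolean) type

This suggests that during evaluation the Discrete part of the dict observation reaches one_hot as floats, whereas during training it arrives as integers.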
Here is the minimal code to replicate the error:
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.connectors.env_to_module import FlattenObservations
from ray.rllib.examples.envs.classes.multi_agent import (
    MultiAgentCartPoleWithDictObservationSpace,
)
from ray.rllib.policy.policy import PolicySpec

env = MultiAgentCartPoleWithDictObservationSpace

config = (
    PPOConfig()
    .environment(
        env=env,
        env_config={"num_agents": 3},
    )
    .env_runners(
        # Flatten each agent's dict observation before it reaches the module.
        env_to_module_connector=lambda env, spaces, device: FlattenObservations(
            multi_agent=True
        ),
    )
    .multi_agent(
        policies={
            "even_cart": PolicySpec(config={"gamma": 0.9}),
            "odd_cart": PolicySpec(config={"gamma": 0.1}),
        },
        # Even-numbered agents use one policy, odd-numbered agents the other.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "even_cart" if agent_id % 2 == 0 else "odd_cart"
        ),
    )
    .evaluation(
        evaluation_interval=5,  # this is the problem
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(
        stop={"time_total_s": 60.0},
    ),
)
results = tuner.fit()
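For completeness, here is the first workaround mentioned under "Expected": the identical config with the .evaluation(...) block removed trains without error. A sketch, reusing the imports and env from the repro above:

# Same config as above, minus evaluation; training runs cleanly with this.
config_no_eval = (
    PPOConfig()
    .environment(env=env, env_config={"num_agents": 3})
    .env_runners(
        env_to_module_connector=lambda env, spaces, device: FlattenObservations(
            multi_agent=True
        ),
    )
    .multi_agent(
        policies={
            "even_cart": PolicySpec(config={"gamma": 0.9}),
            "odd_cart": PolicySpec(config={"gamma": 0.1}),
        },
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "even_cart" if agent_id % 2 == 0 else "odd_cart"
        ),
    )
    # No .evaluation(...) here -> the flattening IndexError does not occur.
)

tune.Tuner(
    "PPO",
    param_space=config_no_eval,
    run_config=train.RunConfig(stop={"time_total_s": 60.0}),
).fit()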