1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.40.0
- Python version: 3.12
- OS: Ubuntu 24.04
- Cloud/Infrastructure: None
- Other libs/tools (if relevant): torch 2.5.1
3. What happened vs. what you expected:
- Expected: Training continues smoothly.
- Actual: After ~80 iterations (at first; it now happens more frequently), a worker dies with no clear explanation. The worker gets restarted, but the restart creates a fresh environment, which resets my curriculum learning to its starting level. Because every worker hits this before the curriculum can reach its top level, the curriculum never progresses. The error message on stdout is:
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff475c22055382cbb7ac5eba0401000000 Worker ID: bf83b8590b7137cc02b7e1b732bf85b679503edabcded057c42c9673 Node ID: 0822b2078c62453c436a59c76d8e343d651a9d5d996e4344386f19dd Worker IP address: 10.0.0.180 Worker port: 34309 Worker PID: 2795393 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1.
Digging deeper, /tmp/ray/session_latest/logs/worker-<id>.err contains the following:
/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
return F.linear(input, self.weight, self.bias)
/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
return F.linear(input, self.weight, self.bias)
2025-03-16 20:21:18,459 ERROR actor_manager.py:187 -- Worker exception caught during `apply()`: list index out of range
Traceback (most recent call last):
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/utils/actor_manager.py", line 183, in apply
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/execution/rollout_ops.py", line 108, in <lambda>
(lambda w: w.sample(**random_action_kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/rollout_worker.py", line 677, in sample
batches = [self.input_reader.next()]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 59, in next
batches = [self.get_data()]
^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 225, in get_data
item = next(self._env_runner)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 329, in run
outputs = self.step()
^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 355, in step
active_envs, to_eval, outputs = self._process_observations(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 622, in _process_observations
processed = policy.agent_connectors(acd_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/pipeline.py", line 41, in __call__
ret = c(ret)
^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/connector.py", line 265, in __call__
return [self.transform(d) for d in acd_list]
^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/view_requirement.py", line 117, in transform
agent_collector.add_action_reward_next_obs(d)
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 309, in add_action_reward_next_obs
sub_list.append(flattened[i])
~~~~~~~~~^^^
IndexError: list index out of range
I added some logging to the add_action_reward_next_obs() method and found that, when this happens, the flattened list has zero length, i == 0, and k == "obs". Maybe an RLlib expert can figure something out from that?
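For context, the logging I added is roughly equivalent to the sketch below. This is a simplified, stand-alone mock of the buffer-append loop shown in the traceback, not the actual RLlib source, and the function name append_flattened is mine:

```python
import logging

import tree  # dm-tree; RLlib uses it to flatten nested observation structs

logger = logging.getLogger(__name__)


def append_flattened(buffers, key, value):
    """Stand-alone mock of the buffer-append loop that raises in my traceback.

    buffers[key] is a list of per-component sub-lists kept by the agent
    collector; value is the (possibly nested) struct for that key, e.g. the
    next observation for key "obs".
    """
    flattened = tree.flatten(value)
    if len(flattened) < len(buffers[key]):
        # This is the state I observed right before the crash:
        # flattened == [], i == 0, key == "obs".
        logger.error(
            "flattened too short for key=%r: len(flattened)=%d, num sub-lists=%d",
            key, len(flattened), len(buffers[key]),
        )
    for i, sub_list in enumerate(buffers[key]):
        sub_list.append(flattened[i])  # IndexError: list index out of range
```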
I am training a custom model with a custom gymnasium environment, and this setup has been running well for the past year (the past 2-3 months on v2.40). I only started seeing this problem after some recent changes to my environment, so I'm sure those are the culprit. But the failure manifests so deep inside RLlib code that I can't trace it back to a cause. I see no obvious condition in my environment that could be the trigger, and the recent change was large enough that it cannot be backed out in pieces.
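In case it helps anyone debugging something similar: a cheap way to rule out malformed observations at the environment boundary is a validation wrapper like the sketch below (this assumes a standard single-agent gymnasium env; ObsValidationWrapper is just a name I made up). It won't catch every possible cause, but it fails loudly at reset()/step() instead of deep inside the agent collector:

```python
import gymnasium as gym


class ObsValidationWrapper(gym.Wrapper):
    """Fail fast, with context, if the env emits an out-of-space observation."""

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._check(obs, "reset")
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._check(obs, "step")
        return obs, reward, terminated, truncated, info

    def _check(self, obs, where):
        # observation_space.contains() catches wrong shapes/dtypes/values.
        if not self.observation_space.contains(obs):
            raise ValueError(
                f"{where}() returned an observation outside observation_space: {obs!r}"
            )
```

The idea would be to apply the wrapper inside the env-creator function that is registered with RLlib, so every (restarted) worker gets the check automatically.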
Thanks!