IndexError in agent_collector

1. Severity of the issue: (select one)

  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.40.0
  • Python version: 3.12
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: None
  • Other libs/tools (if relevant): torch 2.5.1

3. What happened vs. what you expected:

  • Expected: Continue training smoothly
  • Actual: After ~80 iterations (at first; later it happens more frequently), a worker dies with no clear explanation. The worker gets restarted, but the restart creates a fresh environment, which resets my curriculum learning to the beginning; because every worker eventually hits this before the curriculum can reach its top level, the curriculum never progresses. The error message on stdout is:

A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff475c22055382cbb7ac5eba0401000000 Worker ID: bf83b8590b7137cc02b7e1b732bf85b679503edabcded057c42c9673 Node ID: 0822b2078c62453c436a59c76d8e343d651a9d5d996e4344386f19dd Worker IP address: 10.0.0.180 Worker port: 34309 Worker PID: 2795393 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1.

Digging deeper, /tmp/ray/session_latest/logs/worker-<id>.err contains the following:

/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
  return F.linear(input, self.weight, self.bias)
/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
  return F.linear(input, self.weight, self.bias)
2025-03-16 20:21:18,459 ERROR actor_manager.py:187 -- Worker exception caught during `apply()`: list index out of range
Traceback (most recent call last):
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/utils/actor_manager.py", line 183, in apply
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/execution/rollout_ops.py", line 108, in <lambda>
    (lambda w: w.sample(**random_action_kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/rollout_worker.py", line 677, in sample
    batches = [self.input_reader.next()]
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 59, in next
    batches = [self.get_data()]
               ^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 225, in get_data
    item = next(self._env_runner)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 329, in run
    outputs = self.step()
              ^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 355, in step
    active_envs, to_eval, outputs = self._process_observations(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 622, in _process_observations
    processed = policy.agent_connectors(acd_list)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/pipeline.py", line 41, in __call__
    ret = c(ret)
          ^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/connector.py", line 265, in __call__
    return [self.transform(d) for d in acd_list]
            ^^^^^^^^^^^^^^^^^
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/view_requirement.py", line 117, in transform
    agent_collector.add_action_reward_next_obs(d)
  File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 309, in add_action_reward_next_obs
    sub_list.append(flattened[i])
                    ~~~~~~~~~^^^
IndexError: list index out of range

I added some logging to the add_action_reward_next_obs() method and found that when this happens, the flattened list has zero length, i = 0, and k = "obs". Maybe an RLlib expert can figure something out from that?
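Concretely, it boils down to something like this, right above the append that appears in the traceback (only a sketch reconstructed from the traceback and my log output; the surrounding RLlib code is omitted, so k, i, flattened, and sub_list are names from inside that method, not standalone code):

# Inside AgentCollector.add_action_reward_next_obs(), around the line in the
# traceback (agent_collector.py:309): temporary diagnostic print.
if i >= len(flattened):
    print(f"k={k!r}, i={i}, len(flattened)={len(flattened)}")
sub_list.append(flattened[i])  # raises IndexError when flattened is empty and i == 0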

I am training a custom model with a custom gymnasium environment; this setup has run well for the past year (the last 2-3 months on v2.40). I started seeing the problem after a recent set of environment changes, so I’m sure those are the culprit. But the failure manifests so deep inside RLlib code that I can’t trace it back to its cause. I see nothing obvious in my environment that could be the trigger, and the recent change was too large to back out in pieces.

Thanks!

Hi @starkj

I would probably start by printing acd_list here, right before it is handed to the agent connectors (the policy.agent_connectors(acd_list) call in env_runner_v2.py that shows up in your traceback).
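Just as a rough sketch of what I mean (acd_list and policy only exist inside RLlib's _process_observations(), so this is a fragment to drop temporarily into that method, not standalone code):

# In ray/rllib/evaluation/env_runner_v2.py, _process_observations(),
# immediately before the call shown in your traceback.
for acd in acd_list:
    print("acd_list item:", acd)  # inspect what is fed to the agent connectors
processed = policy.agent_connectors(acd_list)

That should show whether an empty observation is already coming out of the env before the connectors run.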

Thank you for the quick response, @mannyv. I was able to track down my problem. There was a rare situation that produced an empty obs vector, which caused RLlib to barf. Generating a dummy vector full of zeros and marking the episode terminated took care of it.
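In case it helps anyone who lands here later, the guard amounts to something like the following (shown as a gymnasium wrapper purely for illustration; in my code the check lives inside the env's step(), the wrapper name is made up, and it assumes a flat Box observation space):

import gymnasium as gym
import numpy as np


class EmptyObsGuard(gym.Wrapper):
    """Substitute a zero observation and end the episode whenever the
    wrapped env returns an empty obs (the rare case that tripped RLlib)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if np.asarray(obs).size == 0:
            # Dummy vector of zeros with the declared shape/dtype, and
            # terminate so the agent collector never sees the empty obs.
            obs = np.zeros(self.observation_space.shape,
                           dtype=self.observation_space.dtype)
            terminated = True
        return obs, reward, terminated, truncated, info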