1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.40.0
- Python version: 3.12
- OS: Ubuntu 24.04
- Cloud/Infrastructure: None
- Other libs/tools (if relevant): torch 2.5.1
3. What happened vs. what you expected:
- Expected: Training continues smoothly.
- Actual: After ~80 iterations (at first; it now happens more frequently), a worker dies with no clear explanation. The worker gets restarted, but the restart creates a fresh environment, which resets my curriculum learning to its starting level. Because every worker hits this before the curriculum can reach its top level, the curriculum never progresses. The error message on stdout is:
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff475c22055382cbb7ac5eba0401000000 Worker ID: bf83b8590b7137cc02b7e1b732bf85b679503edabcded057c42c9673 Node ID: 0822b2078c62453c436a59c76d8e343d651a9d5d996e4344386f19dd Worker IP address: 10.0.0.180 Worker port: 34309 Worker PID: 2795393 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1.
Digging deeper, /tmp/ray/session_latest/logs/worker-<id>.err contains the following:
/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
return F.linear(input, self.weight, self.bias)
/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/torch/nn/modules/linear.py:125: UserWarning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (Triggered internally at ../aten/src/ATen/cuda/CublasHandlePool.cpp:99.)
return F.linear(input, self.weight, self.bias)
2025-03-16 20:21:18,459 ERROR actor_manager.py:187 -- Worker exception caught during `apply()`: list index out of range
Traceback (most recent call last):
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/utils/actor_manager.py", line 183, in apply
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/execution/rollout_ops.py", line 108, in <lambda>
(lambda w: w.sample(**random_action_kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/rollout_worker.py", line 677, in sample
batches = [self.input_reader.next()]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 59, in next
batches = [self.get_data()]
^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/sampler.py", line 225, in get_data
item = next(self._env_runner)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 329, in run
outputs = self.step()
^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 355, in step
active_envs, to_eval, outputs = self._process_observations(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/env_runner_v2.py", line 622, in _process_observations
processed = policy.agent_connectors(acd_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/pipeline.py", line 41, in __call__
ret = c(ret)
^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/connector.py", line 265, in __call__
return [self.transform(d) for d in acd_list]
^^^^^^^^^^^^^^^^^
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/connectors/agent/view_requirement.py", line 117, in transform
agent_collector.add_action_reward_next_obs(d)
File "/home/starkj/miniconda3/envs/trader2/lib/python3.12/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 309, in add_action_reward_next_obs
sub_list.append(flattened[i])
~~~~~~~~~^^^
IndexError: list index out of range
I added some logging to the add_action_reward_next_obs() method and found that, when this happens, the flattened list has zero length, i == 0, and k == "obs". Maybe an RLlib expert can figure something out from that?
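For context, the logging I added is roughly equivalent to the sketch below. This is a simplified, stand-alone mock of the buffer-append loop shown in the traceback, not the actual RLlib source, and the function name append_flattened is mine:

```python
import logging

import tree  # dm-tree; RLlib uses it to flatten nested observation structs

logger = logging.getLogger(__name__)


def append_flattened(buffers, key, value):
    """Stand-alone mock of the buffer-append loop that raises in my traceback.

    buffers[key] is a list of per-component sub-lists kept by the agent
    collector; value is the (possibly nested) struct for that key, e.g. the
    next observation for key "obs".
    """
    flattened = tree.flatten(value)
    if len(flattened) < len(buffers[key]):
        # This is the state I observed right before the crash:
        # flattened == [], i == 0, key == "obs".
        logger.error(
            "flattened too short for key=%r: len(flattened)=%d, num sub-lists=%d",
            key, len(flattened), len(buffers[key]),
        )
    for i, sub_list in enumerate(buffers[key]):
        sub_list.append(flattened[i])  # IndexError: list index out of range
```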
I am training a custom model with a custom gymnasium environment, and this setup has been running well for the past year (the past 2-3 months on v2.40). I only started seeing this problem after some recent changes to my environment, so I'm sure those are the culprit. But the failure manifests so deep inside RLlib code that I can't trace it back to a cause. I see no obvious condition in my environment that could be the trigger, and the recent change was large enough that it cannot be backed out in pieces.
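In case it helps anyone debugging something similar: a cheap way to rule out malformed observations at the environment boundary is a validation wrapper like the sketch below (this assumes a standard single-agent gymnasium env; ObsValidationWrapper is just a name I made up). It won't catch every possible cause, but it fails loudly at reset()/step() instead of deep inside the agent collector:

```python
import gymnasium as gym


class ObsValidationWrapper(gym.Wrapper):
    """Fail fast, with context, if the env emits an out-of-space observation."""

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._check(obs, "reset")
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._check(obs, "step")
        return obs, reward, terminated, truncated, info

    def _check(self, obs, where):
        # observation_space.contains() catches wrong shapes/dtypes/values.
        if not self.observation_space.contains(obs):
            raise ValueError(
                f"{where}() returned an observation outside observation_space: {obs!r}"
            )
```

The idea would be to apply the wrapper inside the env-creator function that is registered with RLlib, so every (restarted) worker gets the check automatically.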
Thanks!