How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Scenario
- Launching a Ray Data computation with .map, using Actors
- On a KubeRay cluster
- Cluster nodes can terminate unexpectedly (with a grace period)
- The actors are started with these settings:
- max_restarts: 3
- max_task_retries: -1
For debugging, we enable the env var RAY_record_ref_creation_sites.
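A minimal sketch of the setup described above, not our actual job. The `AddOne` class, the `run` helper, the dataset size, and `concurrency=4` are hypothetical placeholders, and passing the actor settings as keyword arguments to `.map` is an assumption about how they are applied. On a real KubeRay cluster the env var would also have to be set on the worker pods; setting it in the driver as shown only covers a locally started Ray instance.

```python
import os

# Debugging aid from the scenario; it has to be set before Ray starts.
os.environ["RAY_record_ref_creation_sites"] = "1"

# Actor settings from the scenario.
ACTOR_REMOTE_ARGS = {"max_restarts": 3, "max_task_retries": -1}


class AddOne:
    """Hypothetical stateful UDF standing in for the real transform."""

    def __call__(self, row: dict) -> dict:
        row["id"] = row["id"] + 1
        return row


def run() -> int:
    # Imported lazily so this module can be loaded without Ray installed.
    import ray

    ray.init()
    # A callable class plus `concurrency` makes Ray Data run the UDF in an
    # actor pool; the extra keyword arguments are forwarded as remote args.
    ds = ray.data.range(10_000).map(AddOne, concurrency=4, **ACTOR_REMOTE_ARGS)
    return ds.materialize().count()
```

Calling `run()` while a worker node is being terminated is the situation in which we see the failure described below.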
Problem
When a cluster node (Ray worker) dies, the computation cannot recover and continue.
We see these errors:
(raylet) Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1826, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1860, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 963, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ReferenceCountingAssertionError: Failed to retrieve object 0031b89f94899135ffffffffffffffffffffffff03000000cee1f505. The ObjectRef was created at: (actor call)
/usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:_start_actor:154
/usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:scale_up:499
/usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:start:127
The object has already been deleted by the reference counting protocol. This should not happen.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2271, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2167, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1822, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1823, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 2061, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1054, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.11/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 398, in __repr__
return f"MapWorker({self.src_fn_name})"
^^^^^^^^^^^^^^^^
AttributeError: '_MapWorker' object has no attribute 'src_fn_name'
An unexpected internal error occurred while the worker was executing a task.
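One possible reading of the second traceback, which is an assumption on our side and not confirmed: the actor's constructor never completed because its dependency could not be retrieved, so `src_fn_name` was never set, and the error handler then called `__repr__` on the half-built object. The sketch below reproduces that pattern in plain Python. `FragileWorker` and `DefensiveWorker` are hypothetical stand-ins, not Ray's `_MapWorker`.

```python
class FragileWorker:
    """Hypothetical stand-in whose __repr__ assumes __init__ has run."""

    def __init__(self, src_fn_name: str):
        self.src_fn_name = src_fn_name

    def __repr__(self) -> str:
        return f"MapWorker({self.src_fn_name})"


class DefensiveWorker(FragileWorker):
    """Same class, but __repr__ tolerates a missing attribute."""

    def __repr__(self) -> str:
        name = getattr(self, "src_fn_name", "<uninitialized>")
        return f"MapWorker({name})"


# Simulate an instance whose __init__ never ran.
half_built = FragileWorker.__new__(FragileWorker)
try:
    repr(half_built)
    fragile_error = None
except AttributeError as exc:
    fragile_error = str(exc)

# The defensive variant still produces a usable string.
safe_repr = repr(DefensiveWorker.__new__(DefensiveWorker))
full_repr = repr(DefensiveWorker("my_fn"))
```

If this reading is right, the AttributeError is a secondary symptom that hides the original ReferenceCountingAssertionError, and the root cause remains the lost object after the node died.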