Node fault tolerance in Ray Data

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Scenario

  • Launching a Ray Data computation with .map, using Actors
  • On a KubeRay cluster
  • Cluster nodes can terminate unexpectedly (with a grace period)
  • The actors are started with these settings:
    • max_restarts: 3
    • max_task_retries: -1

For debugging, we set the environment variable RAY_record_ref_creation_sites=1, so that errors report where each ObjectRef was created.

Problem

When a cluster node (Ray worker) dies, the computation does not recover and cannot continue.
We see these errors:

(raylet) Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1826, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1860, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 963, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ReferenceCountingAssertionError: Failed to retrieve object 0031b89f94899135ffffffffffffffffffffffff03000000cee1f505. The ObjectRef was created at: (actor call) 
  /usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:_start_actor:154
  /usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:scale_up:499
  /usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py:start:127

The object has already been deleted by the reference counting protocol. This should not happen.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2271, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 2167, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 1822, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1823, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 2061, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1054, in ray._raylet.store_task_errors
  File "/usr/local/lib/python3.11/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 398, in __repr__
    return f"MapWorker({self.src_fn_name})"
                        ^^^^^^^^^^^^^^^^
AttributeError: '_MapWorker' object has no attribute 'src_fn_name'
An unexpected internal error occurred while the worker was executing a task.

Solved it. The main problem had nothing to do with Ray Data. I was not keeping a reference to an object used in the Ray Data actors, so it was being garbage collected.
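
For anyone hitting the same thing, here is a minimal pure-Python sketch of the lifetime issue. It deliberately does not use Ray: it only illustrates, by analogy, how an object disappears once nothing holds a strong reference to it, which is roughly what Ray's reference counting does when no ObjectRef to an object remains in scope. The class Payload and both helper functions are hypothetical names made up for this illustration, not my actual code.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for the object the actors depend on."""

    def __init__(self, name):
        self.name = name


def create_without_holding():
    payload = Payload("shared-state")
    # Only a weak reference escapes; the strong one dies with this frame.
    return weakref.ref(payload)


def create_and_hold(holder):
    payload = Payload("shared-state")
    holder.append(payload)  # the caller-owned list keeps the object alive
    return weakref.ref(payload)


dangling = create_without_holding()
gc.collect()
print(dangling() is None)

kept_alive = []
held = create_and_hold(kept_alive)
gc.collect()
print(held() is kept_alive[0])
```

On CPython both lines print True: the first object is gone as soon as its creating function returns, while the second survives because the caller still holds it. The fix in my case was the second pattern, that is, keeping the reference alive on the driver side for as long as the actors need it.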

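However, the secondary error raised inside the error handler remains: while storing the task error, _MapWorker.__repr__ fails with an AttributeError because src_fn_name is not set.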
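
A possible reading of the traceback (an assumption on my side, not confirmed against the Ray source) is that the actor's constructor never ran because its dependency failed, so the attribute read by __repr__ was never assigned. The sketch below shows that failure mode and one defensive way to write such a __repr__. MapWorkerSketch is a hypothetical stand-in class, not Ray's _MapWorker, and this is not a proposed patch.

```python
class MapWorkerSketch:
    """Hypothetical stand-in; not Ray's actual _MapWorker."""

    def __init__(self, src_fn_name):
        self.src_fn_name = src_fn_name

    def __repr__(self):
        # getattr with a default keeps repr usable on a half-built instance,
        # so an error handler that formats the object cannot raise here.
        name = getattr(self, "src_fn_name", "<uninitialized>")
        return f"MapWorker({name})"


# Create an instance without running __init__, mimicking an actor whose
# constructor was skipped because a dependency failed.
half_built = MapWorkerSketch.__new__(MapWorkerSketch)
print(repr(half_built))               # MapWorker(<uninitialized>)
print(repr(MapWorkerSketch("my_fn")))  # MapWorker(my_fn)
```

With a plain self.src_fn_name access instead of getattr, the first repr call would raise the same kind of AttributeError shown in the log above and hide the original error.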