Actor dies in actor pool, causing entire RayJob to fail

I am running a RayJob on KubeRay. I have a driver function spinning up an actor pool, but when a single actor dies, the entire Ray job is terminated.

2024-09-26 21:05:57,405	ERR cli.py:72 -- Job 'flow-engine-daily-batch-double-ratio-xc2dw' failed
2024-09-26 21:05:57,405	ERR cli.py:73 -- -------------------------------------------------------
2024-09-26 21:05:57,405	INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
	the actor's node was terminated expectedly: received SIGTERM

A snippet of the Python stack trace:

ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: DataFrameWorker
	...
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
	the actor's node was terminated expectedly: received SIGTERM

My questions are:

  1. Why does the entire Ray job die when a single actor in the actor pool dies?
  2. The actor checkpointing docs suggest manually managing application state, but how does that work when using map_unordered on the actor pool? Will the output still be returned to the result list? (See the sketch after this list.)
  3. How do I get the actor that died to "remember" the input argument that was passed to it?
  4. Why did the autoscaler terminate the worker node while the worker was still in use?
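To make questions 2 and 3 concrete, here is the kind of change I'm considering, based on Ray's actor fault-tolerance options (max_restarts / max_task_retries). This is only a minimal sketch; the actor body is a stand-in, not my real DataFrameWorker:

import ray

ray.init()

# Sketch: make the actor restartable so Ray can retry the task that was in
# flight when the node received SIGTERM, instead of surfacing ActorDiedError
# and failing the whole job.
@ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
class DataFrameWorker:
    def __init__(self):
        # __init__ runs again after every restart; anything not rebuilt here
        # (for example the batch that was being processed) is lost, which is
        # what question 3 is about.
        self.batches_processed = 0

    def process_batch(self, batch):
        # Keeping this a pure function of `batch` would make a Ray-level retry
        # with the same input safe, without manual checkpointing.
        self.batches_processed += 1
        return sum(batch)

worker = DataFrameWorker.remote()
print(ray.get(worker.process_batch.remote([1, 2, 3])))  # -> 6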

For context, my Ray application follows this setup:

  1. Ray job calls a Python script that runs a single main driver function
  2. The driver function spins up an actor pool of 250+ actors, each with 2 CPUs and 8 GB of RAM.
  3. I split a few million rows into batches and run them through actor_pool.map_unordered(...) (sketched below).
  4. I have 40+ Ray worker replicas configured with autoscaling:
      - replicas: 40
        minReplicas: 40
        maxReplicas: 100

with resources per worker:

limits:
  cpu: "12"
  memory: "52Gi"
requests:
  cpu: "12"
  memory: "52Gi"
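In code, the driver is roughly shaped like this (heavily simplified; apart from DataFrameWorker and map_unordered, the names are placeholders):

import ray
from ray.util import ActorPool

ray.init()

@ray.remote(num_cpus=2, memory=8 * 1024**3)  # 2 cores / 8 GB per actor, as above
class DataFrameWorker:
    def process_batch(self, batch):
        # Stand-in for the real per-batch dataframe work.
        return len(batch)

# ~250 actors in the real job; a handful here to keep the sketch small.
pool = ActorPool([DataFrameWorker.remote() for _ in range(4)])

rows = list(range(1_000))  # stands in for the few million rows
batches = [rows[i:i + 100] for i in range(0, len(rows), 100)]

# map_unordered yields results as soon as any actor finishes a batch. When one
# actor dies, as in the error above, the ActorDiedError propagates out of this
# loop, the driver exits, and the RayJob fails.
for result in pool.map_unordered(lambda a, b: a.process_batch.remote(b), batches):
    pass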

I seem to be facing a similar issue, with the added complexity that the actors are spawned by Ray Data via a .map call. In addition, I see these logs:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb983dca24a07156b39f4a91907000000 Worker ID: 757b37864bdbec611b13c712fd9971fc862319e7ebe24c17c10875a6 Node ID: 64457bb863c0c12f43188fa20ec9da259630d3a12030bf48335ceda2 Worker IP address: 10.244.13.7 Worker port: 10003 Worker PID: 192 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1826, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1860, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 963, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ReferenceCountingAssertionError: Failed to retrieve object 00062df4ac02846c23278872fba5aac4442eb26c0700000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object has already been deleted by the reference counting protocol. This should not happen.

Using the suggested environment variable, I'm now investigating where the missing ObjectRef was created.
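For reference, this is where I'm setting it on the driver side. For the worker pods started by KubeRay, my assumption (not stated in the error message) is that the variable also needs to be present in the container env of the RayCluster spec so the raylets pick it up:

import os

# Set before importing/initializing Ray so the driver process picks it up,
# since the error message asks for it to be set during `ray.init()`.
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray

ray.init()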