I am running a RayJob on KubeRay. I have a driver function spinning up an actor pool, but when a single actor dies, the entire Ray job is terminated. The job fails with:
```
2024-09-26 21:05:57,405 ERR cli.py:72 -- Job 'flow-engine-daily-batch-double-ratio-xc2dw' failed
2024-09-26 21:05:57,405 ERR cli.py:73 -- -------------------------------------------------------
2024-09-26 21:05:57,405 INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
the actor's node was terminated expectedly: received SIGTERM
```
A snippet of the Python stack trace:
```
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: DataFrameWorker
...
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
the actor's node was terminated expectedly: received SIGTERM
```
My questions are:
- Why does the entire Ray job die when a single actor in the actor pool dies?
- Actor checkpointing suggests manually managing the application state, but how does that work when using `map_unordered` on the actor pool? Will the output still be returned to the list? (See the rough sketch below.)
- How do I get the actor that died to “remember” the input argument that was passed to it?
- Why did the autoscaler terminate the worker node while the worker was still being utilized?
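
For the checkpointing question, this is roughly the pattern I understand from the fault-tolerance docs (the checkpoint path, the `batch_id` scheme, and the helper logic are illustrative placeholders, not my actual code). What I don't see is how this combines with `map_unordered`, since the pool, not my code, decides which batch goes to which actor:

```python
import json
import os

import ray

# Placeholder: in practice this would be an external store (e.g. S3/Redis),
# since /tmp on the old node is gone when the actor restarts elsewhere.
CHECKPOINT_PATH = "/tmp/worker_checkpoint.json"


@ray.remote(max_restarts=-1, max_task_retries=-1, num_cpus=2)
class DataFrameWorker:
    def __init__(self):
        # On (re)start, restore the set of batch ids already processed.
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH) as f:
                self.done = set(json.load(f))
        else:
            self.done = set()

    def process(self, batch_id, batch):
        if batch_id in self.done:
            return batch_id, None  # already processed before a crash
        result = sum(batch)  # placeholder for the real per-batch work
        self.done.add(batch_id)
        with open(CHECKPOINT_PATH, "w") as f:
            json.dump(sorted(self.done), f)  # persist progress
        return batch_id, result
```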
For context, my Ray application follows this setup:
- Ray job calls a Python script that runs a single main driver function
- The driver function spins up an actor pool of 250+ actors, each with 2 cores and 8 GB of RAM.
- I split a few million rows into batches and use `actor_pool.map_unordered(...)` (see the simplified driver sketch at the end of this post).
- I have 40+ Ray worker replicas configured with autoscaling:
```yaml
replicas: 40
minReplicas: 40
maxReplicas: 100
```
with resources per worker:
```yaml
limits:
  cpu: "12"
  memory: "52Gi"
requests:
  cpu: "12"
  memory: "52Gi"
```
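
To make the driver side concrete, here is a stripped-down sketch of roughly what my main function does (helper names, batch sizes, and the per-batch work are illustrative placeholders, not my real code):

```python
import ray
from ray.util import ActorPool


@ray.remote(num_cpus=2, memory=8 * 1024**3)  # 2 cores, 8 GB per actor
class DataFrameWorker:
    def process(self, batch):
        return sum(batch)  # placeholder for the real per-batch processing


def make_batches():
    # Placeholder: in reality, a few million rows split into batches.
    return [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]


def main():
    ray.init()
    workers = [DataFrameWorker.remote() for _ in range(250)]
    pool = ActorPool(workers)
    # map_unordered yields results as they finish; when one actor's node is
    # torn down, this is where the ActorDiedError surfaces and the job exits.
    results = list(
        pool.map_unordered(lambda actor, batch: actor.process.remote(batch), make_batches())
    )
    return results


if __name__ == "__main__":
    main()
```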