Actor dies in actor pool, causing entire RayJob to fail

I am running a RayJob on KubeRay. My driver function spins up an actor pool, but when a single actor dies, the entire Ray job is terminated.

2024-09-26 21:05:57,405	ERR cli.py:72 -- Job 'flow-engine-daily-batch-double-ratio-xc2dw' failed
2024-09-26 21:05:57,405	ERR cli.py:73 -- -------------------------------------------------------
2024-09-26 21:05:57,405	INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
	the actor's node was terminated expectedly: received SIGTERM

A snippet of the Python stack trace:

ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: DataFrameWorker
	...
The actor died because its node has died. Node Id: 699d13151ee3786638d9e6b4032998bf07c30173d64a9654537f5751
	the actor's node was terminated expectedly: received SIGTERM

My questions are:

  1. Why does the entire Ray job die when a single actor in the actor pool dies?
  2. The actor checkpointing docs suggest manually managing the application state, but how does that work when using map_unordered on the actor pool? Will the output still be returned in the result list?
  3. How do I get the actor that died to “remember” the input arg that was passed into it?
  4. Why did the autoscaler terminate the worker node when the worker was still being utilized?

For context, my Ray application follows this setup:

  1. The Ray job calls a Python script that runs a single main driver function.
  2. The driver function spins up an actor pool of around 250+ actors, each with 2 cores and 8GB RAM.
  3. I split a few million rows into batches and process them with actor_pool.map_unordered(...).
  4. I have around 40+ Ray worker replicas configured with autoscaling:
      - replicas: 40
        minReplicas: 40
        maxReplicas: 100

with resources per worker:

limits:
  cpu: "12"
  memory: "52Gi"
requests:
  cpu: "12"
  memory: "52Gi"
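
As a rough sanity check on the sizing above, here is a back-of-the-envelope sketch of how many actors fit on one worker and how many workers the pool needs. It makes several assumptions that may not hold on my cluster: the full pod limit is schedulable by Ray, "8GB" is treated as 8 GiB, exactly 250 actors are created, and the head node is ignored. The helper name actors_per_worker is only for illustration and is not a Ray API.

```python
import math

# Figures copied from the setup described in this post.
NUM_ACTORS = 250        # "around 250+ actors"; exactly 250 assumed here
ACTOR_CPUS = 2
ACTOR_MEM_GIB = 8       # assumption: "8GB" treated as 8 GiB
WORKER_CPUS = 12
WORKER_MEM_GIB = 52
MIN_REPLICAS = 40


def actors_per_worker(worker_cpus, worker_mem, actor_cpus, actor_mem):
    """Actors that fit on one worker; the tighter of CPU and memory wins."""
    return min(worker_cpus // actor_cpus, worker_mem // actor_mem)


per_worker = actors_per_worker(
    WORKER_CPUS, WORKER_MEM_GIB, ACTOR_CPUS, ACTOR_MEM_GIB
)
workers_needed = math.ceil(NUM_ACTORS / per_worker)
capacity_at_min = MIN_REPLICAS * per_worker

print(f"actors per worker:        {per_worker}")       # 6
print(f"workers needed for pool:  {workers_needed}")   # 42
print(f"capacity at minReplicas:  {capacity_at_min}")  # 240
```

Under these assumptions, 40 workers hold 240 actors, so a pool of 250 needs at least 42 workers. Some actors would then be placed on nodes above minReplicas, which may be relevant to question 4.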