What happened + What you expected to happen
The bug:
When the Ray cluster is idle, autoscaler v2 repeatedly terminates worker nodes, ignoring the configured minimum worker count. As a result, the number of running worker nodes drops below the minimum set in the KubeRay chart.
Example: consider a Ray cluster provisioned in Kubernetes with the KubeRay chart 1.1.0, autoscaler v2 enabled, and a minimum worker count of 3. When the cluster is idle, the autoscaler terminates workers once idleTimeoutSeconds elapses, and the active worker count briefly falls below 3 (to 2, 1, or sometimes 0).
Because of this constant terminate-and-recreate cycle, we sometimes see actors fail as follows:
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: DB
actor_id: c78dd85202773ed8f736b21c72000000
namespace: 81a40ef8-077d-4f65-be29-4d4077c23a85
The actor died because its node has died. Node Id: xxxxxx
the actor's node was terminated: Termination of node that's idle for 244.98 seconds.
The actor never ran - it was cancelled before it started running.
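For reference, this failure mode can be hit with a trivial actor. The sketch below is illustrative only (the DB class, the timings, and the address="auto" connection are placeholders, not our actual workload): create an actor, leave the cluster idle past idleTimeoutSeconds so the autoscaler reclaims the worker hosting it, then call it again.

```python
import time
import ray
from ray.exceptions import ActorDiedError

ray.init(address="auto")  # connect to the running cluster, e.g. from the head pod

@ray.remote
class DB:  # stand-in for the real actor; name chosen to match the error above
    def ping(self):
        return "ok"

db = DB.remote()
print(ray.get(db.ping.remote()))  # actor is alive on some worker node

# A default actor holds 0 CPU once it is running, so the node it lives on can
# still be reported idle. Stay idle longer than idleTimeoutSeconds (240s in
# our config) so the autoscaler terminates that worker.
time.sleep(300)

try:
    ray.get(db.ping.remote())
except ActorDiedError as err:
    # Matches the error we observe: the actor's node was terminated as idle.
    print(f"actor died: {err}")
```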
Expected behavior:
The autoscaler should honor the minimum worker count setting and should not terminate nodes when doing so would drop the count below the configured value. Autoscaler v1 handled this correctly; the issue appeared after upgrading to autoscaler v2.
Autoscaler logs:
2024-09-09 09:11:41,933 INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:41,946 INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198599914.
2024-09-09 09:11:41,949 INFO event_logger.py:76 -- Removing 2 nodes of type workergroup (idle).
2024-09-09 09:11:41,950 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): draining ray: idle for 242.003 secs > timeout=240.0 secs
2024-09-09 09:11:41,950 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): draining ray: idle for 242.986 secs > timeout=240.0 secs
2024-09-09 09:11:41,955 INFO ray_stopper.py:116 -- Drained ray on 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2(success=True, msg=)
2024-09-09 09:11:41,959 INFO ray_stopper.py:116 -- Drained ray on 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914(success=True, msg=)
2024-09-09 09:11:46,986 INFO config.py:182 -- Calculating hashes for file mounts and ray commands.
2024-09-09 09:11:47,015 INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:47,028 INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198600006.
2024-09-09 09:11:47,030 INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): ray node 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2 is DEAD
2024-09-09 09:11:47,030 INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): ray node 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914 is DEAD
2024-09-09 09:11:47,031 INFO scheduler.py:1148 -- Adding 2 nodes to satisfy min count for node type: workergroup.
2024-09-09 09:11:47,032 INFO event_logger.py:56 -- Adding 2 node(s) of type workergroup.
Versions / Dependencies
KubeRay: v1.1.0
ray-cluster Helm chart: v1.1.0
Ray: 2.34.0
Reproduction script
You can use the base KubeRay ray-cluster Helm chart to deploy a Ray cluster with the following value overrides:
head:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Conservative
    idleTimeoutSeconds: 240
  containerEnv:
    - name: RAY_enable_autoscaler_v2
      value: "1"
worker:
  replicas: 3
  minReplicas: 3
  maxReplicas: 10
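After installing the chart with these values, the dip below minReplicas can be observed by polling the node table. This is a minimal monitoring sketch, not part of our workload; it assumes it runs from a pod that can reach the cluster (e.g. the head pod), that there is exactly one head node, and that MIN_WORKERS mirrors worker.minReplicas above.

```python
import time
import ray

MIN_WORKERS = 3  # mirrors worker.minReplicas in the values above

ray.init(address="auto")  # run from inside the cluster, e.g. the head pod

while True:
    alive = [n for n in ray.nodes() if n["Alive"]]
    workers = len(alive) - 1  # assumes exactly one head node
    stamp = time.strftime("%H:%M:%S")
    if workers < MIN_WORKERS:
        print(f"{stamp} ALERT: {workers} alive workers < minReplicas={MIN_WORKERS}")
    else:
        print(f"{stamp} ok: {workers} alive workers")
    time.sleep(5)
```

When the cluster has been idle for roughly idleTimeoutSeconds, this reports 0-2 alive workers for a short window before the autoscaler adds nodes back, matching the log excerpt above.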
Issue Severity
High: It blocks me from completing my task.