What happened + What you expected to happen
The bug:
When the Ray cluster is idle, autoscaler v2 repeatedly terminates worker nodes, ignoring the configured minimum worker count. As a result, the number of running worker nodes drops below the minimum set in the KubeRay chart.
Example: consider a Ray cluster provisioned in Kubernetes with the KubeRay chart 1.1.0, autoscaler v2 enabled, and a minimum worker count of 3. When the cluster is idle, the autoscaler terminates workers once idleTimeoutSeconds elapses, and the active worker count briefly falls below 3 (to 2, 1, or sometimes 0).
Because of this constant terminate-and-recreate cycle, we sometimes see actors fail as follows:
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: DB
actor_id: c78dd85202773ed8f736b21c72000000
namespace: 81a40ef8-077d-4f65-be29-4d4077c23a85
The actor died because its node has died. Node Id: xxxxxx
the actor's node was terminated: Termination of node that's idle for 244.98 seconds.
The actor never ran - it was cancelled before it started running.
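For reference, this failure mode can be hit with a trivial actor. The sketch below is illustrative only (the DB class, the timings, and the address="auto" connection are placeholders, not our actual workload): create an actor, leave the cluster idle past idleTimeoutSeconds so the autoscaler reclaims the worker hosting it, then call it again.

```python
import time
import ray
from ray.exceptions import ActorDiedError

ray.init(address="auto")  # connect to the running cluster, e.g. from the head pod

@ray.remote
class DB:  # stand-in for the real actor; name chosen to match the error above
    def ping(self):
        return "ok"

db = DB.remote()
print(ray.get(db.ping.remote()))  # actor is alive on some worker node

# A default actor holds 0 CPU once it is running, so the node it lives on can
# still be reported idle. Stay idle longer than idleTimeoutSeconds (240s in
# our config) so the autoscaler terminates that worker.
time.sleep(300)

try:
    ray.get(db.ping.remote())
except ActorDiedError as err:
    # Matches the error we observe: the actor's node was terminated as idle.
    print(f"actor died: {err}")
```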
Expected behavior:
The autoscaler should honor the minimum worker count setting and should not terminate nodes when doing so would drop the count below the configured value. Autoscaler v1 handled this correctly; the issue appeared after upgrading to autoscaler v2.
Autoscaler logs:
2024-09-09 09:11:41,933 INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:41,946 INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198599914.
2024-09-09 09:11:41,949 INFO event_logger.py:76 -- Removing 2 nodes of type workergroup (idle).
2024-09-09 09:11:41,950 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): draining ray: idle for 242.003 secs > timeout=240.0 secs
2024-09-09 09:11:41,950 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): draining ray: idle for 242.986 secs > timeout=240.0 secs
2024-09-09 09:11:41,955 INFO ray_stopper.py:116 -- Drained ray on 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2(success=True, msg=)
2024-09-09 09:11:41,959 INFO ray_stopper.py:116 -- Drained ray on 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914(success=True, msg=)
2024-09-09 09:11:46,986 INFO config.py:182 -- Calculating hashes for file mounts and ray commands.
2024-09-09 09:11:47,015 INFO cloud_provider.py:448 -- Listing pods for RayCluster xxxxxxxtest-kuberay in namespace xxxxxxx-test at pods resource version >= 1198596159.
2024-09-09 09:11:47,028 INFO cloud_provider.py:466 -- Fetched pod data at resource version 1198600006.
2024-09-09 09:11:47,030 INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=c6d6af81-80b8-46b8-8694-24fb726c02dd, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-7kzpp, ray_id=0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2): ray node 0f379b4e4c7957d692711e1f505c25425befe61a08c54ee80fc407f2 is DEAD
2024-09-09 09:11:47,030 INFO instance_manager.py:262 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=a93be1d8-dd1e-489c-8f50-6b03087e5280, type=workergroup, cloud_instance_id=xxxxxxxtest-kuberay-worker-workergroup-xn279, ray_id=1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914): ray node 1bfae5adba89e5437380731e5d53f34b31862e9e46e554512e52e914 is DEAD
2024-09-09 09:11:47,031 INFO scheduler.py:1148 -- Adding 2 nodes to satisfy min count for node type: workergroup.
2024-09-09 09:11:47,032 INFO event_logger.py:56 -- Adding 2 node(s) of type workergroup.
Versions / Dependencies
KubeRay: v1.1.0
ray-cluster Helm chart: v1.1.0
Ray: 2.34.0
Reproduction script
You can use the base KubeRay ray-cluster Helm chart to deploy a Ray cluster with the following value overrides:
head:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Conservative
    idleTimeoutSeconds: 240
  containerEnv:
    - name: RAY_enable_autoscaler_v2
      value: "1"
worker:
  replicas: 3
  minReplicas: 3
  maxReplicas: 10
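After installing the chart with these values, the dip below minReplicas can be observed by polling the node table. This is a minimal monitoring sketch, not part of our workload; it assumes it runs from a pod that can reach the cluster (e.g. the head pod), that there is exactly one head node, and that MIN_WORKERS mirrors worker.minReplicas above.

```python
import time
import ray

MIN_WORKERS = 3  # mirrors worker.minReplicas in the values above

ray.init(address="auto")  # run from inside the cluster, e.g. the head pod

while True:
    alive = [n for n in ray.nodes() if n["Alive"]]
    workers = len(alive) - 1  # assumes exactly one head node
    stamp = time.strftime("%H:%M:%S")
    if workers < MIN_WORKERS:
        print(f"{stamp} ALERT: {workers} alive workers < minReplicas={MIN_WORKERS}")
    else:
        print(f"{stamp} ok: {workers} alive workers")
    time.sleep(5)
```

When the cluster has been idle for roughly idleTimeoutSeconds, this reports 0-2 alive workers for a short window before the autoscaler adds nodes back, matching the log excerpt above.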
Issue Severity
High: It blocks me from completing my task.