Autoscaler node termination behavior when scaled down with helm


I launch a cluster with autoscaler and at some point, I use helm to change max_workers which leads to downscaling of the cluster. In observation we observe that the nodes with the slowest heartbeat get killed, such nodes are often doing computations hence slow to respond.

  1. is there a way where we can change such a policy?
  2. when a cluster is downscaled, can we search for idle nodes in the cluster and kill them instead of randomly killing ray nodes when downscaled?
  3. when a cluster is downscaled, do we keep accounting information of ray nodes running actors vs tasks?

Idle nodes above the min_workers threshold are terminated after the configurable idle timeout minutes period.

When choosing which nodes to keep to satisfy the min_workers constraint, most recently used nodes are prioritized.

As for nodes above the max_workers timeout, the choice of which node to terminate is effectively undefined (just took a look at code to confirm).
I will open an issue to prioritize keep the most recently used in that decision.

There’s currently no accounting for details of running task/actors while downscaling, other than keeping track of which node was most recently used.

Generally speaking, orderly downscaling is a difficult problem and an area for improvement.

Issue to track current problem: [autoscaler] Prioritize keeping recently used when downscaling due to max_workers · Issue #17248 · ray-project/ray · GitHub

Related: [autoscaler][core] Safe node termination · Issue #16975 · ray-project/ray · GitHub

@Dmitri Thanks for your comments and opening the issue for tracking.

  • I think different users may have different requirements for downscaling while the current policy is the slowest heartbeat, I can think of users needing to terminate pods based on host IP or hostname that belong to a cloud provider.
  • I understand that accounting for running actors vs tasks is not yet in place but even when accounting is done users may also need the different policies to do downscaling for a running workload.

Considering the above scenarios, does it make sense to expose an interface in ray where users based on the requirement can extend or implement the interface that provides the desired policy?

Such an interface sounds like a good idea. However, before building a general interface for this, we’d want to add functionality that addresses particular use cases.
If you have ideas for what such an interface would look, please do add to the discussion!

As a side, it seems that the Kubernetes project is only recently getting around to exposing knobs to control scale down of ReplicaSets. (Side because the Ray operator does not currently use k8s controllers for scaling.)