I launch a cluster with the autoscaler, and at some point I use helm to change max_workers, which leads to downscaling of the cluster. We observe that the nodes with the slowest heartbeat get killed, and such nodes are often in the middle of computations and hence slow to respond.
Is there a way to change this policy?
When a cluster is downscaled, can we search for idle nodes in the cluster and kill them, instead of killing Ray nodes at random?
When a cluster is downscaled, do we keep accounting information about which Ray nodes are running actors vs. tasks?
Idle nodes above the min_workers threshold are terminated after the configurable idle_timeout_minutes period.
When choosing which nodes to keep to satisfy the min_workers constraint, the most recently used nodes are prioritized.
As for nodes above the max_workers threshold, the choice of which node to terminate is effectively undefined (I just took a look at the code to confirm).
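For reference, here is a minimal sketch of the fields that govern this behavior, written as a Python dict mirroring the YAML keys in the cluster/Helm chart config (the values are purely illustrative, not recommendations):

```python
# Sketch of the downscaling-related config fields, shown as a Python dict
# that mirrors the YAML keys in the cluster / Helm chart config.
# The values below are illustrative only.
downscaling_settings = {
    "min_workers": 2,            # idle nodes are never removed below this count
    "max_workers": 10,           # hard cap; lowering this (e.g. via helm) triggers downscaling
    "idle_timeout_minutes": 5,   # idle nodes above min_workers are removed after this period
}
```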
I will open an issue to prioritize keeping the most recently used nodes in that decision.
There’s currently no accounting for the details of running tasks/actors during downscaling, other than keeping track of which node was most recently used.
Generally speaking, orderly downscaling is a difficult problem and an area for improvement.
@Dmitri Thanks for your comments and opening the issue for tracking.
I think different users may have different requirements for downscaling, whereas the current policy is based on the slowest heartbeat. For example, I can imagine users needing to terminate pods based on the host IP or hostname belonging to a particular cloud provider.
I understand that accounting for running actors vs. tasks is not yet in place, but even once that accounting exists, users may still need different downscaling policies for a running workload.
Considering the above scenarios, does it make sense to expose an interface in Ray that users can extend or implement to provide the desired downscaling policy?
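Purely as an illustration (the class, method, and field names below are hypothetical and not part of Ray's API today), such an interface might look roughly like this:

```python
# Hypothetical sketch of a pluggable downscaling policy; none of these names
# exist in Ray today, they only illustrate the shape of the proposed interface.
from abc import ABC, abstractmethod
from typing import Dict, List


class DownscalingPolicy(ABC):
    """Decides which worker nodes to terminate when the cluster shrinks."""

    @abstractmethod
    def select_nodes_to_terminate(
        self, nodes: List[Dict], num_to_remove: int
    ) -> List[str]:
        """Return the ids of the nodes that should be terminated.

        `nodes` would carry per-node details (idle time, running tasks/actors,
        host IP, hostname, ...) so that different policies can be plugged in.
        """


class IdleFirstPolicy(DownscalingPolicy):
    """Example policy: prefer nodes with no running tasks/actors, oldest idle first."""

    def select_nodes_to_terminate(self, nodes, num_to_remove):
        idle = [
            n for n in nodes
            if n["num_running_tasks"] == 0 and n["num_running_actors"] == 0
        ]
        idle.sort(key=lambda n: n["idle_seconds"], reverse=True)
        return [n["node_id"] for n in idle[:num_to_remove]]
```

Other implementations of the same interface could encode host-IP- or hostname-based selection, which would cover the cloud-provider scenario above.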
Such an interface sounds like a good idea. However, before building a general interface for this, we’d want to add functionality that addresses particular use cases.
If you have ideas for what such an interface would look like, please do add to the discussion!
As a side note, it seems that the Kubernetes project is only recently getting around to exposing knobs to control scale-down of ReplicaSets. (A side note because the Ray operator does not currently use Kubernetes controllers for scaling.)