Autoscaler not removing idle workers

bradhilton · April 12, 2023, 6:33pm

How severely does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.
Medium: It contributes to significant difficulty in completing my task, but I can work around it.

My cluster is not removing idle workers. Here are two statuses a few minutes apart:

======== Autoscaler status: 2023-04-12 12:25:54.310061 ========
Node status

Healthy:
1 ray_head_default
1 ray_worker_on_demand_large
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Usage:
0.0/40.0 CPU
0.00/141.962 GiB memory
1.02/61.156 GiB object_store_memory

Demands:
(no resource demands)

======== Autoscaler status: 2023-04-12 12:30:56.685294 ========
Node status

Healthy:
1 ray_head_default
1 ray_worker_on_demand_large
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Usage:
0.0/40.0 CPU
0.00/141.962 GiB memory
1.02/61.156 GiB object_store_memory

Demands:
(no resource demands)

My config file specifies an idle timeout of 2 minutes. The cluster did seem to remove some other nodes that had been idling, but these two worker nodes are not being terminated. Any advice would be greatly appreciated!
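
For anyone debugging a similar setup, one thing worth ruling out is a per-node-type `min_workers` floor, which keeps that many nodes running even after they have been idle longer than `idle_timeout_minutes`. The sketch below assumes the cluster YAML has already been loaded into a plain dict. The `min_workers` values are hypothetical placeholders, not the poster's actual config; only the 2-minute idle timeout and the node type names come from this thread. The helper name `scale_down_blockers` is also made up for illustration.

```python
def scale_down_blockers(config):
    """List config settings that could keep idle workers alive."""
    reasons = []

    # A missing timeout means the autoscaler falls back to its own default.
    if config.get("idle_timeout_minutes") is None:
        reasons.append("idle_timeout_minutes is not set; the autoscaler default applies")

    # Each node type may pin a minimum number of workers that never scale down.
    for name, node_type in config.get("available_node_types", {}).items():
        floor = node_type.get("min_workers", 0)
        if floor > 0:
            reasons.append(f"{name}: min_workers={floor} keeps {floor} node(s) up even when idle")

    return reasons


# Hypothetical config: only idle_timeout_minutes=2 is taken from the post.
example_config = {
    "idle_timeout_minutes": 2,
    "available_node_types": {
        "ray_head_default": {},
        "ray_worker_on_demand_large": {"min_workers": 1},    # placeholder value
        "ray_worker_preemptible_small": {"min_workers": 0},  # placeholder value
    },
}

for reason in scale_down_blockers(example_config):
    print(reason)
```

If this prints nothing for the real config, `min_workers` is not what is holding the nodes and the cause lies elsewhere.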

Alex · April 12, 2023, 6:35pm

Do you mind also posting the output of ray status -v?

bradhilton · April 12, 2023, 7:03pm

The scenario is slightly different now, since I manually scaled down the workers, made some changes, and ran my workload again, but I'm running into the same issue. Here’s the first status:

======== Autoscaler status: 2023-04-12 12:54:25.528556 ========
GCS request time: 0.000689s
Node Provider non_terminated_nodes time: 0.104822s

Node status

Healthy:
1 ray_head_default
2 ray_worker_preemptible_medium
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_on_demand_large: RayletUnexpectedlyDied (ip: 10.128.0.72)
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Total Usage:
0.0/40.0 CPU
0.00/141.842 GiB memory
1.02/61.105 GiB object_store_memory

Total Demands:
(no resource demands)

Node: 10.128.0.50
Usage:
0.00/4.412 GiB memory
1.02/2.206 GiB object_store_memory

Node: 10.128.0.76
Usage:
0.0/8.0 CPU
0.00/27.436 GiB memory
0.00/11.758 GiB object_store_memory

Node: 10.128.0.77
Usage:
0.0/16.0 CPU
0.00/54.998 GiB memory
0.00/23.570 GiB object_store_memory

Node: 10.128.0.78
Usage:
0.0/16.0 CPU
0.00/54.997 GiB memory
0.00/23.570 GiB object_store_memory

And here is the second status, 5+ minutes later:

======== Autoscaler status: 2023-04-12 13:00:08.810865 ========
GCS request time: 0.000665s
Node Provider non_terminated_nodes time: 0.098682s

Node status

Healthy:
1 ray_head_default
2 ray_worker_preemptible_medium
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_on_demand_large: RayletUnexpectedlyDied (ip: 10.128.0.72)
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Total Usage:
0.0/40.0 CPU
0.00/141.842 GiB memory
1.02/61.105 GiB object_store_memory

Total Demands:
(no resource demands)

Node: 10.128.0.50
Usage:
0.00/4.412 GiB memory
1.02/2.206 GiB object_store_memory

Node: 10.128.0.76
Usage:
0.0/8.0 CPU
0.00/27.436 GiB memory
0.00/11.758 GiB object_store_memory

Node: 10.128.0.77
Usage:
0.0/16.0 CPU
0.00/54.998 GiB memory
0.00/23.570 GiB object_store_memory

Node: 10.128.0.78
Usage:
0.0/16.0 CPU
0.00/54.997 GiB memory
0.00/23.570 GiB object_store_memory

Interestingly, it seems to be holding a floor of 40 CPUs. There were other nodes that it did correctly identify as “idle” and terminate.
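
The 40-CPU figure can be cross-checked against the per-node sections of the `ray status -v` output above. The following is a minimal sketch that parses those sections with a regular expression; the parsing approach and the helper names (`parse_nodes`, `STATUS_EXCERPT`) are illustrative assumptions, not part of Ray, and the excerpt is copied from the second status in this thread.

```python
import re

# Per-node sections copied from the `ray status -v` output in this thread.
STATUS_EXCERPT = """\
Node: 10.128.0.50
Usage:
0.00/4.412 GiB memory
1.02/2.206 GiB object_store_memory

Node: 10.128.0.76
Usage:
0.0/8.0 CPU
0.00/27.436 GiB memory
0.00/11.758 GiB object_store_memory

Node: 10.128.0.77
Usage:
0.0/16.0 CPU
0.00/54.998 GiB memory
0.00/23.570 GiB object_store_memory

Node: 10.128.0.78
Usage:
0.0/16.0 CPU
0.00/54.997 GiB memory
0.00/23.570 GiB object_store_memory
"""

RESOURCE_LINE = re.compile(r"([\d.]+)/([\d.]+)\s+(?:GiB\s+)?(\w+)$")


def parse_nodes(status_text):
    """Map each node IP to {resource_name: (used, total)}."""
    nodes = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Node: "):
            current = line[len("Node: "):]
            nodes[current] = {}
            continue
        match = RESOURCE_LINE.match(line)
        if match and current is not None:
            used, total, name = match.groups()
            nodes[current][name] = (float(used), float(total))
    return nodes


nodes = parse_nodes(STATUS_EXCERPT)

# Sum the CPU capacity of every node that reports a CPU resource.
total_cpus = sum(res["CPU"][1] for res in nodes.values() if "CPU" in res)

# Nodes where any resource shows non-zero usage.
in_use = [ip for ip, res in nodes.items() if any(used > 0 for used, _ in res.values())]

print(total_cpus)  # 8 + 16 + 16
print(in_use)
```

On this excerpt the total is exactly the CPU capacity of the three remaining workers (8 + 16 + 16), and the only node with non-zero usage is 10.128.0.50, which reports no CPU resource and holds 1.02 GiB in the object store. That describes what the output shows; it does not by itself explain why the workers are kept.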
