Autoscaler not removing idle workers

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

My cluster is not removing idle workers. Here are two statuses a few minutes apart:

======== Autoscaler status: 2023-04-12 12:25:54.310061 ========
Node status

Healthy:
1 ray_head_default
1 ray_worker_on_demand_large
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Usage:
0.0/40.0 CPU
0.00/141.962 GiB memory
1.02/61.156 GiB object_store_memory

Demands:
(no resource demands)

======== Autoscaler status: 2023-04-12 12:30:56.685294 ========
Node status

Healthy:
1 ray_head_default
1 ray_worker_on_demand_large
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Usage:
0.0/40.0 CPU
0.00/141.962 GiB memory
1.02/61.156 GiB object_store_memory

Demands:
(no resource demands)

My config file specifies an idle timeout of 2 minutes. The cluster did remove some other nodes that had been idle, but these two worker nodes are not being terminated. Any advice would be greatly appreciated!
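
For reference, the idle timeout and the worker node types are defined in my cluster YAML roughly along these lines (a minimal sketch only; the node type names match the status output above, but the min_workers/max_workers values and other fields are illustrative, not my exact config):

# Minimal sketch of the autoscaler-relevant settings (values illustrative).
cluster_name: my-cluster

# Minutes a worker node must sit idle (no active tasks, actors, or
# objects in use) before the autoscaler may terminate it.
idle_timeout_minutes: 2

head_node_type: ray_head_default

available_node_types:
  ray_head_default:
    resources: {}
    node_config: {}
  ray_worker_on_demand_large:
    min_workers: 0        # no minimum that would keep this type alive
    max_workers: 2
    resources: {}
    node_config: {}
  ray_worker_preemptible_small:
    min_workers: 0
    max_workers: 2
    resources: {}
    node_config: {}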

Do you mind also posting the output of ray status -v?

This is a slightly different scenario now, since I manually scaled down the workers, made some changes, and ran my workload again, but I'm running into the same issue. Here's the first status:

======== Autoscaler status: 2023-04-12 12:54:25.528556 ========
GCS request time: 0.000689s
Node Provider non_terminated_nodes time: 0.104822s

Node status

Healthy:
1 ray_head_default
2 ray_worker_preemptible_medium
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_on_demand_large: RayletUnexpectedlyDied (ip: 10.128.0.72)
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Total Usage:
0.0/40.0 CPU
0.00/141.842 GiB memory
1.02/61.105 GiB object_store_memory

Total Demands:
(no resource demands)

Node: 10.128.0.50
Usage:
0.00/4.412 GiB memory
1.02/2.206 GiB object_store_memory

Node: 10.128.0.76
Usage:
0.0/8.0 CPU
0.00/27.436 GiB memory
0.00/11.758 GiB object_store_memory

Node: 10.128.0.77
Usage:
0.0/16.0 CPU
0.00/54.998 GiB memory
0.00/23.570 GiB object_store_memory

Node: 10.128.0.78
Usage:
0.0/16.0 CPU
0.00/54.997 GiB memory
0.00/23.570 GiB object_store_memory

And here is the second status, 5+ minutes later:

======== Autoscaler status: 2023-04-12 13:00:08.810865 ========
GCS request time: 0.000665s
Node Provider non_terminated_nodes time: 0.098682s

Node status

Healthy:
1 ray_head_default
2 ray_worker_preemptible_medium
1 ray_worker_preemptible_small
Pending:
(no pending nodes)
Recent failures:
ray_worker_on_demand_large: RayletUnexpectedlyDied (ip: 10.128.0.72)
ray_worker_preemptible_small: RayletUnexpectedlyDied (ip: 10.128.0.36)

Resources

Total Usage:
0.0/40.0 CPU
0.00/141.842 GiB memory
1.02/61.105 GiB object_store_memory

Total Demands:
(no resource demands)

Node: 10.128.0.50
Usage:
0.00/4.412 GiB memory
1.02/2.206 GiB object_store_memory

Node: 10.128.0.76
Usage:
0.0/8.0 CPU
0.00/27.436 GiB memory
0.00/11.758 GiB object_store_memory

Node: 10.128.0.77
Usage:
0.0/16.0 CPU
0.00/54.998 GiB memory
0.00/23.570 GiB object_store_memory

Node: 10.128.0.78
Usage:
0.0/16.0 CPU
0.00/54.997 GiB memory
0.00/23.570 GiB object_store_memory

Interestingly, it seems to be setting a floor at 40 CPUs, which matches the three remaining worker nodes (8 + 16 + 16). There were other nodes that it did correctly identify as idle and terminate.