Ray Cluster Not Scaling Down

Hi,

I am running Ray on EC2. I have spun up a cluster and submitted a very simple (but large) test job. The cluster scales up as desired to a few hundred CPUs across a few dozen nodes, and the job completes successfully. The problem is that the cluster doesn’t seem to be scaling back down.

I have set idle_timeout_minutes = 1 in the config (this was the default), so I would expect these nodes to be terminated after a minute or two, but they are still hanging around, and still costing money, after 30 minutes or more. I think my main questions are:

  1. What precisely defines “idle” for the purpose of terminating a node?
  2. How can I ensure that my nodes meet that definition once the cluster is done with my job?

Thanks

RE 1: see Configuring Autoscaling in the Ray 2.8.0 docs:

A node is considered idle if it has no active tasks, actors, or objects.
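
For example, an ObjectRef that is still referenced by the driver keeps its object pinned in some worker’s object store, so that worker does not count as idle. Here is a minimal sketch (the task and object sizes are made up purely for illustration, and it assumes a driver connected to the running cluster):

```python
import ray

ray.init(address="auto")  # connect to the existing cluster from the head node

@ray.remote
def produce():
    # Return a ~100 MiB payload; it lives in the object store of
    # whichever worker node executed this task.
    return bytes(100 * 1024 * 1024)

refs = [produce.remote() for _ in range(10)]
# Block until all tasks finish, without pulling the data to the driver.
ray.wait(refs, num_returns=len(refs))

# Even though no tasks are running anymore, the nodes holding these
# objects are NOT idle while `refs` is still referenced by the driver.
del refs  # drop the references so the objects can be released

# Once the objects are gone and idle_timeout_minutes passes, the
# autoscaler can remove those worker nodes.
```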

RE 2: one common reason why a node doesn’t scale down is that some objects are still in scope and kept on that node.

  1. Can you take a look at your Ray dashboard and check the object store usage of those nodes? (There is also a programmatic way to check, sketched below.)
  2. Can you also run ray status -v on the head node?
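
If the dashboard is awkward to use, the Python state API can show roughly the same information. This is only a sketch and assumes a Ray 2.x cluster with the state API available; run it from the head node:

```python
import ray
from ray.util.state import list_nodes, summarize_objects

ray.init(address="auto")

# Cluster-wide summary of objects that are still alive (and therefore
# pinning memory on some node).
print(summarize_objects())

# Per-node view: any ALIVE worker that still holds objects or has
# running tasks will not be scaled down.
for node in list_nodes():
    print(node.node_ip, node.state)
```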

Hi Huaiwei,
I don’t know what changed, but it seems they are scaling down fine now. Thank you for looking into it, and I apologize for the noise.

Alas, the problem is back… Here is the output of ray status -v so we can see exactly what’s happening. These nodes are still alive in EC2 even though the job finished almost 20 minutes ago.

Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 172.31.30.37: ray.worker.default, uninitialized
 172.31.25.130: ray.worker.default, uninitialized
 172.31.17.114: ray.worker.default, uninitialized
 172.31.18.145: ray.worker.default, uninitialized
 172.31.16.129: ray.worker.default, uninitialized
 172.31.25.95: ray.worker.default, uninitialized
 172.31.23.69: ray.worker.default, uninitialized
 172.31.22.181: ray.worker.default, uninitialized
 172.31.18.86: ray.worker.default, uninitialized
 172.31.17.45: ray.worker.default, uninitialized
 172.31.17.154: ray.worker.default, uninitialized
 172.31.31.77: ray.worker.default, uninitialized
 172.31.17.89: ray.worker.default, uninitialized
Recent failures:

and

Resources
---------------------------------------------------------------
Total Usage:
 0.0/138.0 CPU
 0.00/385.147 GiB memory
 0.00/159.228 GiB object_store_memory
Total Demands:
 (no resource demands)
Node: 172.31.24.0
 Usage:
  0.0/2.0 CPU
  0.00/4.347 GiB memory
  0.00/2.174 GiB object_store_memory
Node: 172.31.30.7
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.234 GiB object_store_memory
Node: 172.31.25.214
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.25.145
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.31.57
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.28.185
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.234 GiB object_store_memory
Node: 172.31.22.119
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.30.221
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.23.184
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.30.184
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.28.52
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.19.159
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.16.132
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.30.108
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.234 GiB object_store_memory
Node: 172.31.24.97
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.19.26
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory
Node: 172.31.21.139
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.148 GiB object_store_memory
Node: 172.31.17.61
 Usage:
  0.0/8.0 CPU
  0.00/22.400 GiB memory
  0.00/9.246 GiB object_store_memory

Maybe the head node somehow lost its connection to these workers, so they are isolated zombies? How can I prevent that?

bump — is there anything I can do to guarantee it will scale down all the way?

I’m still having this issue and it’s a huge blocker. Maybe it’s from tasks that failed with an unhandled exception?

There are no jobs running, no tasks running, and no client connections open, and yet my workers are still somehow using about half their CPUs. How can I avoid this, or forcibly stop them without tearing down the whole cluster?
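
Would explicitly clearing any outstanding autoscaler resource requests help here? Something like the sketch below is what I have in mind (my reading of the autoscaler SDK docs is that request_resources(num_cpus=0) cancels any earlier request, but I may be misreading that). And if nothing else works, is ray down --workers-only the right way to kill just the workers while keeping the head node?

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # run on the head node

# Clear any previously requested capacity so the autoscaler has no
# standing reason to keep extra workers; idle nodes should then be
# removed after idle_timeout_minutes.
request_resources(num_cpus=0)
```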

Sorry, these messages slipped past my inbox.

This is weird. The nodes seem to be in a broken state.

@adienes Can you share your log directory here?

cc: @yic to take a look from the core side

@adienes could you also check monitor.log (under /tmp/ray/session_latest/logs/ on the head node) to see what’s going on?

Looping in @Alex_Wu, who is an expert on the autoscaler.