I am running Ray on EC2. I have spun up a cluster and submitted a very simple (but large) test job. The cluster scales up as desired, to a few hundred CPUs across a few dozen nodes, and the job completes successfully. The problem is that the cluster doesn't seem to be scaling down.
I set `idle_timeout_minutes: 1` in the config (this was the default; see the excerpt after the questions below), so I would expect these nodes to die after a minute or two, but they are still hanging around, and thus costing money, after 30 minutes or more. My main questions are:
1. What precisely counts as "idle" for the purpose of terminating a node?
2. How can I ensure that my nodes meet that definition once the cluster is done with my job?
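For reference, the relevant part of my cluster YAML looks roughly like this (trimmed to the pieces that matter; the cluster name, region, and worker count are placeholders, not my real values):

```yaml
cluster_name: test-cluster     # placeholder

# Terminate a worker after it has been idle this many minutes.
# 1 is the default, and what I have set.
idle_timeout_minutes: 1

max_workers: 50                # placeholder; enough for the test job

provider:
  type: aws
  region: us-east-1            # placeholder
```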
Hi Huaiwei,
I don't know what changed, but they seem to be scaling down OK now. Thank you for looking into it, and I apologize for the noise.
Alas, the problem is back… here is the output of `ray status -v` so you can see exactly what's happening. These nodes are still alive in EC2 despite the job having finished almost 20 minutes ago.
I'm still having this issue, and it's a huge blocker. Could it be coming from tasks that failed with an unhandled exception?
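To make that concrete, the failure pattern in my job looks roughly like this (a simplified sketch, not my actual code):

```python
import ray
from ray.exceptions import RayTaskError

ray.init(address="auto")

@ray.remote
def work(i):
    # A fraction of tasks fail with an unhandled exception,
    # roughly like this.
    if i % 10 == 0:
        raise ValueError(f"task {i} failed")
    return i * i

# Fire off a large batch of tasks and collect results,
# swallowing the failures in the driver.
refs = [work.remote(i) for i in range(100_000)]
results = []
for ref in refs:
    try:
        results.append(ray.get(ref))
    except RayTaskError:
        pass  # failed tasks are simply ignored
```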
There are no jobs running, no tasks running, and no client connections open, yet my workers are still somehow using about half of their CPUs. How can I avoid this, or forcibly stop the workers without tearing down the whole cluster?
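In the meantime, is the intended escape hatch something like `ray down cluster.yaml --workers-only --yes`? If I'm reading the CLI help right, that should terminate just the worker nodes and leave the head node up, but I'd rather understand why the autoscaler isn't doing this on its own.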