I am running Ray on EC2. I have spun up a cluster and submitted a very simple (but large) test job. The cluster scales up as desired, to a few hundred CPUs across a few dozen nodes, and the job completes successfully. The problem is that the cluster doesn't seem to be scaling down.
I set `idle_timeout_minutes: 1` in the config (this was the default; see the excerpt after the questions below), so I would expect these nodes to die after a minute or two, but they are still hanging around, and thus costing money, after 30 minutes or more. My main questions are:
1. What precisely counts as "idle" for the purpose of terminating a node?
2. How can I ensure that my nodes meet that definition once the cluster is done with my job?
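For reference, the relevant part of my cluster YAML looks roughly like this (trimmed to the pieces that matter; the cluster name, region, and worker count are placeholders, not my real values):

```yaml
cluster_name: test-cluster     # placeholder

# Terminate a worker after it has been idle this many minutes.
# 1 is the default, and what I have set.
idle_timeout_minutes: 1

max_workers: 50                # placeholder; enough for the test job

provider:
  type: aws
  region: us-east-1            # placeholder
```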
Hi Huaiwei,
I don't know what changed, but they seem to be scaling down OK now. Thank you for looking into it, and I apologize for the noise.
Alas, the problem is back… here is the output of `ray status -v` so you can see exactly what's happening. These nodes are still alive in EC2 despite the job having finished almost 20 minutes ago.
I'm still having this issue, and it's a huge blocker. Could it be coming from tasks that failed with an unhandled exception?
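To make that concrete, the failure pattern in my job looks roughly like this (a simplified sketch, not my actual code):

```python
import ray
from ray.exceptions import RayTaskError

ray.init(address="auto")

@ray.remote
def work(i):
    # A fraction of tasks fail with an unhandled exception,
    # roughly like this.
    if i % 10 == 0:
        raise ValueError(f"task {i} failed")
    return i * i

# Fire off a large batch of tasks and collect results,
# swallowing the failures in the driver.
refs = [work.remote(i) for i in range(100_000)]
results = []
for ref in refs:
    try:
        results.append(ray.get(ref))
    except RayTaskError:
        pass  # failed tasks are simply ignored
```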
There are no jobs running, no tasks running, and no client connections open, yet my workers are still somehow using about half of their CPUs. How can I avoid this, or forcibly stop the workers without tearing down the whole cluster?
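In the meantime, is the intended escape hatch something like `ray down cluster.yaml --workers-only --yes`? If I'm reading the CLI help right, that should terminate just the worker nodes and leave the head node up, but I'd rather understand why the autoscaler isn't doing this on its own.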