Question about the "cache_stopped_nodes" option in the cluster yaml file

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I am setting up a Ray cluster on Azure.
There are two options in the yaml file that I am wondering about:
cache_stopped_nodes
and
idle_timeout_minutes

I have 2 observations:

  1. When I have cache_stopped_nodes=True and idle_timeout_minutes=1, the worker nodes are deallocated after roughly one minute of not being used. This is what I expected. The same happens with ray down cluster.yaml: everything is deallocated rather fast.
  2. When I have cache_stopped_nodes=False and idle_timeout_minutes=1, the removal of the nodes takes much longer (roughly 5-10 minutes), and I don’t see any indication in Azure that the nodes are about to be deleted. The same is true for ray down cluster.yaml: when the head is already deleted and no longer listed in Azure, I still see the workers there with Running status. They are removed after a while, though.

My question is: How are the workers actually deleted and why does it take so long for the second option?

Thank you!

@Dmitri hey, could you take a look at this?

I’m going to tag in one of the maintainers of the Azure Node Provider – cc @gramhagen .

I can also provide code references. The autoscaler interacts with cloud providers via a Python interface called NodeProvider. The relevant piece of code is the Azure implementation of that interface. The method that handles node deletion is terminate_node. I do see in that method’s code that it waits for deletion of the virtual machine: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub

I wonder if there’s any way to handle that deletion asynchronously?
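For context, the difference boils down to something like the sketch below. This is illustrative only, not the actual provider code: the names compute_client, resource_group, and vm_name are placeholders, and it assumes an SDK version where virtual_machines.delete() returns a poller object (newer Azure SDKs call the method begin_delete).

```python
# Minimal sketch of the two patterns, NOT the actual Ray Azure node provider code.
from azure.mgmt.compute import ComputeManagementClient


def terminate_vm_blocking(compute_client: ComputeManagementClient,
                          resource_group: str, vm_name: str) -> None:
    # delete() starts a long-running Azure operation and returns a poller...
    poller = compute_client.virtual_machines.delete(resource_group, vm_name)
    # ...and wait() blocks until that operation reaches a terminal state,
    # i.e. until the VM is fully deleted -- this can take several minutes.
    poller.wait()


def terminate_vm_fire_and_forget(compute_client: ComputeManagementClient,
                                 resource_group: str, vm_name: str) -> None:
    # Issuing the delete without waiting returns almost immediately;
    # Azure finishes the deletion in the background.
    compute_client.virtual_machines.delete(resource_group, vm_name)
```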

Hi there,

thanks for the answer.
When I run ray down cluster.yaml, Ray is done with it after a moment and claims that the nodes were deleted.
However, in the Azure portal they are still there. If I wait a few minutes, they are gone.
Is this just something Azure-internal? If Ray waits until the node is actually deleted, this should not happen, right?

Thanks!

I see, that is a little odd. It may be an Azure-internal issue.

I guess de-provisioning a VM is a multi-stage process. Maybe, once ray down exits, the VMs are in some sense “marked for deletion,” and it takes Azure a while to transition the status and actually remove them.
It’s not obvious to me precisely what is being waited for in the API call I linked above.

It’s possible there’s a way to avoid this by changing the implementation or using the Azure APIs differently – I’m not familiar with these APIs though.

That wait function might need a timeout input if it’s not actually waiting for deletion to occur. I was assuming it would wait indefinitely as is.
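For reference, the poller’s wait() does take an optional timeout; a small sketch of how that would look, assuming the standard LROPoller-style interface (the function and variable names here are made up):

```python
# Sketch only: shows how a timeout changes the behavior of the poller's wait().
from azure.mgmt.compute import ComputeManagementClient


def delete_vm_with_timeout(compute_client: ComputeManagementClient,
                           resource_group: str, vm_name: str,
                           timeout_s: float = 30.0) -> bool:
    poller = compute_client.virtual_machines.delete(resource_group, vm_name)
    # With a timeout, wait() returns after at most timeout_s seconds,
    # whether or not the deletion has actually finished...
    poller.wait(timeout=timeout_s)
    # ...so done() is how the caller finds out whether it completed.
    return poller.done()
```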

Anything that should be changed in that terminate_node method?

@gramhagen is this a bug? I can create a bug report on behalf of @M_S.

yeah i guess this is a bug. so your expectation is this call will block until the vm is deleted? I think that is what was intended as well. personally I’d prefer to let deletion happen in the background since it can take a few minutes, but it’s probably better to match the same behavior across clouds, and we can add support for a timeout if we need to, where it continues regardless of the current state of deletion.

@gramhagen I think the expectation is that when cache_stopped_nodes=False the node provider does not wait for VM termination. Right now the experience is that cache_stopped_nodes=True means very fast deallocation of the node (since it’s just stopping it), but cache_stopped_nodes=False waits for the node to completely terminate, which takes more time.

I gather from reading the GCP node provider that they also block until termination completes [1]. But AWS terminates [2] the instance when cache_stopped_nodes=False and doesn’t wait for the VMs to finish termination.

So the question is, which is a less surprising experience for users? It seems to me that blocking on termination is more surprising than what AWS does, but open to other opinions.

cc @Dmitri

[1] GCP node provider delete call, which blocks by default. ray/node.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
[2] AWS node provider terminate call, which doesn’t block. ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
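To make the AWS behavior in [2] concrete: with boto3, the terminate call returns as soon as the request is accepted. This is a standalone illustration with a placeholder instance ID and region, not the node provider code:

```python
# Illustration of fire-and-forget termination on AWS (placeholder instance ID).
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# terminate_instances returns once the request is accepted; the instances
# move to "shutting-down" and finish terminating in the background.
response = ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])
for inst in response["TerminatingInstances"]:
    print(inst["InstanceId"], inst["CurrentState"]["Name"])  # e.g. "shutting-down"
```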

I also think it’s more natural to return immediately – as long as the cloud provider records and makes visible the “terminating” status very soon after the API call.


So, no change needed? I don’t think we have node caching implemented though.

The change that’s needed is to remove the wait on this line: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub

i’m not even sure it’s actually waiting since there’s no value supplied to the wait method =)

Hi @gramhagen, do you have time to discuss offline? I am new to the Azure SDK, so chances are high that I am missing something. I traced the delete(...).wait() call in ray-project/ray to the Azure SDK, and it looks like by default it invokes thread.join(timeout), which the Python docs say will block if timeout is None. This means to me that the wait call will block until the delete completes, which causes the surprising behavior that is the source of this Discuss question.

thread.join call: azure-sdk-for-python/_poller.py at main · Azure/azure-sdk-for-python · GitHub

Instantiation of LROPoller object: azure-sdk-for-python/_virtual_machines_operations.py at main · Azure/azure-sdk-for-python · GitHub
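In other words, the chain above reduces to roughly this. It is a heavily simplified sketch of my understanding of the SDK’s poller, not its actual code; SketchPoller and polling_loop are made-up names:

```python
# Heavily simplified sketch of the SDK's poller, just to show why wait()
# blocks indefinitely by default. Not the real azure-core implementation.
import threading


class SketchPoller:
    """Stand-in for the SDK's LROPoller."""

    def __init__(self, polling_loop):
        # The SDK polls the long-running delete operation on a background thread.
        self._thread = threading.Thread(target=polling_loop)
        self._thread.start()

    def wait(self, timeout=None):
        # wait() is essentially thread.join(timeout). Ray calls it with no
        # argument, so timeout is None and join() blocks until the polling
        # thread sees the operation finish, i.e. until Azure reports the VM
        # as fully deleted.
        self._thread.join(timeout)
```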

no, this sounds right to me, i’m not super familiar with the azure sdk either and was just interpreting from the docs. I agree the right change is just to drop the wait after deallocation.

Ok, I created and tagged you in the GitHub bug report [Azure cluster launcher] Deletion takes ~10 minutes without stopped node caching, fast with stopped node caching · Issue #25971 · ray-project/ray · GitHub.

I will mark this thread as resolved.