Question about the "cache_stopped_nodes" option in the cluster yaml file

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I am setting up a Ray cluster on Azure.
There are two options in the yaml file that I am wondering about:
cache_stopped_nodes
and
idle_timeout_minutes

I have 2 observations:

  1. When I have cache_stopped_nodes=True and idle_timeout_minutes=1, the worker nodes are deallocated after roughly one minute of not being used. This is what I expected. The same happens with ray down cluster.yaml: everything is deallocated rather fast.
  2. When I have cache_stopped_nodes=False and idle_timeout_minutes=1, the removal of the nodes takes much longer (roughly 5-10 minutes), and I don’t see any indication in Azure that the nodes are about to be deleted. The same is true for ray down cluster.yaml: when the head is already deleted and no longer listed in Azure, I still see the workers there with Running status. They are removed after a while, though.

My question is: How are the workers actually deleted and why does it take so long for the second option?

Thank you!

@Dmitri hey, could you take a look at this?

I’m going to tag in one of the maintainers of the Azure Node Provider – cc @gramhagen .

I can also provide code references. The autoscaler interacts with cloud providers via a Python interface called NodeProvider. The relevant piece of code is the Azure implementation of that interface. The method that handles node deletion is terminate_node. I do see in that method’s code that it waits for deletion of the virtual machine: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub

I wonder if there’s any way to handle that deletion asynchronously?
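For context, the difference boils down to something like the sketch below. This is illustrative only, not the actual provider code: the names compute_client, resource_group, and vm_name are placeholders, and it assumes an SDK version where virtual_machines.delete() returns a poller object (newer Azure SDKs call the method begin_delete).

```python
# Minimal sketch of the two patterns, NOT the actual Ray Azure node provider code.
from azure.mgmt.compute import ComputeManagementClient


def terminate_vm_blocking(compute_client: ComputeManagementClient,
                          resource_group: str, vm_name: str) -> None:
    # delete() starts a long-running Azure operation and returns a poller...
    poller = compute_client.virtual_machines.delete(resource_group, vm_name)
    # ...and wait() blocks until that operation reaches a terminal state,
    # i.e. until the VM is fully deleted -- this can take several minutes.
    poller.wait()


def terminate_vm_fire_and_forget(compute_client: ComputeManagementClient,
                                 resource_group: str, vm_name: str) -> None:
    # Issuing the delete without waiting returns almost immediately;
    # Azure finishes the deletion in the background.
    compute_client.virtual_machines.delete(resource_group, vm_name)
```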

Hi there,

thanks for the answer.
When I run ray down cluster.yaml, Ray is done with it after a moment and claims that the nodes were deleted.
However, in the Azure portal they are still there. If I wait a few minutes, they are gone.
Is this just something Azure-internal? If Ray waits until the node is actually deleted, this should not happen, right?

Thanks!

I see, that is a little odd. It may be an Azure-internal issue.

I guess de-provisioning a VM is a multi-stage process. Maybe, once ray down exits, the VMs are in some sense “marked for deletion,” and it takes Azure a while to transition the status and actually remove them.
It’s not obvious to me precisely what is being waited for in the API call I linked above.

It’s possible there’s a way to avoid this by changing the implementation or using the Azure APIs differently – I’m not familiar with these APIs though.

That wait function might need a timeout input if it’s not actually waiting for deletion to occur. I was assuming it would wait indefinitely as is.
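For reference, the poller’s wait() does take an optional timeout; a small sketch of how that would look, assuming the standard LROPoller-style interface (the function and variable names here are made up):

```python
# Sketch only: shows how a timeout changes the behavior of the poller's wait().
from azure.mgmt.compute import ComputeManagementClient


def delete_vm_with_timeout(compute_client: ComputeManagementClient,
                           resource_group: str, vm_name: str,
                           timeout_s: float = 30.0) -> bool:
    poller = compute_client.virtual_machines.delete(resource_group, vm_name)
    # With a timeout, wait() returns after at most timeout_s seconds,
    # whether or not the deletion has actually finished...
    poller.wait(timeout=timeout_s)
    # ...so done() is how the caller finds out whether it completed.
    return poller.done()
```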

Anything that should be changed in that terminate_node method?

@gramhagen is this a bug? I can create a bug report on behalf of @M_S.

yeah i guess this is a bug. so your expectation is this call will block until the vm is deleted? I think that is what was intended as well. personally I’d prefer to let deletion happen in the background since it can take a few minutes, but it’s probably better to match the same behavior across clouds, and we can add support for a timeout if we need to, where it continues regardless of the current state of deletion.

@gramhagen I think the expectation is that when cache_stopped_nodes=False the node provider does not wait for VM termination. Right now the experience is that cache_stopped_nodes=True means very fast deallocation of the node (since it’s just stopping it), but cache_stopped_nodes=False waits for the node to completely terminate, which takes more time.

I gather from reading the GCP node provider that they also block until termination completes [1]. But AWS terminates [2] the instance when cache_stopped_nodes=False and doesn’t wait for the VMs to finish termination.

So the question is, which is a less surprising experience for users? It seems to me that blocking on termination is more surprising than what AWS does, but open to other opinions.

cc @Dmitri

[1] GCP node provider delete call, which blocks by default. ray/node.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
[2] AWS node provider terminate call, which doesn’t block. ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
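To make the AWS behavior in [2] concrete: with boto3, the terminate call returns as soon as the request is accepted. This is a standalone illustration with a placeholder instance ID and region, not the node provider code:

```python
# Illustration of fire-and-forget termination on AWS (placeholder instance ID).
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# terminate_instances returns once the request is accepted; the instances
# move to "shutting-down" and finish terminating in the background.
response = ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])
for inst in response["TerminatingInstances"]:
    print(inst["InstanceId"], inst["CurrentState"]["Name"])  # e.g. "shutting-down"
```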

I also think it’s more natural to return immediately – as long as the cloud provider records and makes visible the “terminating” status very soon after the API call.


So, no change needed? I don’t think we have node caching implemented though.

The change that’s needed is to remove the wait on this line: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub

i’m not even sure it’s actually waiting since there’s no value supplied to the wait method =)

Hi @gramhagen, do you have time to discuss offline? I am new to the Azure SDK, so chances are high that I am missing something. I traced the delete(...).wait() call in ray-project/ray to the Azure SDK, and it looks like by default it invokes thread.join(timeout), which the Python docs say will block if timeout is None. This means to me that the wait call will block until the delete completes, which causes the surprising behavior that is the source of this Discuss question.

thread.join call: azure-sdk-for-python/_poller.py at main · Azure/azure-sdk-for-python · GitHub

Instantiation of LROPoller object: azure-sdk-for-python/_virtual_machines_operations.py at main · Azure/azure-sdk-for-python · GitHub
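In other words, the chain above reduces to roughly this. It is a heavily simplified sketch of my understanding of the SDK’s poller, not its actual code; SketchPoller and polling_loop are made-up names:

```python
# Heavily simplified sketch of the SDK's poller, just to show why wait()
# blocks indefinitely by default. Not the real azure-core implementation.
import threading


class SketchPoller:
    """Stand-in for the SDK's LROPoller."""

    def __init__(self, polling_loop):
        # The SDK polls the long-running delete operation on a background thread.
        self._thread = threading.Thread(target=polling_loop)
        self._thread.start()

    def wait(self, timeout=None):
        # wait() is essentially thread.join(timeout). Ray calls it with no
        # argument, so timeout is None and join() blocks until the polling
        # thread sees the operation finish, i.e. until Azure reports the VM
        # as fully deleted.
        self._thread.join(timeout)
```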

no, this sounds right to me, i’m not super familiar with the azure sdk either and was just interpreting from the docs. I agree the right change is just to drop the wait after deallocation.

Ok, I created and tagged you in the GitHub bug report [Azure cluster launcher] Deletion takes ~10 minutes without stopped node caching, fast with stopped node caching · Issue #25971 · ray-project/ray · GitHub.

I will mark this thread as resolved.