How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
I am setting up a ray cluster on Azure.
There are two options in the yaml file about which I am wondering (`idle_timeout_minutes` and `cache_stopped_nodes`), and I have two observations:
- With `cache_stopped_nodes: True`, the worker nodes are deallocated after roughly one minute of not being used (I set `idle_timeout_minutes: 1`). This is what I expected. The same happens on `ray down cluster.yaml`: everything is deallocated rather fast.
- With `cache_stopped_nodes: False`, the removal of the nodes takes much longer (roughly 5-10 minutes), and I don't see any indication in Azure that the nodes are about to be deleted. The same is true for `ray down cluster.yaml`: when the head node is already deleted and no longer listed in Azure, I still see the workers there with `Running` status. They are removed after a while, though.
My question is: How are the workers actually deleted and why does it take so long for the second option?
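For context, here is a minimal sketch of how these settings appear in a Ray Azure cluster config (field placement assumed from the standard autoscaler YAML layout; all values are illustrative, not my actual config):

```yaml
# Sketch of the relevant fields in cluster.yaml (illustrative values).
cluster_name: azure-test

# Minutes of inactivity before the autoscaler removes a worker node.
idle_timeout_minutes: 1

provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
    # True: removed workers are only deallocated (stopped) so they can
    # be reused later. False: workers are fully deleted on removal.
    cache_stopped_nodes: True
```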
@Dmitri hey, could you take a look at this?
I’m going to tag in one of the maintainers of the Azure Node Provider – cc @gramhagen .
I can also provide code references. The autoscaler interacts with cloud providers via a Python interface called `NodeProvider`. The relevant piece of code is the Azure implementation of that interface. The method that handles node deletion is `terminate_node`. I do see in that method's code that it waits for deletion of the virtual machine: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
I wonder if there’s any way to handle that deletion asynchronously?
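One way asynchronous handling could look, sketched with a hypothetical helper (this is not Ray's actual code; `delete_vm` and `terminate_node_async` are stand-in names for illustration):

```python
import threading

def terminate_node_async(delete_vm, vm_name):
    """Hypothetical fire-and-forget termination: run the cloud-side
    delete call on a daemon thread so the caller returns immediately.
    `delete_vm` stands in for the provider SDK's delete call."""
    t = threading.Thread(target=delete_vm, args=(vm_name,), daemon=True)
    t.start()
    return t  # caller can still join later if it ever needs to block

# Demo with a stand-in delete function.
deleted = []
worker = terminate_node_async(deleted.append, "ray-worker-0")
worker.join()  # only for the demo; the autoscaler itself would not join
print(deleted)
```

The trade-off is that errors from the background delete would need separate reporting, since `terminate_node` would return before the cloud operation finishes.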
thanks for the answer.
When I run `ray down cluster.yaml`, after a moment Ray is actually done with it and claims that the nodes were deleted.
However, in the Azure portal they are still there. If I wait a few minutes, they are gone.
Is this just something Azure-internal? If Ray waits until the node actually is deleted, this should not happen, right?
I see, that is a little odd. It may be an Azure-internal issue.
I guess de-provisioning a VM is a multi-stage process. Maybe, once ray down exits, the VMs are in some sense “marked for deletion” and it takes Azure a while to transition the status and eliminate the VM.
It’s not obvious to me precisely what is being waited for in the API call I linked above.
It’s possible there’s a way to avoid this by changing the implementation or using the Azure APIs differently – I’m not familiar with these APIs though.
That wait function might need a timeout input if it’s not actually waiting for deletion to occur. I was assuming it would wait indefinitely as is.
Anything that should be changed in that terminate_node method?
@gramhagen is this a bug? I can create a bug report on behalf of @M_S.
yeah, i guess this is a bug. so your expectation is that this call will block until the vm is deleted? I think that is what was intended as well. personally I’d prefer to let deletion happen in the background, since it can take a few minutes, but it’s probably better to match the same behavior across clouds, and we can add support for a timeout if we need to, where it continues regardless of the current state of deletion.
@gramhagen I think the expectation is that when `cache_stopped_nodes=False`, the node provider does not wait for VM termination. Right now the experience is that `cache_stopped_nodes=True` means very fast deallocation of the node (since it’s just stopping it), but `cache_stopped_nodes=False` waits for the node to completely terminate, which takes more time.
I gather from reading the GCP node provider that it also blocks until termination completes. But AWS terminates the instance when `cache_stopped_nodes=False` and doesn’t wait for the VMs to finish termination.
So the question is, which is a less surprising experience for users? It seems to me that blocking on termination is more surprising than what AWS does, but open to other opinions.
GCP node provider delete call, which blocks by default: ray/node.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
AWS node provider terminate call, which doesn’t block: ray/node_provider.py at fa32cb7c401f5afa9a7cfb160c22aa62578af1ee · ray-project/ray · GitHub
I also think it’s more natural to return immediately – as long as the cloud provider records and makes visible the “terminating” status very soon after the API call.
So, no change needed? I don’t think we have node caching implemented though.
i’m not even sure it’s actually waiting since there’s no value supplied to the wait method =)
Hi @gramhagen, do you have time to discuss offline? I am new to the Azure SDK, so chances are high that I am missing something. I traced the `delete(...).wait()` call in ray-project/ray to the Azure SDK, and it looks like by default it invokes `thread.join(timeout)`, which, per the Python docs, blocks indefinitely when the timeout is `None`. This tells me that the `wait` call blocks until the delete completes, which causes the surprising behavior that is the source of this Discuss question.
thread.join call: azure-sdk-for-python/_poller.py at main · Azure/azure-sdk-for-python · GitHub
LROPoller object: azure-sdk-for-python/_virtual_machines_operations.py at main · Azure/azure-sdk-for-python · GitHub
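The `join` semantics can be demonstrated without the Azure SDK at all. Below is a toy poller, an assumed stand-in for `LROPoller` (not the real class), whose `wait` forwards to `thread.join(timeout)` the way the traced code path does:

```python
import threading
import time

class FakePoller:
    """Toy stand-in for the SDK's long-running-operation poller
    (assumed behavior only, not the real LROPoller)."""
    def __init__(self, work_seconds):
        # Background thread simulates the in-flight delete operation.
        self._thread = threading.Thread(target=time.sleep,
                                        args=(work_seconds,))
        self._thread.start()

    def wait(self, timeout=None):
        # Mirrors LROPoller.wait -> thread.join(timeout): with
        # timeout=None, join blocks until the operation finishes.
        self._thread.join(timeout)

    def done(self):
        return not self._thread.is_alive()

poller = FakePoller(work_seconds=1.5)
start = time.monotonic()
poller.wait(timeout=0.2)   # bounded wait: returns after ~0.2 s
print("elapsed:", time.monotonic() - start, "done:", poller.done())

poller.wait()              # timeout=None: blocks until completion
print("done:", poller.done())
```

So calling `wait()` with no argument is the indefinite-block case, which matches the observed `ray down` behavior.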
no, this sounds right to me. i’m not super familiar with the azure sdk either and was just interpreting from the docs. I agree the right change is just to drop the wait after deallocation.