Setup: I am using Ray Tune with resume="AUTO" on Azure (Linux) with Ray 2.0.0 and python 3.9. I am using spot instances, so sometimes Azure kills the node and restarts it (autoresume).
Problem: Azure gives the node a private IP address, 10.0.0.X. When Azure autoresumes, it assigns the node a new private IP address, 10.0.0.Y. When the newly assigned private IP is the same (X == Y), autoresume works correctly. When it is different (X != Y), Ray restores on node 10.0.0.Y and shortly afterwards fails with: `Error: No available node types can fulfill resource request {'node:10.0.0.X': 0.01}. Add suitable node types to this cluster to resolve this issue.`
Controlling which IP address is assigned to the node in Azure is not a desired option.
Question: How can I make sure Ray Tune handles this situation correctly when the private IP address it resumes on is different from the one it started the run on?
Thanks for the fast reply. Unfortunately, I am already using Ray 2.0.0 and still experience the bug.
Edit: Sorry for the confusion, I used my own fork of the Ray 2.0.0 release, which was behind the official 2.0.0 release. I will test against the official release, and once I see the problem is gone I will mark this as the solution.
Just for context: the version "2.0.0dev0" was used for the latest master until about June this year, when 2.0.0 became the actual 2.0 release and 3.0.0dev0 became the new master branch. So a 2.0.0dev0 build can be anything between 3 months and 1.5 years old.
I tried plain Ray 2.0.0 (installed directly from pip; it includes the alive-node check) and it gives the same error. I checked, and the call `checkpoint = _get_checkpoint_from_remote_node(checkpoint_path, checkpoint_node_ip)` never gets executed, because in `def restore(self, checkpoint_path: Union[str, Checkpoint], checkpoint_node_ip: Optional[str] = None)` in `ray.tune.trainable.trainable.py`, `checkpoint_node_ip` is `None`, and also because `_maybe_load_from_cloud(checkpoint_path)` returns `True`.
Also, if I remove the `if` statements in `restore(...)` so that `_get_checkpoint_from_remote_node(...)` does get executed, nothing really changes.
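To make the control flow I'm describing concrete, here is a minimal stand-alone sketch of how I understand `restore(...)` behaves in my case. The helper bodies are dummy stand-ins I wrote for illustration, not the actual Ray source:

```python
from typing import Optional

# Dummy stand-in: pretend cloud checkpointing is configured and the download succeeds.
def _maybe_load_from_cloud(checkpoint_path: str) -> bool:
    return True

# Dummy stand-in: this is the path that would place the
# {'node:10.0.0.X': 0.01} resource request on the (now gone) node.
def _get_checkpoint_from_remote_node(checkpoint_path: str, node_ip: str) -> None:
    raise RuntimeError(f"would request resources on node:{node_ip}")

def restore(checkpoint_path: str, checkpoint_node_ip: Optional[str] = None) -> str:
    # Sketch of the branch order as I understand it from
    # ray.tune.trainable.trainable.py in Ray 2.0.0:
    if _maybe_load_from_cloud(checkpoint_path):
        return "loaded from cloud"
    if checkpoint_node_ip is not None:
        _get_checkpoint_from_remote_node(checkpoint_path, checkpoint_node_ip)
        return "loaded from remote node"
    return "loaded locally"

# In my run, checkpoint_node_ip is None and the cloud branch succeeds,
# so the remote-node fetch is skipped entirely.
print(restore("/tmp/ckpt"))
```

Given that the remote-node fetch is never reached here, I don't understand where the `{'node:10.0.0.X': 0.01}` request is still coming from.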
I can reproduce this with two Docker containers, each with a different IP address: run the first one, then let the second one resume from a checkpoint of the first.
Let me know if you have any ideas how to solve this, or whether I should post a GitHub issue with a minimal reproduction script.
If `_maybe_load_from_cloud(checkpoint_path)` evaluates to `True` (which indicates that you are using cloud checkpointing?), we shouldn't run `_get_checkpoint_from_remote_node` at all, and thus shouldn't run into the error.
It would be great if you could post an issue with a minimal reproducible example - thanks!