Resuming to a different node IP than the original one doesn't work

  • High: It blocks me from completing my task.

Setup: I am using Ray Tune with resume="AUTO" on Azure (Linux) with Ray 2.0.0 and Python 3.9. I am using spot instances, so Azure sometimes kills the node and restarts it (autoresume).
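For reference, this is roughly how the run is started (the trainable, experiment name, and local_dir below are placeholders, not my actual code):

```python
from ray import tune


def my_trainable(config):
    # Dummy training loop standing in for my real workload.
    for step in range(100):
        tune.report(step=step)


# resume="AUTO" is supposed to pick up the existing experiment state after
# Azure restarts the spot node.
tune.run(
    my_trainable,
    name="my_experiment",
    local_dir="/mnt/ray_results",
    resume="AUTO",
)
```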

Problem: Azure gives the node a private IP address, 10.0.0.X. When Azure autoresumes, it gives the node a new private IP address, 10.0.0.Y. When the newly assigned private IP address is the same (X == Y), autoresume works correctly. When it is different (X != Y), the run restores on node 10.0.0.Y and shortly afterwards fails with the error:
Error: No available node types can fulfill resource request {'node:10.0.0.X': 0.01}. Add suitable node types to this cluster to resolve this issue.
Controlling which IP address Azure assigns to the node is not a desirable option.

Question: How can I make sure Ray Tune handles this situation correctly when the private IP address it resumes on is different from the one the run started on?

Hi @RaymondK,

this is a bug that has been resolved in Ray 2.0.0 - can you upgrade your Ray version?

Thanks for the fast reply. Unfortunately, I am already using Ray 2.0.0 and still experience the bug.

Edit: Sorry for the confusion - I was using my own Ray fork of the 2.0.0 release, which was behind the official 2.0.0 release. I will test with the official release and mark this as the solution once I see the problem is gone.

Sounds good! This is the relevant change you should be looking for: ray/util.py at master · ray-project/ray · GitHub

(i.e. the check for Alive).

Just for context: the version “2.0.0dev0” was used for the latest master until about June this year, when 2.0.0 became the actual 2.0 release and 3.0.0dev0 became the new master branch. So a “2.0.0dev0” build can be anything between 1.5 years and 3 months old :slight_smile:

Unfortunately the error is still there.

I tried plain Ray 2.0.0 (installed directly from pip; it includes the check for Alive) and it gives the same error. I checked, and the call checkpoint = _get_checkpoint_from_remote_node(checkpoint_path, checkpoint_node_ip) never gets executed, because in restore(self, checkpoint_path: Union[str, Checkpoint], checkpoint_node_ip: Optional[str] = None) in ray/tune/trainable/trainable.py, checkpoint_node_ip is None and _maybe_load_from_cloud(checkpoint_path) returns True.
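To make this concrete, here is a rough sketch of the control flow I see in restore() - this is my own paraphrase using the names from the Ray source, not the actual Ray 2.0.0 code:

```python
# Paraphrased control flow of Trainable.restore() (NOT the actual Ray 2.0.0
# source, just the two conditions relevant to my case).
def restore(self, checkpoint_path, checkpoint_node_ip=None):
    # 1. Try to download the checkpoint from cloud storage. In my run this
    #    returns True, ...
    loaded_from_cloud = self._maybe_load_from_cloud(checkpoint_path)

    # 2. ... so the remote-node fetch below is skipped (it would also be
    #    skipped anyway, because checkpoint_node_ip is None in my run).
    if not loaded_from_cloud and checkpoint_node_ip:
        checkpoint = _get_checkpoint_from_remote_node(
            checkpoint_path, checkpoint_node_ip
        )
        if checkpoint:
            checkpoint.to_directory(checkpoint_path)

    # The rest of restore() then loads state from checkpoint_path as usual.
```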

Also, if I remove the if statements in restore(...) so that _get_checkpoint_from_remote_node(...) does get executed, nothing really changes.

I can reproduce this by creating two Docker containers, each with a different IP address, running the first one, and then letting the second one resume from a checkpoint of the first.

Let me know if you have any ideas on how to solve this, or if I should post a GitHub issue with a minimal reproduction script.

If _maybe_load_from_cloud(checkpoint_path) evaluates to True (which indicates that you use cloud checkpointing?), we shouldn’t run _get_checkpoint_from_remote_node at all and thus shouldn’t run into the error.
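For context, cloud checkpointing is what you get when the run is configured with a SyncConfig that has an upload_dir - roughly like this (bucket name, experiment name, and trainable below are placeholders):

```python
from ray import tune


def my_trainable(config):
    # Placeholder training loop.
    for step in range(10):
        tune.report(step=step)


# With an upload_dir configured, Tune syncs checkpoints to cloud storage and
# restores them from there, so it does not need to reach the old node's IP.
tune.run(
    my_trainable,
    name="my_experiment",
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/ray-results"),
    resume="AUTO",
)
```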

It would be great if you could post an issue with a minimal reproducible example - thanks!

I posted an issue on GitHub: [Core] [Tune] Resuming to different node ip then original one doesn’t work · Issue #28468 · ray-project/ray · GitHub
Let me know if you need any more information.
Since it is one of the last crucial issues keeping me from properly using Ray, I am happy to help out in any way I can.
Workarounds would also be helpful.