Setup: I am using Ray Tune with resume="AUTO" on Azure (Linux) with Ray 2.0.0 and python 3.9. I am using spot instances, so sometimes Azure kills the node and restarts it (autoresume).
Problem: Azure gives the node a private IP address, 10.0.0.X. When Azure autoresumes, it assigns the node a new private IP address, 10.0.0.Y. When the newly assigned private IP is the same (X == Y), autoresume works correctly. When it is different (X != Y), Ray restores on node 10.0.0.Y and shortly afterwards fails with: `Error: No available node types can fulfill resource request {'node:10.0.0.X': 0.01}. Add suitable node types to this cluster to resolve this issue.`
Controlling which IP address is assigned to the node in Azure is not a desired option.
Question: How can I make sure Ray Tune handles this situation correctly when the private IP address it resumes on is different from the one it started the run on?
Thanks for the fast reply. Unfortunately, I am already using Ray 2.0.0 and still experience the bug.
Edit: Sorry for the confusion, I used my own fork of the Ray 2.0.0 release, which was behind the official 2.0.0 release. I will test against the official release, and once I see the problem is gone I will mark this as the solution.
Just for context: the version "2.0.0dev0" was used for the latest master until about June this year, when 2.0.0 became the actual 2.0 release and 3.0.0dev0 became the new master branch. So a 2.0.0dev0 build can be anything between 3 months and 1.5 years old.
I tried plain Ray 2.0.0 (installed directly from pip; it includes the alive-node check) and it gives the same error. I checked, and the call `checkpoint = _get_checkpoint_from_remote_node(checkpoint_path, checkpoint_node_ip)` never gets executed, because in `def restore(self, checkpoint_path: Union[str, Checkpoint], checkpoint_node_ip: Optional[str] = None)` in `ray.tune.trainable.trainable.py`, `checkpoint_node_ip` is `None`, and also because `_maybe_load_from_cloud(checkpoint_path)` returns `True`.
Also, if I remove the `if` statements in `restore(...)` so that `_get_checkpoint_from_remote_node(...)` does get executed, nothing really changes.
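To make the control flow I'm describing concrete, here is a minimal stand-alone sketch of how I understand `restore(...)` behaves in my case. The helper bodies are dummy stand-ins I wrote for illustration, not the actual Ray source:

```python
from typing import Optional

# Dummy stand-in: pretend cloud checkpointing is configured and the download succeeds.
def _maybe_load_from_cloud(checkpoint_path: str) -> bool:
    return True

# Dummy stand-in: this is the path that would place the
# {'node:10.0.0.X': 0.01} resource request on the (now gone) node.
def _get_checkpoint_from_remote_node(checkpoint_path: str, node_ip: str) -> None:
    raise RuntimeError(f"would request resources on node:{node_ip}")

def restore(checkpoint_path: str, checkpoint_node_ip: Optional[str] = None) -> str:
    # Sketch of the branch order as I understand it from
    # ray.tune.trainable.trainable.py in Ray 2.0.0:
    if _maybe_load_from_cloud(checkpoint_path):
        return "loaded from cloud"
    if checkpoint_node_ip is not None:
        _get_checkpoint_from_remote_node(checkpoint_path, checkpoint_node_ip)
        return "loaded from remote node"
    return "loaded locally"

# In my run, checkpoint_node_ip is None and the cloud branch succeeds,
# so the remote-node fetch is skipped entirely.
print(restore("/tmp/ckpt"))
```

Given that the remote-node fetch is never reached here, I don't understand where the `{'node:10.0.0.X': 0.01}` request is still coming from.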
I can reproduce this with two Docker containers, each with a different IP address: run the first one, then let the second one resume from a checkpoint of the first.
Let me know if you have any ideas how to solve this, or whether I should post a GitHub issue with a minimal reproduction script.
If `_maybe_load_from_cloud(checkpoint_path)` evaluates to `True` (which indicates that you are using cloud checkpointing?), we shouldn't run `_get_checkpoint_from_remote_node` at all, and thus shouldn't run into the error.
It would be great if you could post an issue with a minimal reproducible example - thanks!