Error: No available node types can fulfill resource request {'node:10.0.0.9': 0.01} when autoresuming

When I resume locally with Tune on Windows 10, or with Docker in a Linux environment, everything works fine. However, when an autoresume happens on Azure (due to using spot instances) I get this strange error:

Error: No available node types can fulfill resource request {'node:10.0.0.9': 0.01}. Add suitable node types to this cluster to resolve this issue.

I have no clue how to fix this or where it is coming from.
It seems that the previous resources are not being freed up, but since I have no control over when Azure autoresumes, it is hard to test this or write a reproduction script. So any hint or best guess as to what this could be would be welcome.

Maybe someone knows what {'node:10.0.0.9': 0.01} could mean? Normally when a resource is requested, something like {'GPU': 1, 'CPU': 4} is shown instead of this node format.

Any help is appreciated.

Hi! Please post this question under the Ray AIR forums.
The source of this issue is very unlikely to be rooted in RLlib, so you'll be better helped there.
Cheers


This makes sense. I didn't think of it, and since I use RLlib so heavily, that was my default category :slight_smile:
Is just changing the category label by editing this post enough, like I did now?

Hey @RaymondK, the error seems to come from some internal logic that is trying to schedule an actor/task on the node with IP 10.0.0.9, which I'm presuming is the original instance that was interrupted.
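
For context, every node in a Ray cluster advertises a custom resource named node:<its-ip> with a value of 1.0 (you can see it in the ray.nodes() output), and requesting a tiny fraction of it, like {'node:10.0.0.9': 0.01}, is how Ray pins a task or actor to that exact node. A rough sketch of what such a request looks like, assuming a plain local cluster (the IP is just whatever the local node reports):

import ray

ray.init()

# Each node advertises a "node:<node-ip>" resource with value 1.0; requesting a
# small fraction of it pins the task to that specific node.
node_ip = ray.util.get_node_ip_address()

@ray.remote(resources={f"node:{node_ip}": 0.01})
def pinned_task():
    return ray.util.get_node_ip_address()

print(ray.get(pinned_task.remote()))  # should print node_ip

If the node that originally advertised node:10.0.0.9 is gone after the spot interruption, a request that still references it can never be satisfied.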

Could you share more of the output/trace prior to this error?


Yes, the error itself doesn't really come with a trace, but I have collected some additional information that could be useful.

I printed several things (from a different run, so the exact node is different):
print(ray.nodes())
[{'NodeID': '6c3987f3218b11eee734260d04e1052cad5a5f346c710ac43026bbe2', 'Alive': True, 'NodeManagerAddress': '10.0.0.4', 'NodeManagerHostname': '4becae3a200e4a8e82feb3e63b37231100000C', 'NodeManagerPort': 44003, 'ObjectManagerPort': 36225, 'ObjectStoreSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/raylet', 'MetricsExportPort': 56375, 'NodeName': '10.0.0.4', 'alive': True, 'Resources': {'object_store_memory': 35206370918.0, 'accelerator_type:M60': 1.0, 'node:10.0.0.4': 1.0, 'memory': 72148198810.0, 'GPU': 1.0, 'CPU': 12.0}}]

print(ray.cluster_resources())
{'accelerator_type:M60': 1.0, 'GPU': 1.0, 'memory': 72143714509.0, 'CPU': 12.0, 'node:10.0.0.4': 1.0, 'object_store_memory': 35204449075.0}

print(ray.available_resources())
{'node:10.0.0.4': 1.0, 'object_store_memory': 35206370918.0, 'memory': 72148198810.0, 'CPU': 12.0, 'accelerator_type:M60': 1.0, 'GPU': 1.0}
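
For anyone comparing the same outputs, the per-node entries can be filtered out of the cluster resources directly; a small sketch, nothing Azure-specific, and the IP from the error message should simply be absent from the result once the spot node has been replaced:

import ray

ray.init(address="auto")  # attach to the running cluster

# Keep only the per-node "node:<ip>" entries; each one corresponds to a live node.
node_resources = {k: v for k, v in ray.cluster_resources().items() if k.startswith("node:")}
print(node_resources)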

And some autoresume information:
2022-09-06 18:18:33,784 INFO trainable.py:668 -- Restored on 10.0.0.4 from checkpoint: /tmp/checkpoint_tmp_1k0rhtzs
2022-09-06 18:18:33,785 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 90, '_timesteps_total': None, '_time_total': 3375.970132827759, '_episodes_total': 14337}

Then it sometimes does one or even a few training iterations before saying:
Error: No available node types can fulfill resource request {'node:10.0.0.10': 0.01}. Add suitable node types to this cluster to resolve this issue.

Note that the node it restores on (10.0.0.4) is different from the one in the request it can't fulfill (10.0.0.10).

Maybe this is an indication that something is off?

Edit: In Azure you can also see the previous (killed) run, and its cluster resources, available resources, and ray nodes all show the node 10.0.0.10.

Edit: I found out that 10.0.0.x stands for a private IP address in Azure. So my preliminary hypothesis is now that the job runs on a node with IP 10.0.0.A, and when it resumes it does so on a new node with IP 10.0.0.B, which apparently leads to a resource request error because it still wants to use the old IP.
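
If that hypothesis is right, it should be reproducible even without Azure by requesting a node:<ip> resource for an IP that no node in the cluster advertises. A rough sketch, where 10.0.0.10 just stands in for the old node's IP:

import ray

ray.init()

# No node in this local cluster advertises "node:10.0.0.10", so the request below
# can never be fulfilled; the task stays pending and Ray warns about the
# infeasible resource request.
@ray.remote(resources={"node:10.0.0.10": 0.01})
def stuck_task():
    return "never runs"

ref = stuck_task.remote()
ray.get(ref, timeout=10)  # raises GetTimeoutError since the task is never scheduled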

I understand the issue better now and for clarity made a new post: Resuming to different node ip then original one doesn't work