Yes, the error itself doesn't really come with a stack trace, but I have collected some additional information that could be useful.
I printed a few things (from a different run, so the exact node is different).
print(ray.nodes())
[{'NodeID': '6c3987f3218b11eee734260d04e1052cad5a5f346c710ac43026bbe2', 'Alive': True, 'NodeManagerAddress': '10.0.0.4', 'NodeManagerHostname': '4becae3a200e4a8e82feb3e63b37231100000C', 'NodeManagerPort': 44003, 'ObjectManagerPort': 36225, 'ObjectStoreSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/raylet', 'MetricsExportPort': 56375, 'NodeName': '10.0.0.4', 'alive': True, 'Resources': {'object_store_memory': 35206370918.0, 'accelerator_type:M60': 1.0, 'node:10.0.0.4': 1.0, 'memory': 72148198810.0, 'GPU': 1.0, 'CPU': 12.0}}]
print(ray.cluster_resources())
{'accelerator_type:M60': 1.0, 'GPU': 1.0, 'memory': 72143714509.0, 'CPU': 12.0, 'node:10.0.0.4': 1.0, 'object_store_memory': 35204449075.0}
print(ray.available_resources())
{'node:10.0.0.4': 1.0, 'object_store_memory': 35206370918.0, 'memory': 72148198810.0, 'CPU': 12.0, 'accelerator_type:M60': 1.0, 'GPU': 1.0}
And some autoresume information:
2022-09-06 18:18:33,784 INFO trainable.py:668 -- Restored on 10.0.0.4 from checkpoint: /tmp/checkpoint_tmp_1k0rhtzs
2022-09-06 18:18:33,785 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 90, '_timesteps_total': None, '_time_total': 3375.970132827759, '_episodes_total': 14337}
Then it sometimes does one or even a few training iterations before saying:
Error: No available node types can fulfill resource request {'node:10.0.0.10': 0.01}. Add suitable node types to this cluster to resolve this issue.
Note that the node it restores from (10.0.0.4) is different from the node in the request it can't fulfill (10.0.0.10). Maybe this is an indication that something is off?
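For what it's worth, here is a minimal sketch of how a node-pinned request like {'node:10.0.0.10': 0.01} can arise in general (my own illustration, not taken from the Tune internals): Ray creates a custom resource named node:&lt;ip&gt; for every node, and anything requesting a fraction of it can only be scheduled while a node with that exact IP is alive.

```python
import ray
import socket

ray.init()

# Every Ray node automatically exposes a custom resource "node:<ip>" with
# capacity 1.0. Requesting a small fraction of it pins work to that node.
# The IP below is the stale one from my error message, used purely for
# illustration; no node in the resumed cluster provides this resource.
old_node_ip = "10.0.0.10"

@ray.remote(resources={f"node:{old_node_ip}": 0.01})
def pinned_task():
    # Would report the IP of the node it actually ran on.
    return socket.gethostbyname(socket.gethostname())

# On an autoscaling cluster, submitting this task makes the autoscaler
# report a message like the one I see:
#   No available node types can fulfill resource request
#   {'node:10.0.0.10': 0.01}. Add suitable node types to this cluster ...
# (ray.get(pinned_task.remote()) would hang forever here, so I leave it out.)
```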
Edit: In Azure you can also see the previous (killed) run, and there the cluster resources, available resources, and ray.nodes() output all show the node 10.0.0.10.
Edit: I found out that 10.0.0.x is a private IP address in Azure. So my preliminary hypothesis is now that the job first runs on a node with IP 10.0.0.A, and on resume it runs on a new node with IP 10.0.0.B, which apparently leads to the resource request error because something still asks for the old IP.
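To check that hypothesis, here is a small snippet (my own, based on the ray.nodes() output above) that I can run in the resumed driver to compare the live node IPs with the IP in the failing request:

```python
import ray

ray.init(address="auto")

# IPs of nodes that are currently alive in the resumed cluster.
live_ips = {n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]}
print("Live node IPs:", live_ips)  # e.g. {'10.0.0.4'} after the resume

# IP taken from the failing resource request in the error message.
requested_ip = "10.0.0.10"
if requested_ip not in live_ips:
    print(f"The request pins to {requested_ip}, which no longer exists "
          "in this cluster, so it can never be fulfilled.")
```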