Yes, the error itself doesn't really come with a stack trace, but I have collected some additional information that could be useful.
I printed a few things (from a different run, so the exact node is different).
print(ray.nodes())
[{'NodeID': '6c3987f3218b11eee734260d04e1052cad5a5f346c710ac43026bbe2', 'Alive': True, 'NodeManagerAddress': '10.0.0.4', 'NodeManagerHostname': '4becae3a200e4a8e82feb3e63b37231100000C', 'NodeManagerPort': 44003, 'ObjectManagerPort': 36225, 'ObjectStoreSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2022-09-06_18-18-09_544524_21/sockets/raylet', 'MetricsExportPort': 56375, 'NodeName': '10.0.0.4', 'alive': True, 'Resources': {'object_store_memory': 35206370918.0, 'accelerator_type:M60': 1.0, 'node:10.0.0.4': 1.0, 'memory': 72148198810.0, 'GPU': 1.0, 'CPU': 12.0}}]
print(ray.cluster_resources())
{'accelerator_type:M60': 1.0, 'GPU': 1.0, 'memory': 72143714509.0, 'CPU': 12.0, 'node:10.0.0.4': 1.0, 'object_store_memory': 35204449075.0}
print(ray.available_resources())
{'node:10.0.0.4': 1.0, 'object_store_memory': 35206370918.0, 'memory': 72148198810.0, 'CPU': 12.0, 'accelerator_type:M60': 1.0, 'GPU': 1.0}
And some autoresume information:
2022-09-06 18:18:33,784 INFO trainable.py:668 -- Restored on 10.0.0.4 from checkpoint: /tmp/checkpoint_tmp_1k0rhtzs
2022-09-06 18:18:33,785 INFO trainable.py:677 -- Current state after restoring: {'_iteration': 90, '_timesteps_total': None, '_time_total': 3375.970132827759, '_episodes_total': 14337}
Then it sometimes does one or even a few training iterations before saying:
Error: No available node types can fulfill resource request {'node:10.0.0.10': 0.01}. Add suitable node types to this cluster to resolve this issue.
Note that the node it restores from (10.0.0.4) is different from the node in the request it can't fulfill (10.0.0.10). Maybe this is an indication that something is off?
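For what it's worth, here is a minimal sketch of how a node-pinned request like {'node:10.0.0.10': 0.01} can arise in general (my own illustration, not taken from the Tune internals): Ray creates a custom resource named node:&lt;ip&gt; for every node, and anything requesting a fraction of it can only be scheduled while a node with that exact IP is alive.

```python
import ray
import socket

ray.init()

# Every Ray node automatically exposes a custom resource "node:<ip>" with
# capacity 1.0. Requesting a small fraction of it pins work to that node.
# The IP below is the stale one from my error message, used purely for
# illustration; no node in the resumed cluster provides this resource.
old_node_ip = "10.0.0.10"

@ray.remote(resources={f"node:{old_node_ip}": 0.01})
def pinned_task():
    # Would report the IP of the node it actually ran on.
    return socket.gethostbyname(socket.gethostname())

# On an autoscaling cluster, submitting this task makes the autoscaler
# report a message like the one I see:
#   No available node types can fulfill resource request
#   {'node:10.0.0.10': 0.01}. Add suitable node types to this cluster ...
# (ray.get(pinned_task.remote()) would hang forever here, so I leave it out.)
```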
Edit: In Azure you can also see the previous (killed) run, and there the cluster resources, available resources, and ray.nodes() output all show the node 10.0.0.10.
Edit: I found out that 10.0.0.x is a private IP address in Azure. So my preliminary hypothesis is now that the job first runs on a node with IP 10.0.0.A, and on resume it runs on a new node with IP 10.0.0.B, which apparently leads to the resource request error because something still asks for the old IP.
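To check that hypothesis, here is a small snippet (my own, based on the ray.nodes() output above) that I can run in the resumed driver to compare the live node IPs with the IP in the failing request:

```python
import ray

ray.init(address="auto")

# IPs of nodes that are currently alive in the resumed cluster.
live_ips = {n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]}
print("Live node IPs:", live_ips)  # e.g. {'10.0.0.4'} after the resume

# IP taken from the failing resource request in the error message.
requested_ip = "10.0.0.10"
if requested_ip not in live_ips:
    print(f"The request pins to {requested_ip}, which no longer exists "
          "in this cluster, so it can never be fulfilled.")
```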