No available node types to fulfill the request

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

When no nodes can fulfill resource requests, Ray correctly returns an error, but it looks like the requests are never deleted. As a result, several such requests seem to clog the master, and eventually the master can become unresponsive. Am I missing something? Is this by design? Is there an option that can override this behaviour and remove all requests that cannot be fulfilled?

Hey @blublinsky, this sounds like a bug. An unfulfilled resource request shouldn't block new tasks from being scheduled. Do you have a repro script, or could you create a GitHub issue?

Hi @Chen_Shen. Here’s a basic example that shows that behaviour:

import ray
import sys

# A trivial task; the number of CPUs it requests is set per call via .options()
@ray.remote
def f():
	print("I am function f")

if __name__ == "__main__":
	# CPUs to request per task, taken from the command line
	num_cpus = int(sys.argv[1])
	ray.init()
	ray.get([f.options(num_cpus=num_cpus).remote() for _ in range(4)])

Assuming that it is in a script called unfeasible_request.py, if you run it as:

python unfeasible_request.py 16

(for instance, or substituting 16 with any number of CPUs larger than what is available on your Ray nodes), you observe the following:

  • the autoscaler prints an error:
(scheduler +3s) Error: No available node types can fulfill resource request {'CPU': 16.0}. Add suitable node types to this cluster to resolve this issue.

as expected.

  • executing “ray status”, you get:
======== Autoscaler status: 2022-10-17 12:08:53.809035 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_d34510a97d2b9c025bebd150d972b83b4aca6310beaef5b8340ab943
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0.00/17.913 GiB memory
 0.00/2.000 GiB object_store_memory

Demands:
 {'CPU': 16.0}: 4+ pending tasks/actors

The issue is that those two messages (the autoscaler error and the ray status output) are still printed after you kill the unfeasible_request.py script, and even (if you run everything directly from Python) after calling ray.shutdown(), as soon as you reconnect to the cluster.
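For reference, a minimal sketch of that interactive variant. It assumes the cluster was started separately (e.g. with ray start --head) so that address="auto" can connect to it; the 16-CPU request and the 4 tasks mirror the script above, and the refs variable is only illustrative:

import ray

# Connect to the already running cluster
ray.init(address="auto")

@ray.remote
def f():
	print("I am function f")

# Submit the infeasible tasks; .remote() returns object refs immediately,
# so nothing blocks here
refs = [f.options(num_cpus=16).remote() for _ in range(4)]

# Disconnect without waiting for the tasks
ray.shutdown()

# Afterwards, "ray status" still shows {'CPU': 16.0}: 4+ pending tasks/actors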

Apparently, the only way to remove the request completely is to stop and restart the cluster (which, however, is not always feasible).

New tasks are not blocked immediately, but this does happen once many infeasible requests have accumulated.
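Until that is fixed, one possible mitigation is to compare the request against the cluster's total capacity before submitting, so infeasible tasks are never queued in the first place. This is only a sketch mirroring the repro script above (pass address="auto" to ray.init() if you are connecting to an already running cluster); it does not clear demands that are already pending:

import ray
import sys

@ray.remote
def f():
	print("I am function f")

if __name__ == "__main__":
	requested_cpus = int(sys.argv[1])
	ray.init()
	# Total CPUs registered with the cluster, e.g. {'CPU': 8.0, ...}
	total_cpus = ray.cluster_resources().get("CPU", 0)
	if requested_cpus > total_cpus:
		print(f"Refusing to submit: {requested_cpus} CPUs requested, only {total_cpus} in the cluster")
	else:
		ray.get([f.options(num_cpus=requested_cpus).remote() for _ in range(4)])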


Thanks, I tracked it as an issue here: [Core] Infeasible requests leaked even if the submitting job is canceled · Issue #29468 · ray-project/ray · GitHub. Hopefully it will get fixed in the next few weeks.

Marking the above response as the resolution. Let's track the GitHub issue instead.