No available node types to fulfill the request

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

When no nodes can fulfill a resource request, Ray correctly returns an error, but the request itself never seems to be deleted. As a result, several such requests appear to clog the head node, which can eventually become unresponsive. Am I missing something? Is this by design? Is there an option that overrides this behaviour and removes all requests that cannot be fulfilled?

Hey @blublinsky, this sounds like a bug. An unfulfilled resource request shouldn’t block new tasks from being scheduled. Do you have a repro script, or could you create a GitHub issue?

Hi @Chen_Shen. Here’s a basic example that shows that behaviour:

import ray
import sys

@ray.remote
def f():
    print("I am function f")

if __name__ == "__main__":
    num_cpus = int(sys.argv[1])
    ray.get([f.options(num_cpus=num_cpus).remote() for _ in range(4)])

Assuming the code is saved in a script, run it as:

python <script_name> 16

(substituting 16 with any number of CPUs larger than what is available on your Ray nodes), and you observe the following:

  • the autoscaler prints an error:
(scheduler +3s) Error: No available node types can fulfill resource request {'CPU': 16.0}. Add suitable node types to this cluster to resolve this issue.

as expected.

  • executing “ray status”, you get:
======== Autoscaler status: 2022-10-17 12:08:53.809035 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_d34510a97d2b9c025bebd150d972b83b4aca6310beaef5b8340ab943
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0.00/17.913 GiB memory
 0.00/2.000 GiB object_store_memory

Demands:
 {'CPU': 16.0}: 4+ pending tasks/actors
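For intuition, the feasibility check behind that error message can be sketched in plain Python. This is a simplified illustration, not Ray's actual internals: the names `node_types`, `demand`, and `feasible` below are all assumptions made up for the example. A demand like {'CPU': 16.0} is infeasible when no configured node type can cover every requested resource.

```python
# Illustrative sketch only; these names and this logic are simplified
# assumptions, not Ray's real scheduler code.
node_types = {"default": {"CPU": 8.0}}   # the single 8-CPU node from `ray status`
demand = {"CPU": 16.0}                   # one task's resource request

def feasible(demand, node_types):
    # A demand is feasible if at least one node type provides
    # every requested resource in the requested amount.
    return any(
        all(nt.get(res, 0.0) >= amt for res, amt in demand.items())
        for nt in node_types.values()
    )

print(feasible(demand, node_types))  # False: no node type has 16 CPUs
```

Each of the four submitted tasks produces the same infeasible demand, which is why the status output shows "4+ pending tasks/actors" for {'CPU': 16.0}.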

The issue is that both messages (the autoscaler error and the ray status output) are still shown after you kill the script, and even (if you run everything directly from Python) after calling ray.shutdown(): they reappear as soon as you reconnect to the cluster.

Apparently, the only way to remove the request completely is to stop and restart the cluster, which is not always feasible.

New tasks are not blocked immediately, but this happens some time after many infeasible requests have accumulated.


Thanks, I’m tracking it as an issue here: [Core] Infeasible requests leaked even if the submitting job is canceled · Issue #29468 · ray-project/ray · GitHub. Hopefully it will be fixed in the next few weeks.

Marking the above response as the resolution. Let’s track the GitHub issue instead.