Error: No available node types can fulfill resource request

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying out the MNIST example from Ray Train: Distributed Deep Learning — Ray 1.11.0.

I’m stuck with the message Error: No available node types can fulfill resource request, even though my manually created cluster has enough resources, as shown in the ray status output below.

======== Autoscaler status: 2022-03-17 07:07:07.314583 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_f1daa64a6cc101a788d809505aa3e4ae30388b547e6403bc96ccb0c7
 1 node_f311886e3779057012dae9c50ba25aeddd79355ed4972ce70f7bafad
 1 node_8504699f59ceea18865819a7afc4c8456085812a75d53cd281aab5a9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU (0.0 used of 8.0 reserved in placement groups)
 0.0/4.0 GPU (0.0 used of 4.0 reserved in placement groups)
 0.0/2.0 accelerator_type:V100
 0.00/143.839 GiB memory
 0.00/27.940 GiB object_store_memory

Demands:
 {'GPU': 1.0, 'CPU': 8.0} * 4 (PACK): 1+ pending placement groups
 {'CPU': 1.0, 'cpu': 8.0, 'gpu': 1.0} * 4 (PACK): 1+ pending placement groups
 {'GPU': 1.0, 'CPU': 1.0} * 4 (PACK): 1+ pending placement groups
  1. Why is the demand {'GPU': 1.0, 'CPU': 8.0} * 4 (PACK) still showing as a pending placement group?

I have 3 nodes:
a) 16 CPUs and 0 GPUs
b) 16 CPUs and 2 Nvidia V100 GPUs
c) 16 CPUs and 2 Nvidia V100 GPUs

I used the code below to launch the training:

    from ray.train import Trainer

    # 4 workers, each reserving 1 GPU and 8 CPUs inside the Trainer's placement group
    trainer = Trainer(
        backend="tensorflow",
        num_workers=4,
        resources_per_worker={"GPU": 1, "CPU": 8},
        use_gpu=True,
    )
    trainer.start()
    results = trainer.run(train_func_distributed)
    # trainer.shutdown()
  1. And does trainer.shutdown() tear down my manually created cluster?

I’m using ray==1.10.0 on all my nodes with python==3.7.10
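
For context, train_func_distributed is based on the MNIST example from that guide; roughly (a minimal sketch, not my exact code, assuming TensorFlow's MultiWorkerMirroredStrategy, which Ray Train's tensorflow backend configures through TF_CONFIG):

    import tensorflow as tf

    def train_func_distributed():
        # Ray Train's tensorflow backend sets TF_CONFIG on every worker, so the
        # strategy can discover its peers.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential([
                tf.keras.layers.Flatten(input_shape=(28, 28)),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(10),
            ])
            model.compile(
                optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=["accuracy"],
            )
        (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
        x_train = x_train / 255.0
        history = model.fit(x_train, y_train, epochs=3, batch_size=64)
        return history.history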

I recreated the Ray head and workers, and it seems to work now. But if train_func_distributed() fails for any reason, subsequent calls to

    trainer.start()
    results = trainer.run(train_func_distributed)

time out with

RayTaskError(TimeoutError): ray::BackendExecutor.start() (pid=1305, ip=100.96.197.8, repr=<ray.train.backend.BackendExecutor object at 0x7f10929e4710>)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/backend.py", line 153, in start
    self._create_placement_group()
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/backend.py", line 231, in _create_placement_group
    placement_group.bundle_specs))
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'CPU': 32.0, 'object_store_memory': 30000000000.0, 'node:100.97.80.164': 1.0, 'memory': 153690030285.0, 'GPU_group_1_096a9396af73a519d0323f066c57cb51': 1.0, 'bundle_group_0_096a9396af73a519d0323f066c57cb51': 1000.0, 'node:100.96.173.48': 1.0, 'accelerator_type:V100': 2.0, 'bundle_group_096a9396af73a519d0323f066c57cb51': 4000.0, 'CPU_group_1_096a9396af73a519d0323f066c57cb51': 4.0, 'GPU_group_0_096a9396af73a519d0323f066c57cb51': 1.0, 'CPU_group_0_096a9396af73a519d0323f066c57cb51': 4.0, 'bundle_group_1_096a9396af73a519d0323f066c57cb51': 1000.0, 'CPU_group_2_096a9396af73a519d0323f066c57cb51': 4.0, 'GPU_group_3_096a9396af73a519d0323f066c57cb51': 1.0, 'CPU_group_3_096a9396af73a519d0323f066c57cb51': 4.0, 'bundle_group_2_096a9396af73a519d0323f066c57cb51': 1000.0, 'node:100.96.197.8': 1.0, 'GPU_group_2_096a9396af73a519d0323f066c57cb51': 1.0, 'bundle_group_3_096a9396af73a519d0323f066c57cb51': 1000.0}, resources requested by the placement group: [{'GPU': 1.0, 'CPU': 4.0}, {'GPU': 1.0, 'CPU': 4.0}, {'GPU': 1.0, 'CPU': 4.0}]

and this request is not cleared from ray status. How do I clear this demand from ray status so my subsequent training requests go through?


For the failures you encountered, could you instead try running trainer.shutdown() prior to rerunning trainer.start() and trainer.run()? This should clear out the resources that were allocated for the failed run.

And does trainer.shutdown() teardown my manually created cluster?

Nope, it shouldn’t affect the cluster itself. Instead, it will clean up your distributed training workers and make the resources available again.
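
To make that concrete, a minimal pattern (just a sketch, using the same Ray 1.10 Trainer API as your snippet) would be to always shut the trainer down, even when the training function fails:

    from ray.train import Trainer

    # train_func_distributed is the training function from the first post.
    trainer = Trainer(
        backend="tensorflow",
        num_workers=4,
        resources_per_worker={"GPU": 1, "CPU": 8},
        use_gpu=True,
    )
    trainer.start()
    try:
        results = trainer.run(train_func_distributed)
    finally:
        # Release the training workers and the Trainer's placement group even if
        # train_func_distributed raised, so the reserved CPUs/GPUs show up as
        # available again in `ray status`.
        trainer.shutdown()

Since this only shuts down the training workers, the cluster itself (head and worker nodes) stays up.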


trainer.shutdown() throws this error when called after the TimeoutError I pasted above.

---------------------------------------------------------------------------
RayTaskError(InactiveWorkerGroupError)    Traceback (most recent call last)
/tmp/ipykernel_229/4170139357.py in <module>
----> 1 trainer.shutdown()

~/.local/lib/python3.7/site-packages/ray/train/trainer.py in shutdown(self)
    449     def shutdown(self):
    450         """Shuts down the training execution service."""
--> 451         ray.get(self._backend_executor_actor.shutdown.remote())
    452 
    453     def to_tune_trainable(

~/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106 
    107     return wrapper

~/.local/lib/python3.7/site-packages/ray/worker.py in get(object_refs, timeout)
   1731                     worker.core_worker.dump_object_store_memory_usage()
   1732                 if isinstance(value, RayTaskError):
-> 1733                     raise value.as_instanceof_cause()
   1734                 else:
   1735                     raise value

RayTaskError(InactiveWorkerGroupError): ray::BackendExecutor.shutdown() (pid=1305, ip=100.96.197.8, repr=<ray.train.backend.BackendExecutor object at 0x7f10929e4710>)
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/backend.py", line 547, in shutdown
    self.worker_group.shutdown()
  File "/home/jobuser/.local/lib/python3.7/site-packages/ray/train/backend.py", line 603, in __getattr__
    raise InactiveWorkerGroupError()
ray.train.backend.InactiveWorkerGroupError

Not sure then how to clear the existing demands when trainer.run fails

Hey @Nitin_Pasumarthy ,

Do you have a simple repro for this issue? If trainer.shutdown() raises the InactiveWorkerGroupError, then you should not see any of the resources being requested/reserved in ray status anymore.
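
One quick way to double-check from the driver (a small sketch; ray.cluster_resources() and ray.available_resources() are standard Ray APIs) is to compare total vs. currently available resources after the shutdown attempt:

    import ray

    # Connect to the existing cluster from the driver.
    ray.init(address="auto", ignore_reinit_error=True)

    # If the failed run's placement group was released, the CPU/GPU counts here
    # should match the cluster totals again.
    print(ray.cluster_resources())
    print(ray.available_resources())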

Hopefully you can see the outputs as well.

Thanks for sharing the repro!

How do I clear this Demand from ray status so my subsequent training requests go through?

For this particular question, subsequent requests should still be able to go through - see the simple example below:

>>> from ray.train import Trainer
>>> t = Trainer(backend="torch", num_workers=20)
2022-03-20 15:15:09,645	INFO services.py:1462 -- View the Ray dashboard at http://127.0.0.1:8265
2022-03-20 15:15:12,055	INFO trainer.py:223 -- Trainer logs will be logged in: /Users/matt/ray_results/train_2022-03-20_15-15-12
>>> t.start()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/trainer.py", line 263, in start
    self._backend_executor.start(initialization_hook)
  File "/Users/matt/workspace/ray/python/ray/train/utils.py", line 173, in <lambda>
    return lambda *args, **kwargs: ray.get(actor_method.remote(*args, **kwargs))
  File "/Users/matt/workspace/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/matt/workspace/ray/python/ray/worker.py", line 1793, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/Users/matt/workspace/ray/python/ray/worker.py", line 362, in get_objects
    object_refs, self.current_task_id, timeout_ms
  File "python/ray/_raylet.pyx", line 1198, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 167, in ray._raylet.check_status
KeyboardInterrupt
>>> t = Trainer(backend="torch", num_workers=2)
2022-03-20 15:15:46,701	INFO trainer.py:223 -- Trainer logs will be logged in: /Users/matt/ray_results/train_2022-03-20_15-15-46
>>> t.start()
>>> (BaseWorkerMixin pid=13258) 2022-03-20 15:15:51,889	INFO torch.py:335 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=13259) 2022-03-20 15:15:51,890	INFO torch.py:335 -- Setting up process group for: env:// [rank=1, world_size=2]
ray status
======== Autoscaler status: 2022-03-20 15:15:54.097826 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_f886735e767b5511ec1b4f02170909892613622b5a9cc8118402f255
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 2.0/16.0 CPU (2.0 used of 2.0 reserved in placement groups)
 0.00/27.536 GiB memory
 0.00/2.000 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 20 (PACK): 1+ pending placement groups

Not sure then how to clear the existing demands when trainer.run fails

Hey @sangcho, is there a way to clear placement group requests other than calling remove_placement_group? For reference, the creating actor has already been dereferenced and the placement group request still shows in ray status.
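
For context, the relevant utilities look roughly like this (a sketch of the ray.util placement group API in Ray 1.x; note that remove_placement_group needs a PlacementGroup handle, which is exactly what we no longer have once the creating actor is gone):

    import ray
    from ray.util.placement_group import (
        placement_group,
        placement_group_table,
        remove_placement_group,
    )

    # Connect to the existing cluster.
    ray.init(address="auto")

    # Inspect every placement group the cluster knows about, including ones
    # left behind by failed runs (shows bundles, strategy, and state).
    print(placement_group_table())

    # Removal requires a PlacementGroup handle; for a group you created
    # yourself that is straightforward:
    pg = placement_group([{"CPU": 1}], strategy="PACK")
    ray.get(pg.ready())
    remove_placement_group(pg)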

Nice idea! So we create a new trainer instance and let the old one get garbage collected.