How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying out the MNIST example from the Ray Train: Distributed Deep Learning — Ray 1.11.0 docs. I'm stuck on the message `Error: No available node types can fulfill resource request`, even though my manually created cluster has enough resources, as shown in the `ray status` output below.
```
======== Autoscaler status: 2022-03-17 07:07:07.314583 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_f1daa64a6cc101a788d809505aa3e4ae30388b547e6403bc96ccb0c7
 1 node_f311886e3779057012dae9c50ba25aeddd79355ed4972ce70f7bafad
 1 node_8504699f59ceea18865819a7afc4c8456085812a75d53cd281aab5a9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/48.0 CPU (0.0 used of 8.0 reserved in placement groups)
 0.0/4.0 GPU (0.0 used of 4.0 reserved in placement groups)
 0.0/2.0 accelerator_type:V100
 0.00/143.839 GiB memory
 0.00/27.940 GiB object_store_memory

Demands:
 {'GPU': 1.0, 'CPU': 8.0} * 4 (PACK): 1+ pending placement groups
 {'CPU': 1.0, 'cpu': 8.0, 'gpu': 1.0} * 4 (PACK): 1+ pending placement groups
 {'GPU': 1.0, 'CPU': 1.0} * 4 (PACK): 1+ pending placement groups
```
- Why is `{'GPU': 1.0, 'CPU': 8.0} * 4 (PACK): 1+ pending placement groups` pending?
I have 3 nodes:
a) 16 CPUs and 0 GPUs
b) 16 CPUs and 2 Nvidia V100 GPUs
c) 16 CPUs and 2 Nvidia V100 GPUs
By my arithmetic, the four `{GPU: 1, CPU: 8}` bundles should fit, two on each GPU node.
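For completeness, here is a minimal sketch of how the pending placement groups can be inspected from the driver, assuming `ray.util.placement_group_table()` reports the same groups that `ray status` shows:

```python
import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the existing cluster

# List every placement group the GCS knows about, with its state
# ("PENDING" vs. "CREATED") and the bundles it requested.
for pg_id, info in placement_group_table().items():
    print(pg_id, info["state"], info["bundles"])
```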
I used the code below to launch the training:

```python
from ray.train import Trainer

# train_func_distributed is the training function from the MNIST example
trainer = Trainer(backend="tensorflow", num_workers=4, resources_per_worker={"GPU": 1, "CPU": 8}, use_gpu=True)
trainer.start()
results = trainer.run(train_func_distributed)
# trainer.shutdown()
```
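For reference, a minimal sanity check of the advertised resources (assuming the script runs on one of the cluster nodes, so `address="auto"` attaches to it):

```python
import ray

ray.init(address="auto")  # attach to the manually created cluster

# The trainer needs 4 x {1 GPU, 8 CPUs}; compare against what the
# cluster advertises in total and what is currently unreserved.
print(ray.cluster_resources())
print(ray.available_resources())
```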
- And does `trainer.shutdown()` tear down my manually created cluster?
I'm using ray==1.10.0 with python==3.7.10 on all my nodes.