Hello,
I am trying to run a workload based on the example code at the link below on a bare-metal cluster:
https://docs.ray.io/en/master/raysgd/raysgd_pytorch.html
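My script (pytorch_trainer_latest.py) follows the TorchTrainer pattern from that page. Roughly, it looks like the sketch below (simplified; the model, data, and num_workers value here are placeholders, not my exact code):

```python
import torch
import torch.nn as nn
import ray
from ray.util.sgd import TorchTrainer
from ray.util.sgd.torch import TrainingOperator


def model_creator(config):
    # Placeholder model standing in for the real one.
    return nn.Linear(1, 1)


def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))


def data_creator(config):
    # Placeholder dataset; returns (train_loader, val_loader).
    x = torch.randn(1000, 1)
    y = 2 * x
    dataset = torch.utils.data.TensorDataset(x, y)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size", 32))
    val_loader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size", 32))
    return train_loader, val_loader


CustomOperator = TrainingOperator.from_creators(
    model_creator, optimizer_creator, data_creator, loss_creator=nn.MSELoss
)

ray.init(address="auto")  # connect to the existing cluster

trainer = TorchTrainer(
    training_operator_cls=CustomOperator,
    num_workers=4,          # placeholder; each worker runs as one Ray actor
    use_gpu=True,
    config={"lr": 1e-2, "batch_size": 32},
)

for _ in range(5):
    print(trainer.train())
trainer.shutdown()
```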
I have the following resources on my cluster:
Cluster status: 2021-06-03 15:57:58.357332
Node status
1 node(s) with resources: {'node:10.187.57.59': 1.0, 'GPU': 1.0, 'object_store_memory': 20000000000.0, 'memory': 504794141696.0, 'CPU': 32.0, 'accelerator_type:T': 1.0}
1 node(s) with resources: {'object_store_memory': 20000000000.0, 'memory': 514413107200.0, 'CPU': 1.0, 'accelerator_type:T': 1.0, 'node:9.59.194.22': 1.0}
Resources
Usage:
0.0/33.0 CPU
0.0/1.0 GPU
0.0/2.0 accelerator_type:T
0.00/949.211 GiB memory
0.00/37.253 GiB object_store_memory
Demands:
(no resource demands)
When I connect to this existing cluster, none of the tasks get scheduled:
Running user workload: python pytorch_trainer_latest.py
2021-06-03 15:58:08,070 INFO worker.py:641 -- Connecting to existing Ray cluster at address: X.X.194.59:1506
(pid=14703) 2021-06-03 15:58:26,207 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.59:40114 [rank=0]
2021-06-03 15:58:39,830 WARNING worker.py:1115 -- The actor or task with ID ffffffffffffffffb7e8c2181be8376a05f1f54e01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 470.126180 GiB/470.126180 GiB memory, 0.000000/1.000000 GPU, 18.626451 GiB/18.626451 GiB object_store_memory, 1.000000/1.000000 accelerator_type:T, 1.000000/1.000000 node:10.187.57.59}. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
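From the warning it looks like each training worker actor asks for 1 CPU and 1 GPU, while the cluster only has a single GPU in total, so only one such actor can ever be placed and the rest stay pending. My guess (unverified) is that the trainer has to be sized to match, something like:

```python
# Unverified guess: keep the number of GPU worker actors within the one GPU
# that the cluster actually has.
trainer = TorchTrainer(
    training_operator_cls=CustomOperator,
    num_workers=1,   # only one GPU is available across both nodes
    use_gpu=True,
)
```

Or is the intended way to set use_gpu=False and scale the workers out over the 32 CPUs instead?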
Can you please help? How can I run the example code on this cluster?
Thanks