Distributed PyTorch on a cluster

Hello,

I am trying to run a workload based on the example code below on a bare-metal cluster:

https://docs.ray.io/en/master/raysgd/raysgd_pytorch.html

My cluster has the following resources:

Cluster status: 2021-06-03 15:57:58.357332
Node status

1 node(s) with resources: {'node:10.187.57.59': 1.0, 'GPU': 1.0, 'object_store_memory': 20000000000.0, 'memory': 504794141696.0, 'CPU': 32.0, 'accelerator_type:T': 1.0}
1 node(s) with resources: {'object_store_memory': 20000000000.0, 'memory': 514413107200.0, 'CPU': 1.0, 'accelerator_type:T': 1.0, 'node:9.59.194.22': 1.0}

Resources

Usage:
0.0/33.0 CPU
0.0/1.0 GPU
0.0/2.0 accelerator_type:T
0.00/949.211 GiB memory
0.00/37.253 GiB object_store_memory

Demands:
(no resource demands)

When I connect to this existing cluster, none of the tasks get scheduled:

Running user workload: python pytorch_trainer_latest.py
2021-06-03 15:58:08,070 INFO worker.py:641 -- Connecting to existing Ray cluster at address: X.X.194.59:1506
(pid=14703) 2021-06-03 15:58:26,207 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.59:40114 [rank=0]
2021-06-03 15:58:39,830 WARNING worker.py:1115 -- The actor or task with ID ffffffffffffffffb7e8c2181be8376a05f1f54e01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 470.126180 GiB/470.126180 GiB memory, 0.000000/1.000000 GPU, 18.626451 GiB/18.626451 GiB object_store_memory, 1.000000/1.000000 accelerator_type:T, 1.000000/1.000000 node:10.187.57.59}
. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
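For reference, the resources Ray reports from the driver side can be printed with something like this (just a generic check, nothing specific to my script):

```python
import ray

# Attach to the already-running cluster rather than starting a local one.
ray.init(address="auto")

print(ray.cluster_resources())    # total resources registered by all nodes
print(ray.available_resources())  # what is currently free for new actors/tasks
```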

Can you please help? How can I run the example code?

thanks

Can you post the code of your TorchTrainer(...) call?

There was an issue with the NCCL setup; I reinstalled NCCL with this command:

conda install pytorch torchvision -c pytorch
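After reinstalling, a quick sanity check (just a sketch of what can be verified, not specific to the example) that the PyTorch build sees the GPU and ships with NCCL:

```python
import torch
import torch.distributed as dist

print(torch.__version__, torch.version.cuda)  # PyTorch / CUDA versions of the conda build
print(torch.cuda.is_available())              # True if a GPU is visible to this process
print(dist.is_nccl_available())               # True if this PyTorch build includes NCCL
print(torch.cuda.nccl.version())              # version of the bundled NCCL library
```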

Hi @rliaw, I am running this example:
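The relevant part of my script looks roughly like the following (a simplified sketch of the linked RaySGD example; the operator name, toy model, and data here are placeholders, not my exact code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray
from ray.util.sgd import TorchTrainer
from ray.util.sgd.torch import TrainingOperator


class MyTrainingOperator(TrainingOperator):
    """Placeholder operator; registers a toy model, optimizer, loss, and data."""

    def setup(self, config):
        model = nn.Linear(1, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))
        criterion = nn.MSELoss()
        x = torch.randn(1024, 1)
        y = 2 * x + 1
        train_loader = DataLoader(TensorDataset(x, y),
                                  batch_size=config.get("batch_size", 32))
        self.model, self.optimizer, self.criterion = self.register(
            models=model, optimizers=optimizer, criterion=criterion)
        self.register_data(train_loader=train_loader, validation_loader=None)


ray.init(address="auto")  # connect to the existing cluster

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    num_workers=2,          # number of distributed training workers
    use_gpu=True,           # driven by the --use-gpu flag in my script
    config={"lr": 1e-2, "batch_size": 32},
)

for _ in range(5):
    print(trainer.train())  # one pass over the training data per call
trainer.shutdown()
```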

The example runs fine on 1 GPU, but when I have a Ray cluster with 2 GPUs I get the message below and the training never completes:

======== Cluster status: 2021-06-07 09:57:15.615773 ========
Node status

1 node(s) with resources: {'CPU': 32.0, 'GPU': 1.0, 'accelerator_type:T': 1.0, 'memory': 493954311168.0, 'object_store_memory': 20000000000.0, 'node:10.187.57.22': 1.0}
1 node(s) with resources: {'CPU': 32.0, 'GPU': 1.0, 'accelerator_type:T': 1.0, 'memory': 504701589504.0, 'object_store_memory': 20000000000.0, 'node:9.59.194.177': 1.0}

Resources

Usage:
0.0/64.0 CPU
0.0/2.0 GPU
0.0/2.0 accelerator_type:T
0.00/930.071 GiB memory
0.00/37.253 GiB object_store_memory

Demands:
(no resource demands)

Running user workload: python pytorch_trainer_latest.py --use-gpu
2021-06-07 09:57:28,646 INFO worker.py:641 -- Connecting to existing Ray cluster at address: X.X9.194.22:31348
(pid=22508) 2021-06-07 09:57:41,709 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.22:53970 [rank=0]
(pid=6453, ip=10.187.57.177) 2021-06-07 09:57:53,839 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.22:53970 [rank=1]
2021-06-07 09:58:12,674 WARNING worker.py:1115 -- The actor or task with ID ffffffffffffffff9785528d0f729743790a039c01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 470.039984 GiB/470.039984 GiB memory, 0.000000/1.000000 GPU, 18.626451 GiB/18.626451 GiB object_store_memory, 1.000000/1.000000 node:X.X9.194.177, 1.000000/1.000000 accelerator_type:T}
. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Can you please suggest how to proceed with the training?

I was not setting num_workers correctly; with that fixed, the issue is resolved. Thanks for all your help.
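For anyone who lands here with the same symptom, the change on my side was essentially to make num_workers match the number of GPU workers the cluster can actually place, something along these lines:

```python
import ray
from ray.util.sgd import TorchTrainer

ray.init(address="auto")
num_gpus = int(ray.cluster_resources().get("GPU", 0))  # 2 on this cluster

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,  # same operator sketched above
    num_workers=num_gpus,  # one GPU-backed worker per GPU actually available
    use_gpu=True,
)
```

With use_gpu=True each worker asks for 1 CPU and 1 GPU, so requesting more workers than there are GPUs leaves actors pending, which is exactly what the warning above was reporting.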