Distributed PyTorch on a cluster

Hello,

I am trying to run a workload based on the example code below on a bare-metal cluster:

https://docs.ray.io/en/master/raysgd/raysgd_pytorch.html

My cluster has the following resources:

Cluster status: 2021-06-03 15:57:58.357332
Node status

1 node(s) with resources: {'node:10.187.57.59': 1.0, 'GPU': 1.0, 'object_store_memory': 20000000000.0, 'memory': 504794141696.0, 'CPU': 32.0, 'accelerator_type:T': 1.0}
1 node(s) with resources: {'object_store_memory': 20000000000.0, 'memory': 514413107200.0, 'CPU': 1.0, 'accelerator_type:T': 1.0, 'node:9.59.194.22': 1.0}

Resources

Usage:
0.0/33.0 CPU
0.0/1.0 GPU
0.0/2.0 accelerator_type:T
0.00/949.211 GiB memory
0.00/37.253 GiB object_store_memory

Demands:
(no resource demands)

When I connect to this existing cluster, none of the tasks get scheduled:

Running user workload: python pytorch_trainer_latest.py
2021-06-03 15:58:08,070 INFO worker.py:641 -- Connecting to existing Ray cluster at address: X.X.194.59:1506
(pid=14703) 2021-06-03 15:58:26,207 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.59:40114 [rank=0]
2021-06-03 15:58:39,830 WARNING worker.py:1115 -- The actor or task with ID ffffffffffffffffb7e8c2181be8376a05f1f54e01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 470.126180 GiB/470.126180 GiB memory, 0.000000/1.000000 GPU, 18.626451 GiB/18.626451 GiB object_store_memory, 1.000000/1.000000 accelerator_type:T, 1.000000/1.000000 node:10.187.57.59}
. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
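For reference, the resources Ray reports from the driver side can be printed with something like this (just a generic check, nothing specific to my script):

```python
import ray

# Attach to the already-running cluster rather than starting a local one.
ray.init(address="auto")

print(ray.cluster_resources())    # total resources registered by all nodes
print(ray.available_resources())  # what is currently free for new actors/tasks
```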

Can you please help? How can I run the example code?

thanks

Can you post the code of your TorchTrainer(...) call?

There was an issue with the NCCL setup; I reinstalled NCCL with this command:

conda install pytorch torchvision -c pytorch
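After reinstalling, a quick sanity check (just a sketch of what can be verified, not specific to the example) that the PyTorch build sees the GPU and ships with NCCL:

```python
import torch
import torch.distributed as dist

print(torch.__version__, torch.version.cuda)  # PyTorch / CUDA versions of the conda build
print(torch.cuda.is_available())              # True if a GPU is visible to this process
print(dist.is_nccl_available())               # True if this PyTorch build includes NCCL
print(torch.cuda.nccl.version())              # version of the bundled NCCL library
```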

Hi @rliaw, I am running this example:
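The relevant part of my script looks roughly like the following (a simplified sketch of the linked RaySGD example; the operator name, toy model, and data here are placeholders, not my exact code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray
from ray.util.sgd import TorchTrainer
from ray.util.sgd.torch import TrainingOperator


class MyTrainingOperator(TrainingOperator):
    """Placeholder operator; registers a toy model, optimizer, loss, and data."""

    def setup(self, config):
        model = nn.Linear(1, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))
        criterion = nn.MSELoss()
        x = torch.randn(1024, 1)
        y = 2 * x + 1
        train_loader = DataLoader(TensorDataset(x, y),
                                  batch_size=config.get("batch_size", 32))
        self.model, self.optimizer, self.criterion = self.register(
            models=model, optimizers=optimizer, criterion=criterion)
        self.register_data(train_loader=train_loader, validation_loader=None)


ray.init(address="auto")  # connect to the existing cluster

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    num_workers=2,          # number of distributed training workers
    use_gpu=True,           # driven by the --use-gpu flag in my script
    config={"lr": 1e-2, "batch_size": 32},
)

for _ in range(5):
    print(trainer.train())  # one pass over the training data per call
trainer.shutdown()
```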

The example runs fine on 1 GPU, but when I have a Ray cluster with 2 GPUs I get the message below and the training never completes:

======== Cluster status: 2021-06-07 09:57:15.615773 ========
Node status

1 node(s) with resources: {'CPU': 32.0, 'GPU': 1.0, 'accelerator_type:T': 1.0, 'memory': 493954311168.0, 'object_store_memory': 20000000000.0, 'node:10.187.57.22': 1.0}
1 node(s) with resources: {'CPU': 32.0, 'GPU': 1.0, 'accelerator_type:T': 1.0, 'memory': 504701589504.0, 'object_store_memory': 20000000000.0, 'node:9.59.194.177': 1.0}

Resources

Usage:
0.0/64.0 CPU
0.0/2.0 GPU
0.0/2.0 accelerator_type:T
0.00/930.071 GiB memory
0.00/37.253 GiB object_store_memory

Demands:
(no resource demands)

Running user workload: python pytorch_trainer_latest.py --use-gpu
2021-06-07 09:57:28,646 INFO worker.py:641 -- Connecting to existing Ray cluster at address: X.X9.194.22:31348
(pid=22508) 2021-06-07 09:57:41,709 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.22:53970 [rank=0]
(pid=6453, ip=10.187.57.177) 2021-06-07 09:57:53,839 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.187.57.22:53970 [rank=1]
2021-06-07 09:58:12,674 WARNING worker.py:1115 -- The actor or task with ID ffffffffffffffff9785528d0f729743790a039c01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 470.039984 GiB/470.039984 GiB memory, 0.000000/1.000000 GPU, 18.626451 GiB/18.626451 GiB object_store_memory, 1.000000/1.000000 node:X.X9.194.177, 1.000000/1.000000 accelerator_type:T}
. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Can you please suggest how to proceed with the training?

I was not setting num_workers correctly; with that fixed, the issue is resolved. Thanks for all your help.
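For anyone who lands here with the same symptom, the change on my side was essentially to make num_workers match the number of GPU workers the cluster can actually place, something along these lines:

```python
import ray
from ray.util.sgd import TorchTrainer

ray.init(address="auto")
num_gpus = int(ray.cluster_resources().get("GPU", 0))  # 2 on this cluster

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,  # same operator sketched above
    num_workers=num_gpus,  # one GPU-backed worker per GPU actually available
    use_gpu=True,
)
```

With use_gpu=True each worker asks for 1 CPU and 1 GPU, so requesting more workers than there are GPUs leaves actors pending, which is exactly what the warning above was reporting.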