Hang when training across two machines

  • High: It blocks me from completing my task.

I set up two nodes (one with a 3090 and one with a 1080) to train with TorchTrainer, but the job hangs after the following message:

Wrapping provided model in DistributedDataParallel.

The problem is that I can train the model on each node individually, but the job hangs when I run it across both nodes at the same time.

The code is from the example "Running Distributed Training of a PyTorch Model on Fashion MNIST with Ray Train".

I also notice a socket timeout error from the NCCL library.
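In case it is relevant, this is a sketch of the environment variables I plan to set on each node before launching, to get verbose NCCL diagnostics and pin the network interface the backends use. The interface name `eth0` here is only a placeholder for whatever `ip addr` actually reports on my machines:

```python
import os

# Placeholder: replace "eth0" with the real NIC name on each node
# (check with `ip addr` or `ifconfig`).
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # interface NCCL should use
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # same for the Gloo backend

print(os.environ["NCCL_DEBUG"])
```

My understanding is that if NCCL picks the wrong interface (e.g. a docker or loopback one), the workers cannot reach each other and eventually time out, so forcing the interface is worth trying.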

Any help would be appreciated.