- High: It blocks me from completing my task.
I set up two nodes, one with an RTX 3090 and one with a GTX 1080, to train with TorchTrainer, but the job hangs right after this line:
Wrapping provided model in DistributedDataParallel.
The problem is that I can train the model on each node individually, but I can't run the job across both nodes at the same time.
The code is from the example "Running Distributed Training of a PyTorch Model on Fashion MNIST with Ray Train".
I also notice a socket timeout error from the NCCL library.
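For reference, these are the NCCL debug settings I'm planning to try before launching the job on each node; `eth0` is just a guess at the interface name, and whether InfiniBand needs disabling depends on the hardware:

```shell
# Print NCCL's internal logs so the hang/timeout shows which step fails
export NCCL_DEBUG=INFO

# Pin NCCL to the NIC both nodes can actually reach each other on
# (assumption: eth0 is the shared Ethernet interface; check with `ip addr`)
export NCCL_SOCKET_IFNAME=eth0

# Assumption: neither node has InfiniBand, so force plain TCP sockets
export NCCL_IB_DISABLE=1
```

With `NCCL_DEBUG=INFO` set, the worker logs should show which interface NCCL picked and where the connection stalls, which may explain the socket timeout.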
Any help would be appreciated.