Hang when training across two machines

  • High: It blocks me from completing my task.

I set up two nodes (one with a 3090 and one with a 1080) to train with TorchTrainer, but the job hangs after the following message:

Wrapping provided model in DistributedDataParallel.

The problem is that I can train the model on each node individually, but the job hangs when I run it across both nodes at the same time.

The code is from the example "Running Distributed Training of a PyTorch Model on Fashion MNIST with Ray Train".

I also notice a socket timeout error from the NCCL library.
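In case it is relevant, this is a sketch of the environment variables I plan to set on each node before launching, to get verbose NCCL diagnostics and pin the network interface the backends use. The interface name `eth0` here is only a placeholder for whatever `ip addr` actually reports on my machines:

```python
import os

# Placeholder: replace "eth0" with the real NIC name on each node
# (check with `ip addr` or `ifconfig`).
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # interface NCCL should use
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # same for the Gloo backend

print(os.environ["NCCL_DEBUG"])
```

My understanding is that if NCCL picks the wrong interface (e.g. a docker or loopback one), the workers cannot reach each other and eventually time out, so forcing the interface is worth trying.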

Any help would be appreciated.