AIR, TorchTrainer, DDP and NCCL Timeout

localh · August 16, 2023, 3:36pm

I have a 4 GPU single node and about 1.5 - 2 hours into training, I get this error:

[Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50966, OpType=BROADCAST, Timeout(ms)=3600000) ran for 3609559 milliseconds before timing out.

I have tried increasing the timeout and still get the same error. I noticed this thread with similar issues: TorchTrainer: Collective operation timeout: WorkNCCL - #2 by saivivek15, but I do not have NVLink on my node.

What other options are available?

matthewdeng · August 17, 2023, 5:29pm

Is the job training successfully up until this point? Is it hanging up until that point? Are you able to correlate the failure point (or the failure point - 3600000ms) to some event?

localh · August 17, 2023, 6:02pm

Indeed, it is training, reporting a decreasing loss, etc but it fails generally around that point. There is quite a bit of discussion about these types of errors in the Torch DDP community, but I am wondering if there tends to be an obvious thing to try or not.

Topic		Replies	Views
TorchTrainer: Collective operation timeout: WorkNCCL	2	1847	July 18, 2023
TorchTrainer hangs when only 1 worker raises error	15	1088	November 2, 2022
Any suggestions on how to debug the distributed torch trainer Dashboard, Monitoring & Debugging	7	901	June 9, 2021
Torch DDP backend performance issues Ray Tune	0	579	April 5, 2022
Hang when training across two machines	0	304	September 1, 2023

AIR, TorchTrainer, DDP and NCCL Timeout

Related topics