I am training on a single node with 4 GPUs, and about 1.5 to 2 hours into training I get this error:
[Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50966, OpType=BROADCAST, Timeout(ms)=3600000) ran for 3609559 milliseconds before timing out.
I have tried increasing the timeout but still get the same error. I noticed this thread describing similar issues: TorchTrainer: Collective operation timeout: WorkNCCL - #2 by saivivek15, but I do not have NVLink on my node.
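For reference, this is roughly how I am raising the timeout via `torch.distributed.init_process_group` (a minimal sketch of my setup, not the exact script; the value shown matches the 3600000 ms in the log above):

```python
# Sketch of my process group setup (assumed to mirror my actual script).
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # Raised from the 30-minute default; matches Timeout(ms)=3600000 in the log.
    timeout=timedelta(hours=1),
)
```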
What other options are available?