AIR, TorchTrainer, DDP and NCCL Timeout

I have a single node with 4 GPUs, and about 1.5 to 2 hours into training I get this error:

[Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50966, OpType=BROADCAST, Timeout(ms)=3600000) ran for 3609559 milliseconds before timing out.

I have tried increasing the timeout and still get the same error. I noticed this thread with similar issues: TorchTrainer: Collective operation timeout: WorkNCCL - #2 by saivivek15, but I do not have NVLink on my node.
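For context, this is roughly how I have been raising the timeout, by passing a `TorchConfig` into the trainer so the process group is created with a longer NCCL timeout. The training loop body is elided and the exact import paths may differ slightly between Ray versions, so treat this as a sketch rather than my full script:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig


def train_loop_per_worker(config):
    # model / optimizer / dataloader setup and the training loop go here
    ...


trainer = TorchTrainer(
    train_loop_per_worker,
    # Raise the process-group timeout above the 1800 s default;
    # the watchdog in the error above fired at 3600 s (1 hour).
    torch_config=TorchConfig(backend="nccl", timeout_s=7200),
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```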

What other options are available?

Is the job training successfully up until this point, or is it hanging somewhere before then? Are you able to correlate the failure point (or the point 3,600,000 ms earlier, i.e. when the timed-out broadcast was issued) with some event?

Indeed, it is training and reporting a decreasing loss, etc., but it generally fails around that point. There is quite a bit of discussion about these kinds of errors in the Torch DDP community, but I am wondering whether there is an obvious thing to try first.
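For what it's worth, the next thing I'm planning to try in order to get more context around the hang is turning on the standard NCCL / torch.distributed debug logging on the workers. This is just the usual PyTorch/NCCL environment variables passed through Ray's runtime environment, not anything Ray-specific, so the exact variables may need adjusting for your PyTorch version:

```python
import ray

# Sketch: surface more detail about the hanging collective on each worker.
ray.init(
    runtime_env={
        "env_vars": {
            "NCCL_DEBUG": "INFO",                 # verbose NCCL logging
            "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra collective consistency checks
            "NCCL_ASYNC_ERROR_HANDLING": "1",     # fail fast instead of hanging (default on in newer PyTorch)
        }
    }
)
```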