Is the job training successfully up until this point, or is it hanging before then? Are you able to correlate the failure point (or the failure point minus 3600000 ms, i.e. one hour earlier) with some event?
Indeed, it is training and reporting a decreasing loss, etc., but it generally fails around that point. There is quite a bit of discussion about these types of errors in the Torch DDP community, but I am wondering whether there is an obvious thing to try.
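One way to do the correlation asked about above is to subtract the 3600000 ms from the failure timestamp and look for logged events near that moment. The sketch below is only an illustration: it assumes the 3600000 ms figure is a timeout that elapsed before the failure was raised, and the function name `events_near_hang_start`, the five-minute window, and the sample events are all hypothetical, not taken from the actual job.

```python
from datetime import datetime, timedelta


def events_near_hang_start(failure_time, events, timeout_ms=3_600_000, window_s=300):
    """Return (timestamp, message) pairs logged within window_s seconds
    of (failure_time - timeout_ms).

    Assumption: the failure was raised timeout_ms after the job actually
    stalled, so the interesting events are the ones around that earlier time.
    """
    hang_start = failure_time - timedelta(milliseconds=timeout_ms)
    window = timedelta(seconds=window_s)
    return [(ts, msg) for ts, msg in events if abs(ts - hang_start) <= window]


# Hypothetical data: a failure at 13:00 and a few events parsed from job logs.
failure = datetime(2024, 1, 1, 13, 0, 0)
events = [
    (datetime(2024, 1, 1, 11, 30, 0), "epoch 3 started"),
    (datetime(2024, 1, 1, 11, 45, 0), "loss reported"),
    (datetime(2024, 1, 1, 11, 58, 30), "checkpoint save started on rank 0"),
]

suspects = events_near_hang_start(failure, events)
for ts, msg in suspects:
    print(ts.isoformat(), msg)
```

With the sample data, only the event logged shortly before 12:00 (one hour before the failure) falls inside the window. In practice the events would come from whatever the job already logs (checkpointing, evaluation, data loader restarts, and so on).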