The timeout_s is set to 1800 seconds in ray/train/torch/config.py, which causes this error. In principle one can pass a torch_config to the TorchTrainer, but I couldn't figure out how to do it. Right now I just manually replace 1800 with a larger value. Any help on this would be appreciated.
You can set this in the TorchTrainer like so:
TorchTrainer(torch_config=TorchConfig(timeout_s=...))
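For reference, here is a minimal sketch of how that argument fits into a full trainer setup; train_fn, the worker count, and the 3600-second value are placeholder assumptions, not anything from the original post:

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

def train_fn(config):
    # Your per-worker training loop goes here.
    ...

trainer = TorchTrainer(
    train_fn,
    # Raise the torch.distributed process-group timeout from the 1800 s default.
    torch_config=TorchConfig(timeout_s=3600),
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()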
However, if your application times out after 30 minutes (1800 seconds), that usually indicates an unwanted hang somewhere in the program that is worth investigating.
Thanks. Extending the timeout actually solved my problem. It could be that the parallel scheme I'm using is not optimized. At this stage I will just let it be until it starts to become unbearable.