TorchTrainer Timed out waiting 1800000 ms for send operation to complete

wxie2013 · October 9, 2024, 2:38pm

The timeout_s is set to 1800 seconds in ray/train/torch/config.py, which causes this error. In principle one can pass a “torch_config” as an input to the TorchTrainer, but I couldn’t figure out how to do it. Right now, I just manually replace 1800 with a larger value. Any help on this is appreciated.

matthewdeng · October 9, 2024, 4:14pm

You can set this in the TorchTrainer like so:

TorchTrainer(torch_config=TorchConfig(timeout_s=...))

However, if your application times out after 30 min (1800 sec) that usually indicates there’s an unwanted hang in the program that could be worth investigating.

wxie2013 · October 10, 2024, 4:09pm

Thanks. Extending the timeout actually solved my problem. It could be that the parallel scheme I’m using is not optimized. At this stage, I will just let it be until it starts to be unbearable.
