Supplying a lower timeout_s to TorchConfig helps, but I'd still expect Ray to raise the error immediately.
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_fn,
    scaling_config=ScalingConfig(num_workers=6, use_gpu=use_gpu),
    torch_config=TorchConfig(timeout_s=10),
)