TorchTrainer: Collective operation timeout: WorkNCCL
saivivek15
July 17, 2023, 4:43pm
@kai @xwjiang2010, could you please suggest a workaround for this issue?
AIR, TorchTrainer, DDP and NCCL Timeout
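
One workaround worth trying is to raise the NCCL collective-operation timeout through Ray Train's `TorchConfig`, which accepts a `timeout_s` argument (default 1800 seconds). Below is a minimal sketch against the AIR-era `TorchTrainer` API; the `timeout_s` value, worker count, and empty training loop are placeholder assumptions, not code from this thread:

```python
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    # Placeholder: the actual DDP training code from the affected job
    # would go here.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    # timeout_s controls how long NCCL collectives may block before the
    # WorkNCCL timeout fires; 3600 here is an illustrative value.
    torch_config=TorchConfig(backend="nccl", timeout_s=3600),
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```

Note that raising the timeout only masks slow collectives; if one worker dies or deadlocks, the others will still time out eventually, so it is also worth checking the per-worker logs for an underlying failure.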
Related topics

| Topic | Category | Replies | Views | Activity |
| --- | --- | --- | --- | --- |
| Errors when test TorchTrainer with the "getting started" code | Ray Train | 1 | 522 | October 1, 2021 |
| TorchTrainer hangs when only 1 worker raises error | | 15 | 1004 | November 2, 2022 |
| Segfault in torchtrainer for num_workers > 0 in dataloader | | 1 | 603 | April 9, 2023 |
| Get distributed process group timeout when using torch trainer + FullSyncIterDatapipe | Ray Train | 5 | 692 | December 20, 2022 |
| CUDA-capable device(s) is/are busy or unavailable | Ray Clusters | 1 | 912 | February 1, 2023 |