This has been painful to diagnose, but I think I have cornered the issue:
My TorchTrainer reliably crashes at the same step every run, and that step corresponds to the window boundary of the DatasetPipeline it is streaming my large dataset through.
After crunching through the first window, I get errors like this:
Warning: reader on shard 18 of the pipeline has been blocked more than 2560.0s waiting for other readers to catch up. All pipeline shards must be read from concurrently.
[Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2006, OpType=BROADCAST, Timeout(ms)=5400000) ran for 5400048 milliseconds before timing out
From the output below I infer that I have 10 windows, and 10 windows * (batch_size * steps per window) ≈ my training dataset size.
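To sanity-check that arithmetic, here is a minimal sketch of the window-boundary calculation. The dataset size and batch size below are hypothetical placeholders (only the 10-window count comes from my logs); the point is that the crash step should line up with rows_per_window / global_batch_size.

```python
# Hypothetical numbers -- only num_windows is taken from the pipeline log.
total_rows = 1_000_000      # assumed total training rows
num_windows = 10            # from "Created DatasetPipeline with 10 windows"
global_batch_size = 256     # assumed rows consumed per training step

rows_per_window = total_rows // num_windows
steps_per_window = rows_per_window // global_batch_size

# If the trainer always dies at roughly this step, the crash is happening
# at the transition from one pipeline window to the next.
print(steps_per_window)  # → 390 with these placeholder numbers
```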
(TorchTrainer pid=158165, ip=10.0.0.8) Created DatasetPipeline with 10 windows: 134.91GiB min, 184.61GiB max, 179.56GiB mean
(TorchTrainer pid=158165, ip=10.0.0.8) Blocks per window: 588 min, 738 max, 715 mean
(TorchTrainer pid=158165, ip=10.0.0.8) ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster.
(TorchTrainer pid=158165, ip=10.0.0.8) ✔️ This pipeline's windows likely fit in object store memory without spilling.
How can I use Ray to ensure that all workers get their shard of data?
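For context on what I understand so far: the "All pipeline shards must be read from concurrently" warning suggests that when shards are unequal in size, the workers that finish their shard early block at a collective op (hence the NCCL broadcast timeout) waiting for the straggler. My understanding is that Ray's DatasetPipeline.split(n, equal=True) exists to prevent exactly this by forcing equally sized shards. Since I can't paste my full trainer here, below is a pure-Python sketch of just the equal-splitting idea (the function name and numbers are mine, not Ray's; 715 is the mean blocks-per-window from my logs and 32 workers is an assumption):

```python
# Sketch of the equal-split idea behind DatasetPipeline.split(n, equal=True):
# every worker must receive a shard of (nearly) the same size, otherwise the
# fast readers stall at the next collective op waiting for the slow one.

def split_equal(num_items: int, num_shards: int) -> list[int]:
    """Distribute num_items across num_shards so shard sizes differ by at most 1."""
    base, rem = divmod(num_items, num_shards)
    return [base + (1 if i < rem else 0) for i in range(num_shards)]

shards = split_equal(715, 32)   # 715 mean blocks per window, 32 workers (assumed)
assert sum(shards) == 715       # nothing dropped
print(max(shards) - min(shards))  # → at most 1, so no worker runs long
```

Is the right fix here to make sure every split is consumed concurrently by its worker, with equal=True set so no shard runs long?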