However, it is streaming very slowly now, so while memory is better managed, I still can't get good GPU utilization.
How's the CPU and network utilization? I would guess there's a resource bottleneck now, such as not enough tasks reading data in parallel. Merging the read and preprocessing could cause this if the read needed more parallelism than the preprocessing; increasing the prefetch wouldn't help in that case.
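As a back-of-the-envelope illustration of why fusing the two steps can hurt (the numbers and helper functions here are hypothetical, not from Ray itself): a fused stage runs both steps in every task, so an I/O-bound read that wants wide parallelism gets stuck at whatever parallelism the preprocessing needs, while splitting the stages lets each run at its own width.

```python
# Hypothetical throughput model comparing a fused read+preprocess
# stage against two separate, pipelined stages.

def fused_throughput(num_tasks, rows_per_task, read_s, preprocess_s):
    """Rows/sec when read and preprocess share the same tasks:
    each task pays the sum of both step times."""
    return num_tasks * rows_per_task / (read_s + preprocess_s)

def split_throughput(read_tasks, prep_tasks, rows_per_task, read_s, preprocess_s):
    """Rows/sec when each step gets its own pool of tasks:
    the slower stage bounds overall throughput."""
    read_tp = read_tasks * rows_per_task / read_s
    prep_tp = prep_tasks * rows_per_task / preprocess_s
    return min(read_tp, prep_tp)

# Suppose the read is I/O-bound (8 s per 1000-row block) and the
# preprocessing is cheap (1 s), so the read needs far more parallelism.
fused = fused_throughput(num_tasks=4, rows_per_task=1000, read_s=8.0, preprocess_s=1.0)
split = split_throughput(read_tasks=16, prep_tasks=4, rows_per_task=1000, read_s=8.0, preprocess_s=1.0)
print(round(fused))  # ~444 rows/s: 4 tasks each paying read + preprocess
print(round(split))  # 2000 rows/s: the wider read pool keeps the pipeline fed
```

The numbers are made up, but the shape of the argument matches the symptom: if the read alone needs more tasks than the fused stage provides, the GPU starves no matter how much you prefetch.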
Follow-up question on the streaming flow: does it wait for the entire allocated object store to fill up before passing the next batch to `iter_batches`, or does that happen as soon as there are enough rows to fulfill a batch?
It should be the latter.
If it does pass on the batch as soon as there is enough data, does it do this even when that condition is fulfilled halfway through a block? In other words, does setting a small batch size on `map_batches` help?
It has to fetch at least one block before data can be returned. So say you set `prefetch_batches=10`, the batch size is 64, and each block is 1000 rows: the prefetch target is 640 rows, so one block would be prefetched at a time. If the blocks are very small, a lot more would get prefetched.
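To make that arithmetic concrete (a sketch of the block-granularity behavior described above; the helper function is hypothetical, not a Ray API):

```python
import math

def blocks_prefetched(prefetch_batches, batch_size, block_rows):
    """Whole blocks needed to cover the prefetch target, assuming data
    can only be fetched a block at a time and at least one block is
    always fetched before any data is returned."""
    target_rows = prefetch_batches * batch_size
    return max(1, math.ceil(target_rows / block_rows))

# prefetch_batches=10, batch_size=64 -> target of 640 rows.
print(blocks_prefetched(10, 64, block_rows=1000))  # 1: a single 1000-row block covers it
print(blocks_prefetched(10, 64, block_rows=50))    # 13: small blocks mean many more fetches
```

This is why block size, not batch size, sets the floor on latency here: even with a tiny batch size, nothing is returned until the first full block has been fetched.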