However, it is streaming very slowly now, so while memory is better managed, I still can't get good GPU utilization.
How's the CPU and network utilization? I would guess there's a resource bottleneck now, such as not enough tasks reading data in parallel. Merging the read and preprocessing could cause this if the read needed more parallelism than the preprocessing; increasing the prefetch wouldn't help in that case.
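As a back-of-the-envelope illustration of why fusing the two steps can hurt (the numbers and helper functions here are hypothetical, not from Ray itself): a fused stage runs both steps in every task, so an I/O-bound read that wants wide parallelism gets stuck at whatever parallelism the preprocessing needs, while splitting the stages lets each run at its own width.

```python
# Hypothetical throughput model comparing a fused read+preprocess
# stage against two separate, pipelined stages.

def fused_throughput(num_tasks, rows_per_task, read_s, preprocess_s):
    """Rows/sec when read and preprocess share the same tasks:
    each task pays the sum of both step times."""
    return num_tasks * rows_per_task / (read_s + preprocess_s)

def split_throughput(read_tasks, prep_tasks, rows_per_task, read_s, preprocess_s):
    """Rows/sec when each step gets its own pool of tasks:
    the slower stage bounds overall throughput."""
    read_tp = read_tasks * rows_per_task / read_s
    prep_tp = prep_tasks * rows_per_task / preprocess_s
    return min(read_tp, prep_tp)

# Suppose the read is I/O-bound (8 s per 1000-row block) and the
# preprocessing is cheap (1 s), so the read needs far more parallelism.
fused = fused_throughput(num_tasks=4, rows_per_task=1000, read_s=8.0, preprocess_s=1.0)
split = split_throughput(read_tasks=16, prep_tasks=4, rows_per_task=1000, read_s=8.0, preprocess_s=1.0)
print(round(fused))  # ~444 rows/s: 4 tasks each paying read + preprocess
print(round(split))  # 2000 rows/s: the wider read pool keeps the pipeline fed
```

The numbers are made up, but the shape of the argument matches the symptom: if the read alone needs more tasks than the fused stage provides, the GPU starves no matter how much you prefetch.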
Follow-up question on the streaming flow: does it wait for the entire allocated object store to fill up before passing the next batch to `iter_batches`, or does that happen as soon as there are enough rows to fulfill a batch?
It should be the latter.
If it does pass on the batch as soon as there is enough data, does it do this even when that condition is fulfilled halfway through a block? In other words, does setting a small batch size on `map_batches` help?
It has to fetch at least one block before data can be returned. So say you set `prefetch_batches=10`, the batch size is 64, and each block is 1000 rows: the prefetch target is 640 rows, so one block would be prefetched at a time. If the blocks are very small, a lot more would get prefetched.
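To make that arithmetic concrete (a sketch of the block-granularity behavior described above; the helper function is hypothetical, not a Ray API):

```python
import math

def blocks_prefetched(prefetch_batches, batch_size, block_rows):
    """Whole blocks needed to cover the prefetch target, assuming data
    can only be fetched a block at a time and at least one block is
    always fetched before any data is returned."""
    target_rows = prefetch_batches * batch_size
    return max(1, math.ceil(target_rows / block_rows))

# prefetch_batches=10, batch_size=64 -> target of 640 rows.
print(blocks_prefetched(10, 64, block_rows=1000))  # 1: a single 1000-row block covers it
print(blocks_prefetched(10, 64, block_rows=50))    # 13: small blocks mean many more fetches
```

This is why block size, not batch size, sets the floor on latency here: even with a tiny batch size, nothing is returned until the first full block has been fetched.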