Slow Large-Scale Ingest w/Ray AIR (Ray Data + Ray Train)

Oh wow, this writeup is fantastic!

The tasks per node seem to suggest that, unlike before, scheduling is not happening according to the SPREAD strategy: some workers are processing as many as 3.5x more tasks than others. I’m wondering if this might be the root cause of the slowdown, and if so, is there a way to specify that I want the BatchMapper to spread the workload?

To be concise: the BatchMapper should already be using SPREAD, but the problem is most likely due to this issue, which is being actively looked into.
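If you did want to pin the scheduling strategy yourself, one rough sketch (not the BatchMapper API itself) is to apply the transform with `Dataset.map_batches`, which in recent Ray versions forwards extra keyword arguments as Ray remote args. The `normalize` function, the batch format, and the column layout below are assumptions for illustration:

```python
import ray

ds = ray.data.range_tensor(2000, shape=(32, 32))

def normalize(batch):
    # Batch format depends on the Ray version; assumed here to be a dict
    # of NumPy arrays keyed by column name.
    return {k: v / 255.0 for k, v in batch.items()}

# Extra kwargs to map_batches are forwarded as Ray remote args, so the
# scheduling strategy for the map tasks can be set explicitly.
ds = ds.map_batches(
    normalize,
    batch_format="numpy",
    scheduling_strategy="SPREAD",  # spread map tasks across nodes
)
```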

My batch delay seems a little high, so maybe I’d benefit from setting prefetch_blocks=10 or something like that.

One thing to be aware of here is that prefetching blocks consumes both memory and network bandwidth, so increasing this too much can also degrade performance!
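For reference, a minimal sketch of where that setting goes inside an AIR training loop. The dataset name "train", the batch size, and prefetch_blocks=10 are assumptions, and newer Ray versions rename the argument to prefetch_batches:

```python
from ray.air import session

def train_loop_per_worker(config):
    # Each Train worker gets its shard of the "train" dataset.
    shard = session.get_dataset_shard("train")
    for _ in range(config.get("num_epochs", 1)):
        for batch in shard.iter_batches(
            batch_size=4096,
            prefetch_blocks=10,  # fetch upcoming blocks while training on the current one
        ):
            pass  # forward/backward pass goes here
```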

Also, I’m currently setting the parallelism for my train dataset equal to its number of blocks (i.e., 2000). Would decreasing this to, say, 500 help reduce the number of tasks and improve performance?

Hmm, could you clarify this a bit? How many input files are you reading from? Decreasing the parallelism/number of blocks may certainly help with the issue you were seeing, but the caveat is that each block will be larger; depending on how large they get, this can lead to OOM during preprocessing (though it looks like your setup has ample memory to handle it).
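In case it helps, the block count is typically capped at read time. A rough sketch, where the S3 path and the parallelism value are purely illustrative:

```python
import ray

# Fewer, larger blocks -> fewer map tasks, but higher per-task memory use.
ds = ray.data.read_parquet(
    "s3://my-bucket/train/",  # hypothetical input path
    parallelism=500,
)
print(ds.num_blocks(), ds.size_bytes())
```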
