Slow Large-Scale Ingest w/Ray AIR (Ray Data + Ray Train)

Oh wow, this writeup is fantastic!

The tasks per node seem to suggest that, unlike before, scheduling is not happening according to the SPREAD strategy: some workers are processing as many as 3.5x more tasks than others. I’m wondering if this might be the root cause of the slowdown, and if so, is there a way to specify that I want the BatchMapper to spread the workload?

To be concise: the BatchMapper should already be using SPREAD, but the problem is most likely due to this issue, which is being actively looked into.
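If you did want to pin the scheduling strategy yourself, one rough sketch (not the BatchMapper API itself) is to apply the transform with `Dataset.map_batches`, which in recent Ray versions forwards extra keyword arguments as Ray remote args. The `normalize` function, the batch format, and the column layout below are assumptions for illustration:

```python
import ray

ds = ray.data.range_tensor(2000, shape=(32, 32))

def normalize(batch):
    # Batch format depends on the Ray version; assumed here to be a dict
    # of NumPy arrays keyed by column name.
    return {k: v / 255.0 for k, v in batch.items()}

# Extra kwargs to map_batches are forwarded as Ray remote args, so the
# scheduling strategy for the map tasks can be set explicitly.
ds = ds.map_batches(
    normalize,
    batch_format="numpy",
    scheduling_strategy="SPREAD",  # spread map tasks across nodes
)
```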

My batch delay seems a little high, so maybe I’d benefit from setting prefetch_blocks=10 or something like that.

One thing to be aware of here is that prefetching blocks consumes both memory and network bandwidth, so increasing this too much can also degrade performance!
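For reference, a minimal sketch of where that setting goes inside an AIR training loop. The dataset name "train", the batch size, and prefetch_blocks=10 are assumptions, and newer Ray versions rename the argument to prefetch_batches:

```python
from ray.air import session

def train_loop_per_worker(config):
    # Each Train worker gets its shard of the "train" dataset.
    shard = session.get_dataset_shard("train")
    for _ in range(config.get("num_epochs", 1)):
        for batch in shard.iter_batches(
            batch_size=4096,
            prefetch_blocks=10,  # fetch upcoming blocks while training on the current one
        ):
            pass  # forward/backward pass goes here
```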

Also, I’m currently setting the parallelism for my train dataset equal to its number of blocks (i.e., 2000). Would decreasing this to, say, 500 help reduce the number of tasks and improve performance?

Hmm, could you clarify this a bit? How many input files are you reading from? Decreasing the parallelism/number of blocks may certainly help with the issue you were seeing, but the caveat is that each block will be larger; depending on how large they get, this can lead to OOM during preprocessing (though it looks like your setup has ample memory to handle it).
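In case it helps, the block count is typically capped at read time. A rough sketch, where the S3 path and the parallelism value are purely illustrative:

```python
import ray

# Fewer, larger blocks -> fewer map tasks, but higher per-task memory use.
ds = ray.data.read_parquet(
    "s3://my-bucket/train/",  # hypothetical input path
    parallelism=500,
)
print(ds.num_blocks(), ds.size_bytes())
```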
