How can I make sure that the map transform tasks run in parallel to get the best throughput?

emengjzs · October 9, 2024, 8:17am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am new to Ray. Inspired by the image in the article "Pattern: Using pipelining to increase throughput" (Ray 2.37.0 docs), I decided to use Ray Data to implement pipeline parallelism for a batch inference task.

I split the inference task into multiple map transforms. Each one processes the data, either through model inference or custom computation. The transforms are supposed to run in parallel, so that the Nth data row can be processed in task M while the (N-1)th row is processed in task M+1.
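To make the overlap I am expecting concrete, here is a minimal sketch in plain Python. It does not use Ray at all; it only uses threads and bounded queues from the standard library to illustrate the idea. The names `run_stage`, `run_pipeline`, `preprocess`, and `infer` are made up for this illustration and are not part of any Ray API.

```python
import queue
import threading
import time

SENTINEL = object()  # marks the end of the stream


def run_stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage that we are done
            return
        outbox.put(fn(item))


def run_pipeline(rows, stage_fns, queue_size=2):
    """Run stage_fns as concurrent stages connected by bounded queues."""
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]

    def feed():
        # Feed rows from a separate thread so the caller can drain the output.
        for row in rows:
            queues[0].put(row)
        queues[0].put(SENTINEL)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def preprocess(x):
    time.sleep(0.01)  # stand-in for custom computing
    return x + 1


def infer(x):
    time.sleep(0.01)  # stand-in for model inference
    return x * 2


if __name__ == "__main__":
    print(run_pipeline(range(5), [preprocess, infer]))
```

Because every stage starts working as soon as its first input arrives, `infer` handles row N-1 while `preprocess` is already working on row N. The bounded queues keep a fast stage from running far ahead of a slow one. This is the behavior I would like to get from the chained map transforms in Ray Data.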

But I found that Ray Data sometimes produces an execution plan in which the next map transform does not begin until the previous map transform has consumed all the data, which leaves the pipeline running at low throughput.

So my question is: is there any configuration or setting that controls the execution plan for the batch inference task, so that multiple map transforms can run in parallel and achieve maximum throughput?

