I am experimenting with Ray Data for batch inference. Ray runs in standalone mode as I only use one GPU node.
I use map_batches() to run the batch inference, and I want to know whether going through Ray Data slows down inference in standalone mode. My experiment is:
1. Pick a single data batch of 1000 samples (sorry, I'm not able to disclose details of the data).
2. Feed it to (a) an engine instantiated directly via vllm.LLM or sglang.Engine, and (b) a vLLM or SGLang engine hosted inside map_batches().
3. Compare the time taken to finish inference.
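For reference, a minimal sketch of the two setups I'm comparing (the model name, prompt contents, sampling parameters, and batch_size/concurrency values are placeholders, not my actual configuration; in practice I run (a) and (b) separately so two engines don't compete for the GPU):

```python
import ray
import vllm

# Placeholder for the undisclosed batch of 1000 prompts.
prompts = ["<your prompt here>"] * 1000

# (a) Direct engine: one vllm.LLM instance, one generate() call over the whole batch.
llm = vllm.LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
sampling = vllm.SamplingParams(temperature=0.0, max_tokens=256)
direct_outputs = llm.generate(prompts, sampling)

# (b) The same engine hosted inside Ray Data via map_batches() with an actor UDF.
class VLLMPredictor:
    def __init__(self):
        self.llm = vllm.LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.sampling = vllm.SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):
        outs = self.llm.generate([str(p) for p in batch["prompt"]], self.sampling)
        batch["generated_text"] = [o.outputs[0].text for o in outs]
        return batch

ds = ray.data.from_items([{"prompt": p} for p in prompts])
ds = ds.map_batches(
    VLLMPredictor,
    batch_size=32,   # rows handed to each UDF call
    concurrency=1,   # one actor, i.e. one engine, on the single GPU node
    num_gpus=1,
)
ds.materialize()     # trigger execution so the timing can be compared
```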
I noticed that map_batches() is much slower than using the inference engine directly: about 50 minutes vs. 37 minutes!
I want to understand what causes the slowness. Since I kept seeing “decode out of memory” in the SGLang engine logs, I wonder whether Ray intentionally avoids saturating the engine to prevent a crash. I also wonder if there is any way to speed up the inference.
Previously I did some investigation with vLLM 0.8.4 (with VLLM_USE_V1=1); you can see more details in this PR:
Prefer to use V1 over V0 for improved performance
My suggestions:

- Sync to the latest Ray version: the PR mentioned above landed after 2.46.0, and 2.47.0 is coming soon.
- Keep batch_size reasonably small (16-32): the larger it is, the more severe the long-tail problem; the smaller it is, the more per-batch overhead.
- Increase max_concurrent_batches such that max_concurrent_batches * batch_size is large enough to saturate vLLM (monitor the warnings from vllm_engine_stage.py). See the sketch after this list.
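To make those knobs concrete, here is a minimal sketch using the ray.data.llm vLLM processor. The model name and numbers are placeholders, and the field names (model_source, batch_size, max_concurrent_batches, concurrency) should be double-checked against the vLLMEngineProcessorConfig docs for your Ray version:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,               # one engine replica on the single GPU node
    batch_size=16,               # small batches reduce the long-tail effect
    max_concurrent_batches=8,    # 8 * 16 = 128 in-flight rows to keep vLLM saturated
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(generated_text=row["generated_text"]),
)

# Placeholder dataset standing in for the real 1000-prompt batch.
ds = ray.data.from_items([{"prompt": "<your prompt here>"}] * 1000)
ds = processor(ds)
ds.materialize()
```

If the vllm_engine_stage.py warnings about under-saturation persist, raising max_concurrent_batches (rather than batch_size) is the intended way to add more in-flight requests.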
Using ray.data does not magically boost LLM engines on a single machine. The real value of ray.data is horizontal scaling: if you plan to scale out the workload to multiple nodes/machines, ray.data abstracts away the scheduling, orchestration, and fault tolerance, making distributed execution much easier without manual coordination.
Would love to hear what results you get with the above tweaks. Also keep in mind that the long-tail behavior mentioned in the PR may vary depending on your specific prompts.