I am experimenting with Ray Data for batch inference. Ray runs in standalone mode as I only use one GPU node.
I use map_batches() to run the batch inference, and I want to know whether going through Ray Data slows down inference in standalone mode. My experiment is:
1. Pick a single data batch of 1000 samples (sorry, I'm not able to disclose details of the data).
2. Feed it to (a) an engine instantiated directly via vllm.LLM or sglang.Engine, and (b) a vLLM or SGLang engine hosted inside map_batches().
3. Compare the time taken to finish inference.
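For reference, a minimal sketch of the two setups I'm comparing (the model name, prompt contents, sampling parameters, and batch_size/concurrency values are placeholders, not my actual configuration; in practice I run (a) and (b) separately so two engines don't compete for the GPU):

```python
import ray
import vllm

# Placeholder for the undisclosed batch of 1000 prompts.
prompts = ["<your prompt here>"] * 1000

# (a) Direct engine: one vllm.LLM instance, one generate() call over the whole batch.
llm = vllm.LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
sampling = vllm.SamplingParams(temperature=0.0, max_tokens=256)
direct_outputs = llm.generate(prompts, sampling)

# (b) The same engine hosted inside Ray Data via map_batches() with an actor UDF.
class VLLMPredictor:
    def __init__(self):
        self.llm = vllm.LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.sampling = vllm.SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):
        outs = self.llm.generate([str(p) for p in batch["prompt"]], self.sampling)
        batch["generated_text"] = [o.outputs[0].text for o in outs]
        return batch

ds = ray.data.from_items([{"prompt": p} for p in prompts])
ds = ds.map_batches(
    VLLMPredictor,
    batch_size=32,   # rows handed to each UDF call
    concurrency=1,   # one actor, i.e. one engine, on the single GPU node
    num_gpus=1,
)
ds.materialize()     # trigger execution so the timing can be compared
```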
I noticed that map_batches() is much slower than using the inference engine directly: about 50 minutes vs. 37 minutes!
I want to understand what causes the slowness. Since I kept seeing “decode out of memory” in the SGLang engine logs, I wonder whether Ray intentionally avoids saturating the engine to prevent a crash. I also wonder if there is any way to speed up the inference.
Previously I did some investigation with vLLM 0.8.4 (with VLLM_USE_V1=1); you can see more details in this PR:
Prefer to use V1 over V0 for improved performance
My suggestions:

- Sync to the latest Ray version: the PR mentioned above landed after 2.46.0, and 2.47.0 is coming soon.
- Keep batch_size reasonably small (16-32): the larger it is, the more severe the long-tail problem; the smaller it is, the more per-batch overhead.
- Increase max_concurrent_batches such that max_concurrent_batches * batch_size is large enough to saturate vLLM (monitor the warnings from vllm_engine_stage.py). See the sketch after this list.
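To make those knobs concrete, here is a minimal sketch using the ray.data.llm vLLM processor. The model name and numbers are placeholders, and the field names (model_source, batch_size, max_concurrent_batches, concurrency) should be double-checked against the vLLMEngineProcessorConfig docs for your Ray version:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,               # one engine replica on the single GPU node
    batch_size=16,               # small batches reduce the long-tail effect
    max_concurrent_batches=8,    # 8 * 16 = 128 in-flight rows to keep vLLM saturated
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(generated_text=row["generated_text"]),
)

# Placeholder dataset standing in for the real 1000-prompt batch.
ds = ray.data.from_items([{"prompt": "<your prompt here>"}] * 1000)
ds = processor(ds)
ds.materialize()
```

If the vllm_engine_stage.py warnings about under-saturation persist, raising max_concurrent_batches (rather than batch_size) is the intended way to add more in-flight requests.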
Using ray.data does not magically boost LLM engines on a single machine. The real value of ray.data is horizontal scaling: if you plan to scale out the workload to multiple nodes/machines, ray.data abstracts away the scheduling, orchestration, and fault tolerance, making distributed execution much easier without manual coordination.
Would love to hear what results you get with the above tweaks. Also keep in mind that the long-tail behavior mentioned in the PR may vary depending on your specific prompts.