1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.43.0
- Python version: 3.10.12
- OS: Ubuntu 22.04.4 LTS
- Cloud/Infrastructure: Kubernetes (k8s) cluster, 8 × L40 GPUs per node
- Other libs/tools (if relevant):
- vllm version: 0.8.5
3. What happened vs. what you expected:
- Expected:
I attempt to use Ray Data to create a `deepseek-r1` inference job with vLLM on 16 × L40 GPUs (8 × L40 per node).
- Actual:
When I set the `num_gpus` of `map_batches` to `16`, the cluster returns:
(autoscaler +44s) Error: No available node types can fulfill resource request {'GPU': 16.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m19s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 16.0}. Add suitable node types to this cluster to resolve this issue.
Since the maximum GPU count per actor is limited by the node size (8 GPUs), no actor can ever be allocated 16 GPUs.
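For reference, here is a minimal sketch of the job (the model name, dataset path, batch size, sampling parameters, and the TP=8/PP=2 split are illustrative, not my exact code):

```python
import ray
from vllm import LLM, SamplingParams


class SubClass:
    """The map_batches callable seen in the logs (reconstructed sketch)."""

    def __init__(self):
        # vLLM main process; TP=8 x PP=2 would cover 16 GPUs across 2 nodes.
        self.llm = LLM(
            model="deepseek-ai/DeepSeek-R1",
            tensor_parallel_size=8,
            pipeline_parallel_size=2,
            distributed_executor_backend="ray",
        )

    def __call__(self, batch):
        params = SamplingParams(max_tokens=512)
        outputs = self.llm.generate(list(batch["prompt"]), params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.read_parquet("s3://bucket/prompts")  # placeholder input
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=16,  # <- asks Ray for 16 GPUs on a single actor, i.e. one node
)
ds.materialize()
```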
Then I changed the `num_gpus` of `map_batches` to `0`, executing the vLLM main process in the `map_batches` actor on CPU only, and got the following error:
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:06 [config.py:717] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:07 [arg_utils.py:1669] Engine in background thread is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:07 [config.py:1770] Defaulting to use ray for distributed inference
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) WARN 05-20 18:19:07 [fp8.py:63] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) Failed to initialize model or tokenizer: 'NoneType' object is not subscriptable
Ray’s resource isolation makes the GPUs invisible to the vLLM main process, so vLLM initialization fails.
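The only change from the sketch above was the resource request:

```python
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=0,  # actor requests no GPUs; per the error above, the
                 # vLLM main process then cannot see any CUDA device
)
```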
Finally, I changed the `num_gpus` of `map_batches` to `1` and got the following log:
(MapWorker(MapBatches(SubClass)) pid=42405, ip=****) INFO 05-20 18:34:09 [ray_utils.py:233] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:****': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
This time, vLLM hangs while applying for its placement group. The entire cluster has only 16 × L40 GPUs; since the vLLM main process already occupies one card, only 15 × L40 remain available for the 16-bundle placement group.
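Again only the resource request changed; the arithmetic is what makes it hang:

```python
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=1,  # the vLLM main process pins 1 of the 16 GPUs...
)
# ...then vLLM's Ray backend requests a placement group of 16 {'GPU': 1.0}
# bundles (see the log above), but only 15 GPUs remain free, so it pends forever.
```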
Does Ray Data not support multi-node inference with vLLM? Or must one card always be wasted on vLLM initialization when using Ray Data?
By the way, I also tried Ray Serve to solve this problem, but it has the same `num_gpus` argument, so I expect it will not work either.
Will Ray support this scenario? Or does anyone have a similar use case and a solution? At present, the only option I can think of is building an additional platform on top of Ray (KubeRay or a k8s operator) to manage multi-node vLLM inference tasks on Ray.