1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.43.0
- Python version: 3.10.12
- OS: Ubuntu 22.04.4 LTS
- Cloud/Infrastructure: Kubernetes (k8s) cluster, 8 × L40 GPUs per node
- Other libs/tools (if relevant):
- vllm version: 0.8.5
3. What happened vs. what you expected:
- Expected:
I attempt to use Ray Data to create a `deepseek-r1` inference job with vLLM on 16 × L40 GPUs (8 × L40 per node).
- Actual:
When I set the `num_gpus` of `map_batches` to `16`, the cluster returns:
(autoscaler +44s) Error: No available node types can fulfill resource request {'GPU': 16.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m19s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 16.0}. Add suitable node types to this cluster to resolve this issue.
Since the maximum GPU count per actor is limited by the node size (8 GPUs), no actor can ever be allocated 16 GPUs.
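For reference, here is a minimal sketch of the job (the model name, dataset path, batch size, sampling parameters, and the TP=8/PP=2 split are illustrative, not my exact code):

```python
import ray
from vllm import LLM, SamplingParams


class SubClass:
    """The map_batches callable seen in the logs (reconstructed sketch)."""

    def __init__(self):
        # vLLM main process; TP=8 x PP=2 would cover 16 GPUs across 2 nodes.
        self.llm = LLM(
            model="deepseek-ai/DeepSeek-R1",
            tensor_parallel_size=8,
            pipeline_parallel_size=2,
            distributed_executor_backend="ray",
        )

    def __call__(self, batch):
        params = SamplingParams(max_tokens=512)
        outputs = self.llm.generate(list(batch["prompt"]), params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.read_parquet("s3://bucket/prompts")  # placeholder input
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=16,  # <- asks Ray for 16 GPUs on a single actor, i.e. one node
)
ds.materialize()
```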
Then I changed the `num_gpus` of `map_batches` to `0`, executing the vLLM main process in the `map_batches` actor on CPU only, and got the following error:
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:06 [config.py:717] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:07 [arg_utils.py:1669] Engine in background thread is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) INFO 05-20 18:19:07 [config.py:1770] Defaulting to use ray for distributed inference
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) WARN 05-20 18:19:07 [fp8.py:63] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(MapWorker(MapBatches(SubClass)) pid=41596, ip=****) Failed to initialize model or tokenizer: 'NoneType' object is not subscriptable
Ray’s resource isolation makes the GPUs invisible to the vLLM main process, so vLLM initialization fails.
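The only change from the sketch above was the resource request:

```python
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=0,  # actor requests no GPUs; per the error above, the
                 # vLLM main process then cannot see any CUDA device
)
```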
Finally, I changed the `num_gpus` of `map_batches` to `1` and got the following log:
(MapWorker(MapBatches(SubClass)) pid=42405, ip=****) INFO 05-20 18:34:09 [ray_utils.py:233] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:****': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
This time, vLLM hangs while applying for its placement group. The entire cluster has only 16 × L40 GPUs; since the vLLM main process already occupies one card, only 15 × L40 remain available for the 16-bundle placement group.
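Again only the resource request changed; the arithmetic is what makes it hang:

```python
ds = ds.map_batches(
    SubClass,
    concurrency=1,
    batch_size=64,
    num_gpus=1,  # the vLLM main process pins 1 of the 16 GPUs...
)
# ...then vLLM's Ray backend requests a placement group of 16 {'GPU': 1.0}
# bundles (see the log above), but only 15 GPUs remain free, so it pends forever.
```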
Does Ray Data not support multi-node inference with vLLM? Or must one card always be wasted on vLLM initialization when using Ray Data?
By the way, I also tried Ray Serve to solve this problem, but it has the same `num_gpus` argument, so I expect it will not work either.
Will Ray support this scenario? Or does anyone have a similar use case and a solution? At present, the only option I can think of is building an additional platform on top of Ray (KubeRay or a k8s operator) to manage multi-node vLLM inference tasks on Ray.