vLLM reports no GPU available on the hosting node in Ray

Context

We are using Ray Serve to deploy a vLLM app. It was working well until we recently upgraded vLLM to 0.7.0 and adapted to the API changes.

We have an instance with six 4090 GPUs. We deployed a Ray cluster on it with one head node and five worker nodes. All nodes are Docker containers, and each container is attached to one GPU.

Issue

The core issue is that whenever the vLLM app tries to load the model from disk, it fails to find a GPU in the container where the app is hosted.

  • All containers have exactly the same environment variables.
  • We can run the model directly with the vLLM CLI without any issue.
  • We have redeployed many times, and each time the app can be hosted on an arbitrary node. Whichever node becomes the hosting node throws an error saying it has no GPU, yet when that same node is not the hosting node, it works fine and loads the model weights smoothly.
  • We are now trying to downgrade vLLM, but we would like to know whether this is a bug or a usage issue on our side. Thanks!

Log

 ValueError: Current node has no GPU available. current_node_resource={'node:172.17.0.6_group_0_5ea4bf00a38e2ed9e9af4d4e2c3d2c000000': 0.001, 'accelerator_type:G': 1.0, 'node:172.17.0.6_group_5ea4bf00a38e2ed9e9af4d4e2c3d2c000000': 0.001, 'CPU': 63.0, 'memory': 10593529856.0, 'object_store_memory': 4540084224.0, 'node:172.17.0.6': 0.999, 'bundle_group_0_5ea4bf00a38e2ed9e9af4d4e2c3d2c000000': 999.999, 'bundle_group_5ea4bf00a38e2ed9e9af4d4e2c3d2c000000': 999.999}. vLLM engine cannot start without GPU. Make sure you have at least 1 GPU available in a node current_node_id='6f6229bb91687736efbb6174c5885ad0ad7f5aa6fb53ad20afaee93a' current_ip='172.17.0.6'.

Full error log: gist:0aea4772b3273a2e9a6427c77eb25354 · GitHub

Reproduce

Python package versions

vllm==0.7.0
ray==2.41.0
ray[serve]==2.41.0
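
A minimal sketch of the deployment pattern (the model name, tensor parallelism, and deployment options below are placeholders to show the shape of our setup, not the exact production configuration):

from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        # Placeholder model; the real deployment loads our own weights from disk.
        engine_args = AsyncEngineArgs(
            model="facebook/opt-125m",
            tensor_parallel_size=1,
        )
        # Engine construction is where the "Current node has no GPU available"
        # error above is raised for us.
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

app = VLLMDeployment.bind()
# Deployed with serve.run(app) or via a Serve config file.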

Hi! I’ve investigated this and talked to some of the Ray engineers, and this is a known issue that they are currently working on fixing. In the meantime, can you try setting distributed_executor_backend="mp" and see if that fixes the issue?
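
For example, something like this (placeholder model and engine args, just to show where the flag goes; the same option is exposed on the CLI as --distributed-executor-backend mp):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",           # placeholder model
    tensor_parallel_size=1,
    distributed_executor_backend="mp",   # use multiprocessing workers instead of Ray
)
engine = AsyncLLMEngine.from_engine_args(engine_args)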


Thanks! I will try it and report back the results.
