Scaling Ray Serve with vLLM beyond 2 GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to deploy vLLM with Ray Serve at scale on a Ray cluster, using KubeRay on a k3s cluster. The cluster consists of two nodes with 8 and 4 A100 80GB GPUs respectively.

I am trying to deploy a Ray Serve application with vLLM (async engine) for serving a Llama 2 7B model.
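For context, here is a minimal sketch of the kind of deployment I mean: a Ray Serve deployment wrapping vLLM's `AsyncLLMEngine`. It assumes vLLM's engine API from around the 0.2.x line and Ray Serve 2.x (where the per-replica concurrency option is still called `max_concurrent_queries`; newer releases rename it `max_ongoing_requests`). The class name, route payload, and option values are illustrative, not my exact code.

```python
from starlette.requests import Request
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one A100 per replica
    max_concurrent_queries=100,         # per-replica cap enforced by the Serve router
)
class VLLMDeployment:
    def __init__(self, model: str = "meta-llama/Llama-2-7b-chat-hf"):
        # Build the async vLLM engine inside the replica actor.
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        sampling = SamplingParams(max_tokens=body.get("max_tokens", 256))
        request_id = random_uuid()
        final = None
        # Consume the async generator; the last item carries the full completion.
        async for output in self.engine.generate(body["prompt"], sampling, request_id):
            final = output
        return {"text": [o.text for o in final.outputs]}


app = VLLMDeployment.bind()
# serve.run(app)  # deploy on the running Ray cluster
```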

With a single replica, irrespective of the batch size I set, all incoming requests hit that one replica, and the same is true with 2 replicas.
But with 3 or more replicas, even when I send 1000 concurrent requests, each replica receives fewer than 30 of them concurrently.
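This is roughly how I measure the concurrency, as a sketch (the client below is an assumed stand-in for my load generator; the endpoint path and payload match the illustrative deployment above, not necessarily your setup):

```python
import asyncio
import httpx


async def one_request(client: httpx.AsyncClient, prompt: str) -> str:
    # Post a single generation request to the Serve HTTP endpoint.
    resp = await client.post(
        "http://127.0.0.1:8000/",  # default Ray Serve HTTP address and route prefix
        json={"prompt": prompt, "max_tokens": 64},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["text"][0]


async def main(total: int = 1000) -> None:
    # Fire all requests concurrently and wait for every response.
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(one_request(client, f"Prompt {i}") for i in range(total))
        )
    print(f"Completed {len(results)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```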

What I tried

  • Tried different batch sizes and max_concurrent_queries values to increase the number of requests pushed to each replica (see the sketch after this list).
  • Tried both local and KubeRay-based Ray cluster deployments.
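Concretely, these are the two knobs I have been varying, shown here as a hedged sketch that reuses the illustrative `VLLMDeployment` class from the snippet above; the values are examples, not recommendations:

```python
from ray import serve

app = VLLMDeployment.options(
    num_replicas=4,                 # one replica per GPU I want to use
    max_concurrent_queries=256,     # raise the per-replica concurrency cap
    ray_actor_options={"num_gpus": 1},
).bind()

serve.run(app)
```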

Any help or pointers would be appreciated here. Help needed urgently.

Are you using ray-llm? That’s a framework built on top of Ray Serve and vLLM. It supports Llama-2-7b out of the box.