Scaling Ray Serve with vLLM beyond 2 GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to deploy vLLM with Ray Serve at scale on a Ray cluster, using KubeRay on a k3s cluster. The cluster consists of two nodes with 8 and 4 A100 80GB GPUs respectively.

I am trying to deploy a Ray Serve application with vLLM (async engine) for serving a Llama 2 7B model.
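For context, here is a minimal sketch of the kind of deployment I mean: a Ray Serve deployment wrapping vLLM's `AsyncLLMEngine`. It assumes vLLM's engine API from around the 0.2.x line and Ray Serve 2.x (where the per-replica concurrency option is still called `max_concurrent_queries`; newer releases rename it `max_ongoing_requests`). The class name, route payload, and option values are illustrative, not my exact code.

```python
from starlette.requests import Request
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one A100 per replica
    max_concurrent_queries=100,         # per-replica cap enforced by the Serve router
)
class VLLMDeployment:
    def __init__(self, model: str = "meta-llama/Llama-2-7b-chat-hf"):
        # Build the async vLLM engine inside the replica actor.
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        sampling = SamplingParams(max_tokens=body.get("max_tokens", 256))
        request_id = random_uuid()
        final = None
        # Consume the async generator; the last item carries the full completion.
        async for output in self.engine.generate(body["prompt"], sampling, request_id):
            final = output
        return {"text": [o.text for o in final.outputs]}


app = VLLMDeployment.bind()
# serve.run(app)  # deploy on the running Ray cluster
```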

With a single replica, irrespective of the batch size I set, all incoming requests hit that one replica, and the same is true with 2 replicas.
But with 3 or more replicas, even when I send 1000 concurrent requests, each replica receives fewer than 30 of them concurrently.
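This is roughly how I measure the concurrency, as a sketch (the client below is an assumed stand-in for my load generator; the endpoint path and payload match the illustrative deployment above, not necessarily your setup):

```python
import asyncio
import httpx


async def one_request(client: httpx.AsyncClient, prompt: str) -> str:
    # Post a single generation request to the Serve HTTP endpoint.
    resp = await client.post(
        "http://127.0.0.1:8000/",  # default Ray Serve HTTP address and route prefix
        json={"prompt": prompt, "max_tokens": 64},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["text"][0]


async def main(total: int = 1000) -> None:
    # Fire all requests concurrently and wait for every response.
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(one_request(client, f"Prompt {i}") for i in range(total))
        )
    print(f"Completed {len(results)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```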

What I tried

  • Tried different batch sizes and max_concurrent_queries values to increase the number of requests pushed to each replica (see the sketch after this list).
  • Tried both local and KubeRay-based Ray cluster deployments.
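Concretely, these are the two knobs I have been varying, shown here as a hedged sketch that reuses the illustrative `VLLMDeployment` class from the snippet above; the values are examples, not recommendations:

```python
from ray import serve

app = VLLMDeployment.options(
    num_replicas=4,                 # one replica per GPU I want to use
    max_concurrent_queries=256,     # raise the per-replica concurrency cap
    ray_actor_options={"num_gpus": 1},
).bind()

serve.run(app)
```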

Any help or pointers would be appreciated here. Help needed urgently.

Are you using ray-llm? That’s a framework built on top of Ray Serve and vLLM. It supports Llama-2-7b out of the box.