How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to deploy vLLM with Ray Serve at scale on a Ray cluster, using KubeRay on a k3s cluster. The cluster consists of two nodes with 8 and 4 A100 80GB GPUs, respectively.
The deployment is a Ray Serve application that uses the vLLM async engine to serve a Llama 2 7B model.
With one replica, all incoming requests hit that single replica regardless of the batch size I set. The same holds for 2 replicas.
But with 3 or more replicas, even when I send 1000 concurrent requests, each replica receives fewer than 30 concurrent requests.
What I tried
- Experimented with different batch sizes and max concurrent queries to increase the number of requests pushed to each replica.
- Tried both local and KubeRay cluster deployments.
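For reference, this is roughly the shape of the deployment configuration in question. This is a minimal sketch, not my actual code: the `VLLMDeployment` class name is a placeholder, and the specific values for `num_replicas`, `max_concurrent_queries`, and the sampling parameters are illustrative, assuming a Ray Serve 2.x API.

```python
# Hedged sketch of the deployment settings being tuned (placeholder class,
# illustrative values). Assumes Ray Serve 2.x.
from ray import serve


@serve.deployment(
    num_replicas=3,                # scaling beyond 2 is where throughput drops
    max_concurrent_queries=256,    # per-replica cap on in-flight requests
    ray_actor_options={"num_gpus": 1},
)
class VLLMDeployment:
    """Placeholder wrapper around the vLLM async engine."""

    async def __call__(self, request):
        # Forward the request to the vLLM AsyncLLMEngine here.
        ...


app = VLLMDeployment.bind()
```

The expectation is that with `max_concurrent_queries=256` per replica, 1000 concurrent client requests should keep all replicas far busier than the ~30 in-flight requests each that I observe.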
Any help or pointers would be appreciated; this is urgent for me.