Hi team, I have recently been trying out RayLLM with KubeRay, following the example here. I am deploying a Mistral-7B model on 1x A10 GPU and on 4x A10 GPUs. I expected roughly 4x throughput from the 4-GPU setup, but I am only seeing around a 1.x improvement.
I would like to get some advice from the experts here on parameter tuning, as I suspect the parameters in the example above are not optimised for this setup.
A few questions:
- If I want to utilise 4 GPUs, should I set 4 workers with 1 replica each, or 1 worker with 4 replicas? What is the difference between the two?
- If I want to run 2 different models on a single GPU, what are the best practices for the Ray Serve configuration of each model? (A rough sketch of what I have in mind is below.)
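For the second question, this is roughly the setup I am imagining, based on Ray Serve's fractional GPU resource option (`num_gpus=0.5`). The model IDs and the request handling are just placeholders, not my actual code, so please correct me if this is not the recommended way to share one A10 between two models:

```python
from ray import serve

# Sketch: give each deployment half a GPU so both models can be
# scheduled onto the same A10. Model IDs and the __call__ body are placeholders.

@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class LLMDeployment:
    def __init__(self, model_id: str):
        self.model_id = model_id
        # load the model onto the GPU here (e.g. with transformers or vLLM)

    async def __call__(self, http_request):
        payload = await http_request.json()
        # placeholder response; real inference would run the loaded model
        return {"model": self.model_id, "prompt": payload.get("prompt")}

# Two separate Serve applications, each bound to a different model,
# sharing the single GPU via the fractional num_gpus request.
app_a = LLMDeployment.options(name="model-a").bind("mistralai/Mistral-7B-Instruct-v0.1")
app_b = LLMDeployment.options(name="model-b").bind("another/model-id")

serve.run(app_a, name="app_a", route_prefix="/model-a")
serve.run(app_b, name="app_b", route_prefix="/model-b")
```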
Thanks.