Hi team, I have recently been trying out RayLLM with KubeRay, following the example here. I am deploying a Mistral-7B model on 1x A10 GPU and on 4x A10 GPUs. I was expecting ~4x throughput with 4 GPUs, but I am only seeing a much smaller improvement.
I would like to get some advice from the experts here on parameter tuning, as I feel the parameters in the example above are not optimised.
- If I want to utilise 4 GPUs, should I set 4 workers with 1 replica each, or 1 worker with 4 replicas? And what is the difference between the two?
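For reference, my understanding is that these two options map onto the `deployment_config` / `scaling_config` sections of the RayLLM model YAML roughly as below. This is just a sketch of how I read the example config; the exact field names may differ between RayLLM versions, so please correct me if I have this wrong:

```yaml
# Option A: 1 replica whose engine uses 4 GPU workers
# (the model is shared across 4 GPUs by one vLLM engine)
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 1
scaling_config:
  num_workers: 4          # 4 workers per replica
  num_gpus_per_worker: 1

# Option B: 4 independent replicas, each with 1 GPU worker
# (4 separate copies of the model, one per GPU)
deployment_config:
  autoscaling_config:
    min_replicas: 4
    max_replicas: 4
scaling_config:
  num_workers: 1          # 1 worker per replica
  num_gpus_per_worker: 1
```

Is Option B generally the better choice for throughput with a 7B model that fits on a single A10, since each replica can serve requests independently?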
- If I want to run 2 different models on a single GPU, what are the best practices for the Ray Serve configuration of each model?
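For the two-models-on-one-GPU case, I assume the mechanism would be Ray's fractional GPU allocation, i.e. something like the fragment below in each model's config (assuming both models fit together in the A10's memory, and that `num_gpus_per_worker` accepts fractional values here):

```yaml
# Sketch: each of the two model configs reserves half a GPU,
# so Ray can pack both onto the same physical A10
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 0.5
```

Is this the recommended approach, or is there a better pattern (e.g. limiting each engine's GPU memory utilisation) for co-locating two models?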