Hi team, I have recently been trying out RayLLM with KubeRay, following the example here. I am deploying a Mistral-7B model on 1x A10 GPU and on 4x A10 GPUs. I expected roughly 4x throughput on 4 GPUs, but I am only seeing about a 1.x improvement.
I would like some advice from the experts here on parameter tuning, since I suspect the parameters in the example above are not optimised.
A few questions:
- If I want to utilise 4 GPUs, should I set 4 workers with 1 replica each, or 1 worker with 4 replicas? What is the difference between the two? (See the rough sketch after this list for what I mean.)
- If I want to run 2 different models on a single GPU, what are the best practices for the Ray Serve configuration of each model?
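To make the two questions concrete, here is roughly what I have in mind, written as plain Ray Serve deployments rather than the RayLLM YAML I am actually using. The class names, handler bodies, and the 0.5 GPU split are placeholders, not my real config:

```python
# Minimal Ray Serve sketch of the setups I am asking about (placeholders only).
from ray import serve


# Option A for question 1: 4 replicas, each pinned to its own GPU
# (4 independent copies of the model serving requests in parallel).
@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class MistralPerGpuReplica:
    async def __call__(self, request) -> str:
        return "generated text would go here"


# Option B for question 1: 1 replica that reserves all 4 GPUs
# (one engine spanning the GPUs instead of 4 independent copies).
@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 4})
class MistralSharded:
    async def __call__(self, request) -> str:
        return "generated text would go here"


# Question 2: two different models co-located on one GPU by requesting
# fractional GPU resources for each deployment.
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class SmallModelA:
    async def __call__(self, request) -> str:
        return "model A output"


@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class SmallModelB:
    async def __call__(self, request) -> str:
        return "model B output"
```

My understanding is that option A vs option B maps onto the replica count vs the per-model worker/GPU settings in the RayLLM example config, but please correct me if that mapping is wrong.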
Thanks.