Best practices for running multiple models across multiple GPUs in RayLLM

Hi team, I have recently been trying out RayLLM with KubeRay, following the example here. I deployed a Mistral-7B model on 1x A10 GPU and on 4x A10 GPUs. I expected roughly 4x throughput on the larger setup, but only got about a 1.x improvement.

I would like to get some advice from the experts here on parameter tuning, as I feel the parameters in the above example are not optimised.

A few questions:

  1. If I want to utilise 4 GPUs, should I set 4 workers with 1 replica each, or 1 worker with 4 replicas? What is the difference between the two?
  2. If I want to run 2 different models on a single GPU, what are the best practices for the Ray Serve configuration of each model?
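
For reference, here is roughly how I understand the two options; this is only a sketch based on the example config, and the exact field names (`scaling_config`, `num_workers`, `num_gpus_per_worker`, fractional `num_gpus`) are my assumptions, so please correct me if I have them wrong:

```yaml
# Question 1: one replica sharded across 4 workers (tensor parallelism)
# vs. 4 independent replicas with 1 worker each (data parallelism).
deployment_config:
  autoscaling_config:
    min_replicas: 1        # option A: 1 replica...
    max_replicas: 1
scaling_config:
  num_workers: 4           # ...spread over 4 workers (1 GPU each)
  num_gpus_per_worker: 1

# Question 2 (sketch): two separate model deployments sharing one GPU
# by each requesting a fraction of it in their Ray actor options.
# deployment_config:
#   ray_actor_options:
#     num_gpus: 0.5        # each model asks for half the A10
```

In particular, I am unsure whether a 7B model even benefits from tensor parallelism over 4 GPUs, or whether 4 replicas would scale throughput better.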

Thanks.