Hi team, I have recently been trying out RayLLM with KubeRay, following the example here. I am deploying a Mistral-7B model on 1x A10 GPU and on 4x A10 GPUs. I expected roughly 4x throughput on 4 GPUs, but I am only seeing about a 1.x improvement.
I would like some advice from the experts here on parameter tuning, since I suspect the parameters in the example above are not optimised.
A few questions:
- If I want to utilise 4 GPUs, should I set 4 workers with 1 replica each, or 1 worker with 4 replicas? What is the difference between the two? (See the rough sketch after this list for what I mean.)
- If I want to run 2 different models on a single GPU, what are the best practices for the Ray Serve configuration of each model?
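To make the two questions concrete, here is roughly what I have in mind, written as plain Ray Serve deployments rather than the RayLLM YAML I am actually using. The class names, handler bodies, and the 0.5 GPU split are placeholders, not my real config:

```python
# Minimal Ray Serve sketch of the setups I am asking about (placeholders only).
from ray import serve


# Option A for question 1: 4 replicas, each pinned to its own GPU
# (4 independent copies of the model serving requests in parallel).
@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class MistralPerGpuReplica:
    async def __call__(self, request) -> str:
        return "generated text would go here"


# Option B for question 1: 1 replica that reserves all 4 GPUs
# (one engine spanning the GPUs instead of 4 independent copies).
@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 4})
class MistralSharded:
    async def __call__(self, request) -> str:
        return "generated text would go here"


# Question 2: two different models co-located on one GPU by requesting
# fractional GPU resources for each deployment.
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class SmallModelA:
    async def __call__(self, request) -> str:
        return "model A output"


@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class SmallModelB:
    async def __call__(self, request) -> str:
        return "model B output"
```

My understanding is that option A vs option B maps onto the replica count vs the per-model worker/GPU settings in the RayLLM example config, but please correct me if that mapping is wrong.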
Thanks.