Model replication with multiple GPU deployments

I have 2 CUDA GPU resources on my server that is running Ray. If I deploy a single model replica with {"num_gpus": 1}, things are fine. If I set the number of replicas to 2, I get a "RuntimeError: CUDA out of memory." error when Ray tries to deploy the second replica. How are people deploying multiple replicas across multiple GPUs? Any sample code/gist or suggestions to try would be appreciated.

Hi @puntime_error, when you set the replicas to 2 and keep num_gpus at 1, there should be two identical processes created, each taking one GPU with its own CUDA_VISIBLE_DEVICES.
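For reference, here is a minimal sketch of that setup (the deployment class and its body are placeholders, not your actual model): two replicas, each requesting one GPU, so Ray gives each replica process a different CUDA_VISIBLE_DEVICES.

```python
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class MyModel:
    def __init__(self):
        import torch
        # Each replica sees exactly one GPU, so "cuda" inside this process
        # maps to a different physical device per replica.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    async def __call__(self, request):
        return {"device": self.device}

serve.run(MyModel.bind())
```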

Can you take a look at nvidia-smi before and after and make sure there are no other processes running on the two GPUs?
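If nvidia-smi looks clean, a quick check you could drop into each replica's __init__ (just a debugging snippet, not required code) is to log what the process actually sees:

```python
import os
import torch

# Each replica should report a different CUDA_VISIBLE_DEVICES value
# and a visible GPU count of 1.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPU count:", torch.cuda.device_count())
```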

@simon-mo @puntime_error I am unable to use multiple GPUs while serving with Ray.
I have raised an issue here: Issue on page /serve/getting_started.html · Issue #27905 · ray-project/ray · GitHub
Can you please help?

Hi @Sujit_Kumar, to narrow down the issue, do the basic GPU examples at GPU Support — Ray 3.0.0.dev0 work for you? You can try them on Ray 2.0.0rc1 (pip install "ray[serve, default]==2.0.0rc1").
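Something along these lines is enough to verify GPU scheduling works at all, independent of Serve (assuming a 2-GPU node; the function name is just for illustration):

```python
import os
import ray

ray.init()

@ray.remote(num_gpus=1)
def which_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES for each task based on the GPU it assigned.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

# With 2 GPUs available, the two tasks should report different device IDs.
print(ray.get([which_gpu.remote(), which_gpu.remote()]))
```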

@Sujit_Kumar you may also have to tell the transformers library to use the GPU; see How to make transformers examples use GPU? · Issue #2704 · huggingface/transformers · GitHub for an example and the surrounding context.
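For instance, something like this (the task and model here are just illustrative) moves the model onto the GPU; inside a Serve replica, device 0 is whichever GPU Ray assigned to that replica:

```python
from transformers import pipeline

# device=0 places the pipeline on cuda:0 within this process.
classifier = pipeline("sentiment-analysis", device=0)
print(classifier("Ray Serve replicas can each use their own GPU."))
```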