Model replication with multiple GPU deployments

I have 2 CUDA GPU resources on my server running Ray. If I deploy a single model replica with `{"num_gpus": 1}`, things are fine. If I set the number of replicas to 2, I get a "RuntimeError: CUDA out of memory." error when Ray tries to deploy the 2nd replica. How are people deploying multiple replicas across multiple GPUs? Any sample code/gist or suggestions to try would be appreciated.

Hi @puntime_error, when you set the replicas to 2 and keep `num_gpus` at 1, there should be two identical processes created, each taking one GPU, with its own `CUDA_VISIBLE_DEVICES` setting.
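
For reference, here is a minimal sketch of that setup against the Ray 2.x Serve API. `ModelReplica` and the commented-out `load_model()` are placeholder names, not anything from your code:

```python
import ray
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class ModelReplica:
    def __init__(self):
        import torch  # assumes a PyTorch model; swap in your framework

        # Ray sets CUDA_VISIBLE_DEVICES to the single GPU it assigned this
        # replica, so "cuda:0" maps to a different physical GPU per replica.
        self.device = torch.device("cuda:0")
        # self.model = load_model().to(self.device)  # hypothetical loader

    async def __call__(self, request):
        # Run inference on self.device and return the result.
        return "ok"

serve.run(ModelReplica.bind())
```

With two free GPUs, Serve should schedule one replica per GPU; an OOM on the second replica usually means both replicas ended up seeing the same device, or the first GPU wasn't actually free.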

Can you take a look at `nvidia-smi` before and after deploying, and make sure there are no other processes running on the two GPUs?
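
You can also confirm what each replica will see with plain Ray tasks before bringing Serve into the picture. A quick sketch, assuming the two GPUs are otherwise free:

```python
import os
import ray

ray.init()

@ray.remote(num_gpus=1)
def show_assigned_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES per task/actor based on the GPU it
    # assigned, so the two calls below should report different GPU ids.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get([show_assigned_gpu.remote() for _ in range(2)]))  # e.g. ['0', '1']
```

If both tasks report the same id, something is overriding `CUDA_VISIBLE_DEVICES` (e.g. it was exported before `ray start`), which would explain both replicas landing on one GPU.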