I have 2 CUDA GPU resources on my server that is running Ray. If I deploy a single model replica with `{"num_gpus": 1}`, things are fine. If I set the number of replicas to 2, I get a "RuntimeError: CUDA out of memory." error when Ray tries to deploy the 2nd replica. How are people deploying multiple replicas across multiple GPUs? Any sample code/gist or suggestions to try would be appreciated.
Hi @puntime_error, when you set the replicas to 2 and keep `num_gpus` at 1, there should be two identical processes created, each taking one GPU with its own `CUDA_VISIBLE_DEVICES`.
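Here is a minimal sketch of that setup on the Ray 2.x Serve API; the class name, body, and return value are illustrative placeholders, not your code:

```python
from ray import serve

# Two replicas, each reserving one GPU; Ray places them on
# separate devices automatically.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class ModelReplica:
    def __init__(self):
        import os
        # Ray masks CUDA_VISIBLE_DEVICES per replica, so each process
        # sees only the single GPU it was assigned.
        print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

    async def __call__(self, request):
        return "ok"  # placeholder for real model inference

serve.run(ModelReplica.bind())
```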
Can you take a look at `nvidia-smi` before and after and make sure there are no other processes running on the two GPUs?
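If it is easier to capture, the same check can be scripted from Python (purely a sketch; assumes `nvidia-smi` is on your PATH):

```python
import subprocess

def gpu_snapshot() -> str:
    # --query-compute-apps lists every process currently holding GPU memory
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

print(gpu_snapshot())  # run once before serve.run and once after
```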
@simon-mo @puntime_error I am unable to use multiple GPUs while serving with Ray.
I have raised an issue here: Issue on page /serve/getting_started.html · Issue #27905 · ray-project/ray · GitHub
Can you please help?
Hi @Sujit_Kumar, to narrow down the issue, do the basic GPU examples at GPU Support — Ray 3.0.0.dev0 work for you? You can try it on Ray 2.0.0rc1 (`pip install "ray[serve, default]==2.0.0rc1"`).
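Concretely, the basic example on that page boils down to something like this; the task body here is my illustrative sketch:

```python
import os
import ray

ray.init()

# Each GPU task reserves one GPU; Ray sets CUDA_VISIBLE_DEVICES so the
# task only sees its assigned device.
@ray.remote(num_gpus=1)
def use_gpu():
    return os.environ.get("CUDA_VISIBLE_DEVICES")

# With 2 GPUs, both tasks can run in parallel, each on its own device.
print(ray.get([use_gpu.remote(), use_gpu.remote()]))
```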
@Sujit_Kumar you may also have to tell the `transformers` library to use the GPU; see How to make transformers examples use GPU? · Issue #2704 · huggingface/transformers · GitHub for an example and the surrounding context.
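For instance, a rough sketch using the standard `pipeline` API (the task and its default model are placeholder assumptions; PyTorch must be installed):

```python
from transformers import pipeline

# device=0 refers to the first GPU visible to this process; under Ray's
# CUDA_VISIBLE_DEVICES masking, that is the replica's assigned GPU.
classifier = pipeline("sentiment-analysis", device=0)
print(classifier("Ray Serve replicas can each own a GPU."))
```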