GPU allocation for Ray Serve in a multi-GPU environment

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

In our production serving, we have carefully tuned a composition of 10+ models that together serve a single user request; they just barely fit on an A100 40GB GPU (including the models themselves plus inference VRAM consumption). We have been using 5 Serve deployments to compose this production workflow.

Now we want to scale our production QPS, either by launching a single node with 2 GPUs attached, or 1 head node + 2 worker nodes (each with 1 GPU).

However, after some research, I don't think there is a clear way to achieve the following:

Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (replicas of those first 5) onto GPU 1, so that we can achieve a ~2x QPS increase.

I looked at the following options:

  1. Placement groups
  2. Manually setting CUDA_VISIBLE_DEVICES

but my understanding is that neither provides a way to get what I want.

If there are no other solutions, my last resort would be a big code refactor so that all 10+ models are coded directly in a single deployment, and each replica of that deployment asks for 1 full GPU (roughly like the sketch below).
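
Roughly, I imagine that refactor would look something like this (just a sketch with made-up model names and a placeholder loader, not our real code):

```python
from ray import serve

MODEL_NAMES = ["model_a", "model_b", "model_c"]  # stand-ins for our 10+ real models

def load_model(name):
    # Placeholder for real model loading (e.g. loading weights and moving to "cuda").
    return lambda x: x

@serve.deployment(
    num_replicas=2,                     # one replica per GPU -> roughly 2x QPS
    ray_actor_options={"num_gpus": 1},  # each replica owns one full A100
)
class CombinedPipeline:
    def __init__(self):
        # All models are loaded inside a single replica, i.e. onto a single GPU.
        self.models = [load_model(name) for name in MODEL_NAMES]

    async def __call__(self, request):
        # Run the full composition that currently spans 5 deployments.
        result = await request.json()
        for model in self.models:
            result = model(result)
        return result

app = CombinedPipeline.bind()
```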

Thanks in advance!


Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (replicas of those first 5)

Just to make sure I understand you properly: did you mean to say that

  • There are 5 Serve deployments in total
  • Each deployment has 2 replicas (10 total across all deployments)
  • You want replicas from one “group” (of 5 deployments) to only talk to each other, and not to other replicas?
  • All replicas in a group should be scheduled onto the same GPU

Hi Alex, your understanding is correct!

Placement groups should get you what you need here. Essentially, if you want to scale all 5 models as one unit that never interacts with replicas on other GPUs, you should put them in one Serve deployment inside a placement group.
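
As a rough sketch (recent Ray releases expose `placement_group_bundles` / `placement_group_strategy` directly on the deployment decorator, so double-check against the version you're running):

```python
from ray import serve

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 1, "num_cpus": 2},
    # Each replica gets its own placement group; the replica actor is placed
    # in the first bundle, so its ray_actor_options must fit in that bundle.
    placement_group_bundles=[{"GPU": 1, "CPU": 2}],
    placement_group_strategy="STRICT_PACK",
)
class ModelGroup:
    def __init__(self):
        # Load the 5 models that previously lived in separate deployments.
        self.models = []

    async def __call__(self, request):
        # Chain the models here, all on the single GPU owned by this replica.
        return await request.json()

app = ModelGroup.bind()
```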

Thanks, so I guess putting them into one Serve deployment is a hard requirement?

Yes, right now there’s no way to maintain node-based affinity between replicas of different deployments. Therefore, if you want to achieve that, you’d have to keep them as a single Serve deployment.

Tangentially though, based on the description you’ve provided, shaping your application as a single Serve deployment makes the most sense: the 5 deployments you have right now don’t actually seem to be separable, in the sense that they can’t, for example, autoscale independently, and are in essence tightly coupled.
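
For completeness, deploying it is then just a regular serve.run call (sketch, reusing the ModelGroup class from the earlier snippet):

```python
import ray
from ray import serve

# Works the same on a single node with 2 GPUs or on a head + 2 workers (1 GPU each).
ray.init(address="auto")

# With num_replicas=2 and num_gpus=1 per replica, Ray schedules one replica per
# GPU and sets CUDA_VISIBLE_DEVICES inside each replica, so every copy of the
# pipeline sees exactly one device.
serve.run(ModelGroup.bind(), name="model_group")
```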