How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
In our production serving we have carefully tuned a composition of 10+ models that together serve a single user request; they currently just fit on one A100 40G GPU (model weights plus inference VRAM). This production workflow is composed out of 5 Serve deployments.
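For context, the deployments are declared roughly like the sketch below (class names and the GPU fractions are placeholders, not our real values):

```python
# Rough sketch of how one of the 5 deployments is declared today.
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class ModelA:
    def __init__(self):
        # Load one of the 10+ models onto its slice of the single A100.
        ...

    async def __call__(self, request):
        # Run inference for this stage of the composed pipeline.
        ...

# ModelB..ModelE are declared the same way, each reserving a fraction of
# the GPU so that all 5 deployments pack onto GPU 0.
```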
Now we want to scale our production QPS capacity, either by launching a single node with 2 GPUs attached, or 1 head node + 2 worker nodes (each with 1 GPU).
However, after some research, I don't think there is a clear way to achieve the following:
Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (the replicas of the first 5) onto GPU 1, so that we get a ~2x QPS increase.
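In code terms, the naive thing I can write is something like the sketch below (names and fractions are again placeholders), but as far as I can tell nothing here guarantees that one full set of the 5 replicas lands on GPU 0 and the duplicate set lands on GPU 1:

```python
from ray import serve

# Doubling num_replicas with fractional GPU requests makes everything fit
# on 2 GPUs in total, but the scheduler is free to mix replicas: e.g. both
# replicas of ModelA could land on GPU 0, which breaks the carefully tuned
# per-GPU VRAM budget.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.2})
class ModelA:
    ...

# Same num_replicas=2 for ModelB..ModelE.
```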
I looked at the following options:
- placement groups
- manually setting CUDA_VISIBLE_DEVICES
but my understanding is that they don't provide a way to get what I want.
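For reference, the placement-group direction I was considering looks roughly like this at the Ray Core level (a sketch only):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# One 1-GPU bundle per replica set. This reserves the capacity, but a
# bundle is just "1 GPU worth" of resources -- I don't see a supported way
# to say "deployments 1-5 go into bundle 0 (GPU 0) and their duplicate
# replicas go into bundle 1 (GPU 1)".
pg = placement_group([{"GPU": 1}, {"GPU": 1}], strategy="PACK")
ray.get(pg.ready())
```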
If there is no other solution, my last resort would be a big code refactor so that all 10+ models live directly inside a single deployment, and that deployment asks for 1 whole GPU, i.e. something like the sketch below.
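```python
from ray import serve

# Last-resort shape (Ray 2.x-style API, names illustrative): one deployment
# owns a full GPU, so Ray can simply place one replica per GPU.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class CombinedPipeline:
    def __init__(self):
        # Load all 10+ models into this replica's GPU.
        ...

    async def __call__(self, request):
        # Run the full multi-model composition inside a single replica.
        ...

app = CombinedPipeline.bind()
serve.run(app)
```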
Thanks in advance!