How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
In our production serving setup, we have carefully tuned a composition of 10+ models that together serve a single user request; they currently just fit on one A100 40G GPU (counting the models themselves plus inference VRAM consumption). We have been using 5 Serve deployments to compose this production workflow.
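To make the shape of this concrete, the current composition looks roughly like the sketch below, assuming a recent Ray version (class names, GPU fractions, and the single downstream call are simplified placeholders, not our real code):

```python
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 0.2})  # illustrative GPU fraction
class ModelA:
    def __call__(self, payload: str) -> str:
        # Stand-in for one of the 10+ tuned models.
        return f"model_a({payload})"


@serve.deployment(ray_actor_options={"num_gpus": 0.2})  # illustrative GPU fraction
class Ingress:
    def __init__(self, model_a_handle):
        # Serve injects a DeploymentHandle for the bound downstream deployment.
        self._model_a = model_a_handle

    async def __call__(self, http_request) -> str:
        payload = (await http_request.body()).decode()
        # In the real app, several downstream deployments are chained per request.
        return await self._model_a.remote(payload)


app = Ingress.bind(ModelA.bind())
```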
Now we want to scale our production QPS capacity, either by launching a single node with 2 GPUs attached, or 1 head node + 2 worker nodes (each with 1 GPU).
However, after some research, I don't think there is a clear way to achieve the following:
Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (replicas of those first 5 deployments) onto GPU 1, so that we could achieve a ~2x QPS increase.
I looked at the following options:
- placement group
- manually setting CUDA_VISIBLE_DEVICES
but my understanding is that they don't provide a solution for what I want.
If there is no other solution, my last resort would be a big code refactor so that all 10+ models live directly in a single deployment's code, and each deployment asks for 1 full GPU.
Thanks in advance!
> Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (replicas of those first 5 deployments)
Just to make sure I understand you properly: did you mean that
- There are 5 Serve deployments in total
- Each deployment has 2 replicas (10 in total across all deployments)
- You want replicas from one "group" (of 5 deployments) to only talk to each other, and not to other replicas?
- Make sure all replicas in a group are scheduled onto the same GPU
Hi Alex, your understanding is correct!
Placement groups should get you what you need here. Essentially, if you want to scale all 5 models as one unit that doesn't interact with replicas on other GPUs, you should put them in one Serve deployment within a placement group.
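A minimal sketch of that shape, assuming a recent Ray version and that all models fit on one GPU as you described (`FullPipeline`, `load_all_models`, and `run_pipeline` below are placeholders for your own code, not a Ray API):

```python
from ray import serve


@serve.deployment(
    num_replicas=2,  # one replica per GPU -> roughly 2x QPS
    ray_actor_options={"num_gpus": 1},  # each replica owns a whole A100
)
class FullPipeline:
    def __init__(self):
        # Placeholder: load all 10+ models onto this replica's GPU.
        self.models = load_all_models()

    async def __call__(self, http_request):
        payload = await http_request.body()
        # Placeholder: run the whole multi-model composition inside one replica,
        # so every intermediate step stays on the same GPU.
        return run_pipeline(self.models, payload)


app = FullPipeline.bind()
# serve.run(app)
```

Scaling further later is then just a matter of raising `num_replicas` (or enabling autoscaling), since each replica is a complete, self-contained copy of the pipeline.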
Thanks, so I guess putting them into one Serve deployment is a hard requirement?
Yes, right now there’s no way to maintain node-based affinity between replicas of different deployments. Therefore, if you want to achieve that, you’d have to keep them as a single Serve deployment.
On a tangential note: based on the description you’ve provided, shaping your application as a single Serve deployment makes the most sense anyway, since the 5 deployments you have right now don’t really seem to be separable; for example, they can’t autoscale independently and are in essence tightly coupled.