How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
In our production serving we have carefully tuned a composition of 10+ models that together serve a single user request; they currently just fit on one A100 40G GPU (model weights plus inference VRAM). This production workflow is composed out of 5 Serve deployments.
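For context, the deployments are declared roughly like the sketch below (class names and the GPU fractions are placeholders, not our real values):

```python
# Rough sketch of how one of the 5 deployments is declared today.
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 0.2})
class ModelA:
    def __init__(self):
        # Load one of the 10+ models onto its slice of the single A100.
        ...

    async def __call__(self, request):
        # Run inference for this stage of the composed pipeline.
        ...

# ModelB..ModelE are declared the same way, each reserving a fraction of
# the GPU so that all 5 deployments pack onto GPU 0.
```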
Now we want to scale our production QPS capacity, either by launching a single node with 2 GPUs attached, or 1 head node + 2 worker nodes (each with 1 GPU).
However, after some research, I don't think there is a clear way to achieve the following:
Allocate all 5 Serve deployments onto GPU 0, then allocate another 5 deployments (the replicas of the first 5) onto GPU 1, so that we get a ~2x QPS increase.
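In code terms, the naive thing I can write is something like the sketch below (names and fractions are again placeholders), but as far as I can tell nothing here guarantees that one full set of the 5 replicas lands on GPU 0 and the duplicate set lands on GPU 1:

```python
from ray import serve

# Doubling num_replicas with fractional GPU requests makes everything fit
# on 2 GPUs in total, but the scheduler is free to mix replicas: e.g. both
# replicas of ModelA could land on GPU 0, which breaks the carefully tuned
# per-GPU VRAM budget.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.2})
class ModelA:
    ...

# Same num_replicas=2 for ModelB..ModelE.
```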
I looked at the following options:
- placement groups
- manually setting CUDA_VISIBLE_DEVICES
but my understanding is that they don't provide a way to get what I want.
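For reference, the placement-group direction I was considering looks roughly like this at the Ray Core level (a sketch only):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# One 1-GPU bundle per replica set. This reserves the capacity, but a
# bundle is just "1 GPU worth" of resources -- I don't see a supported way
# to say "deployments 1-5 go into bundle 0 (GPU 0) and their duplicate
# replicas go into bundle 1 (GPU 1)".
pg = placement_group([{"GPU": 1}, {"GPU": 1}], strategy="PACK")
ray.get(pg.ready())
```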
If there is no other solution, my last resort would be a big code refactor so that all 10+ models live directly inside a single deployment, and that deployment asks for 1 whole GPU, i.e. something like the sketch below.
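```python
from ray import serve

# Last-resort shape (Ray 2.x-style API, names illustrative): one deployment
# owns a full GPU, so Ray can simply place one replica per GPU.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class CombinedPipeline:
    def __init__(self):
        # Load all 10+ models into this replica's GPU.
        ...

    async def __call__(self, request):
        # Run the full multi-model composition inside a single replica.
        ...

app = CombinedPipeline.bind()
serve.run(app)
```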
Thanks in advance!