How to do scheduling and message communication in Serve DeploymentHandles

My model serving pipeline consists of several components, and I found that Deployment composition suits my requirements. My questions:

  1. Can I select which downstream deployment instances to call? IIUC, calling via a DeploymentHandle will route the request to any instance that meets the load-balancing requirement. But if my business logic needs to schedule requests onto specific downstream deployment instances, how can I do that?
  2. I need to exchange large amounts of memory between deployment instances. How can I do that, for example via RDMA, distributed memory sharing, etc.?

  1. You cannot currently control which replica a request ends up on. Could you say more about your use case? Can your deployment be written in a stateless way?
  2. Ray itself has some distributed memory management. Could you check that doc and see if it works for your use case?

Hi shrekris,

  1. My use case is serving an LLM on k8s. A typical LLM request is served in two stages, Prefill and Decode, and the two stages need to exchange GPU memory (the KV cache). I want to leverage heterogeneous accelerator nodes for the two stages (since the prefill stage is compute-intensive, while the decode stage is memory-intensive). So I plan to use two types of deployment instances, one per stage, and I need to carefully schedule which prefill instance and then which decode instance handles each request. This mandates a user-defined, fairly intricate routing policy.
  2. As mentioned in 1, the two Deployments need to exchange GPU memory. If the accelerators on both deployment instances are Nvidia GPUs, we can leverage GPUDirect RDMA; otherwise we need to offload to CPU memory and then do memory sharing or transport. So I need to maintain this memory-exchange abstraction logic myself and do some low-level work. Does Ray support this kind of customization?
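To make point 1 concrete, here is a Ray-free Python sketch of the kind of routing policy I have in mind. The `ReplicaStats` fields, thresholds, and function names are all made up for illustration — this is just the selection logic, not anything Ray Serve exposes today:

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    # Hypothetical per-replica metrics a custom router would track.
    replica_id: str
    queued_requests: int = 0
    free_kv_cache_bytes: int = 0

def pick_prefill(prefill_replicas):
    # Prefill is compute-bound: send to the replica with the shortest queue.
    return min(prefill_replicas, key=lambda r: r.queued_requests)

def pick_decode(decode_replicas, kv_cache_bytes):
    # Decode is memory-bound: pick the replica with the most free KV-cache
    # space, and fail if no replica can hold this request's cache.
    best = max(decode_replicas, key=lambda r: r.free_kv_cache_bytes)
    if best.free_kv_cache_bytes < kv_cache_bytes:
        raise RuntimeError("no decode replica has room for this KV cache")
    return best
```

As far as I can tell, this kind of selection cannot be expressed through a DeploymentHandle call, which is why I am asking whether there is a hook where such a policy could plug in.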
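And for point 2, a minimal sketch of the transfer-path fallback I would need to maintain myself. The device labels and path names are placeholders; the real implementations behind each path would wrap GPUDirect RDMA or a CPU staging copy:

```python
def choose_transfer_path(src_device: str, dst_device: str) -> str:
    # If both endpoints are Nvidia GPUs, KV-cache pages can move
    # GPU-to-GPU via GPUDirect RDMA; otherwise fall back to staging
    # the cache through CPU memory and a generic transport.
    if src_device == "nvidia_gpu" and dst_device == "nvidia_gpu":
        return "gpudirect_rdma"
    return "cpu_staging"
```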