How to do scheduling and message communication in Serve DeploymentHandles

My model serving pipeline consists of several components, and I found that Deployment composition suits my requirements. My questions:

  1. Can I select which downstream deployment instances to call? IIUC, calling via a DeploymentHandle will route the request to any instance that meets the load-balancing requirement. But if my business logic needs to schedule requests onto specific downstream deployment instances, how can I do that?
  2. I need to exchange large amounts of memory between deployment instances. How can I do that, for example via RDMA, distributed memory sharing, etc.?

  1. You cannot currently control which replica a request ends up on. Could you say more about your use case? Can your deployment be written in a stateless way?
  2. Ray itself has some distributed memory management. Could you check that doc and see if it works for your use case?

Hi shrekris,

  1. My use case is serving an LLM on k8s. A typical LLM request is served in two stages, Prefill and Decode, and the two stages need to exchange GPU memory (the KV cache). I want to leverage heterogeneous accelerator nodes for the two stages (since the prefill stage is compute-intensive, while the decode stage is memory-intensive). So I plan to use two types of deployment instances, one per stage, and I need to carefully schedule which prefill instance and then which decode instance handles each request. This mandates a user-defined, fairly intricate routing policy.
  2. As mentioned in 1, the two Deployments need to exchange GPU memory. If the accelerators on both deployment instances are Nvidia GPUs, we can leverage GPUDirect RDMA; otherwise we need to offload to CPU memory and then do memory sharing or transport. So I need to maintain this memory-exchange abstraction logic myself and do some low-level work. Does Ray support this kind of customization?
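To make point 1 concrete, here is a Ray-free Python sketch of the kind of routing policy I have in mind. The `ReplicaStats` fields, thresholds, and function names are all made up for illustration — this is just the selection logic, not anything Ray Serve exposes today:

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    # Hypothetical per-replica metrics a custom router would track.
    replica_id: str
    queued_requests: int = 0
    free_kv_cache_bytes: int = 0

def pick_prefill(prefill_replicas):
    # Prefill is compute-bound: send to the replica with the shortest queue.
    return min(prefill_replicas, key=lambda r: r.queued_requests)

def pick_decode(decode_replicas, kv_cache_bytes):
    # Decode is memory-bound: pick the replica with the most free KV-cache
    # space, and fail if no replica can hold this request's cache.
    best = max(decode_replicas, key=lambda r: r.free_kv_cache_bytes)
    if best.free_kv_cache_bytes < kv_cache_bytes:
        raise RuntimeError("no decode replica has room for this KV cache")
    return best
```

As far as I can tell, this kind of selection cannot be expressed through a DeploymentHandle call, which is why I am asking whether there is a hook where such a policy could plug in.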
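And for point 2, a minimal sketch of the transfer-path fallback I would need to maintain myself. The device labels and path names are placeholders; the real implementations behind each path would wrap GPUDirect RDMA or a CPU staging copy:

```python
def choose_transfer_path(src_device: str, dst_device: str) -> str:
    # If both endpoints are Nvidia GPUs, KV-cache pages can move
    # GPU-to-GPU via GPUDirect RDMA; otherwise fall back to staging
    # the cache through CPU memory and a generic transport.
    if src_device == "nvidia_gpu" and dst_device == "nvidia_gpu":
        return "gpudirect_rdma"
    return "cpu_staging"
```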