Hello, my team is working on integrating Ray Serve into our model serving flow, and we're trying to figure out the best practice for serving many different models concurrently on Kubernetes. The situation is that we have many different models that need to be deployed and stopped automatically, with inference requests routed to the corresponding model. We'd love to use as much of Ray as possible to handle the workload, but there seem to be several possible approaches.
We can imagine three different flows here, and would love some feedback on how best to approach this problem and how much of it can be solved by Ray:
1. **Ray Serve on a Ray cluster** – deploying a new model corresponds to submitting a new Ray job, which contains a Serve instance. Each Serve instance is assigned a new port, which then has to be forwarded automatically. One big benefit of this approach is that each Serve instance is treated as a job, making it easy to visualize, start, and stop.
2. **Ray Serve on a RayService** – deploying a new model corresponds to a state change inside the RayService instance; each model represents a different path in the service graph.
3. **Ray Serve with an individual RayService per model** – each deployed model gets a unique Kubernetes service, which is then accessed by a FastAPI server. This is appealing for its simplicity, and it retains the fault tolerance of the RayService; however, it requires automating Kubernetes service creation.
Hi @banjo, thanks for the question. Could you give more details on why you need to launch many models and stop them automatically? Is it a cost optimization (so you don’t have lots of models running when they’re not needed), or is it because you don’t know which models you’ll need to deploy up front?
One mental model for Ray Serve is that each Ray Serve application should be a collection of deployments that serve a single use case. That’s why whenever you update a single model in an application, you need to update the whole graph. I believe your use case probably warrants splitting each model into a different Serve app and deploying them independently. The Serve team is working on letting you do this in a single cluster (see this RFC), but for now you’d need to do something like you described in option 3. Out of curiosity, how hard would it be to build that service creation infrastructure?
Hi @shrekris, thanks for the follow-up. The use case we want to support is the deployment of arbitrary user-defined models, which are not known ahead of time: while we know the broad format of the models, we don't know which weights or which versions will be used. Beyond the wide variability, it is also costly to deploy an individual service for each model. Since we can define the use cases ahead of time, it did seem reasonable to fit the Ray Serve application into a collection of deployments (per your note), in which a user flag might route the request to the correct model. For example, consider the case of replicate.com: they host many different models but can route requests through most of them, so a request just needs to select which one.
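To make the routing idea concrete, here's a minimal sketch of an ingress that selects a model by a flag in the request. This is plain Python so it stays self-contained; in an actual Ray Serve app the callables would be deployment handles obtained via `.bind()`, and the model names are placeholders:

```python
class Router:
    """Hypothetical ingress: dispatch a request to the model it names."""

    def __init__(self, models):
        self.models = models  # mapping: model name -> callable

    def __call__(self, request):
        # Pick the model named by the request's "model" flag and run inference.
        model = self.models[request["model"]]
        return model(request["input"])

# Stand-ins for two deployed diffusion models (placeholder names).
router = Router({
    "sd-v1": lambda x: f"sd-v1({x})",
    "sd-v2": lambda x: f"sd-v2({x})",
})
result = router({"model": "sd-v2", "input": "a cat"})
```

The limitation this thread is about is that the set of keys in `models` is fixed when the app is deployed, whereas here new models need to appear after the fact.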
To simplify that further – if we could just define a deployment with a diffusion model loaded into it, and every new instantiation of that deployment could load the model with different weights, that would cover our example case and help us a lot.
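A sketch of what that parameterization might look like: one class whose constructor takes a weights identifier, so each instantiation serves different weights. Shown as plain Python; with Ray Serve the class would be decorated with `@serve.deployment` and each app would bind it with its own URI. The URIs and loading logic are placeholders:

```python
class DiffusionModel:
    """Hypothetical deployment parameterized by a weights URI."""

    def __init__(self, weights_uri: str):
        # A real deployment would download and load the checkpoint here
        # (e.g. from S3); this sketch just records the URI.
        self.weights_uri = weights_uri

    def __call__(self, prompt: str) -> str:
        # Stand-in for running inference with the loaded weights.
        return f"generated {prompt!r} with weights {self.weights_uri}"

# Two "instantiations" of the same deployment with different weights.
model_a = DiffusionModel("s3://bucket/sd-v1.ckpt")
model_b = DiffusionModel("s3://bucket/sd-v2.ckpt")
```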
I am familiar with that RFC, although I understand it is not yet supported on KubeRay? A perfect solution for us would be some way to add a deployment into the Serve instance (holding the graph the same, just with another case replicating the inputs and outputs), or to add another Serve instance onto the cluster programmatically.
Will anything like that be possible with the RFC? Or will it still require all the models to be defined in one go, as opposed to through separate job submissions?
In terms of the service creation infrastructure, it's definitely doable, although it would remove a lot of the abstraction we enjoy from Ray. It would also be less than ideal to have numerous different Ray dashboards set up to track usage.
Thanks for elaborating on your use case! I think the ask is reasonable: you'd like to create some number of deployments and load a set of different models onto those deployments, perhaps by loading different weights on different replicas and routing requests to the relevant replica.
The multi-app changes probably won’t address this issue directly, although it will let you support different use cases on a single cluster. There’s no out-of-the-box way to programmatically add new deployments, but you should be able to take in Serve configs for new Serve apps, append them to an existing config, and then submit the new config to your cluster.
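The append-and-resubmit flow could be sketched like this: treat the multi-app Serve config as data, add an entry for the new app, and resubmit the whole thing. The config schema below follows the multi-application format; the app names and import paths are placeholders:

```python
import copy

def append_app(config: dict, app: dict) -> dict:
    """Return a new config with `app` added, rejecting duplicate names."""
    existing = {a["name"] for a in config.get("applications", [])}
    if app["name"] in existing:
        raise ValueError(f"app {app['name']!r} is already deployed")
    new_config = copy.deepcopy(config)
    new_config.setdefault("applications", []).append(app)
    return new_config

# Existing config with one deployed app (placeholder import paths).
config = {"applications": [
    {"name": "model_a", "route_prefix": "/model_a",
     "import_path": "models.model_a:app"},
]}

# A new model arrives: append it and resubmit the merged config.
config = append_app(config, {
    "name": "model_b", "route_prefix": "/model_b",
    "import_path": "models.model_b:app",
})
# The merged config would then be dumped to YAML and submitted to the
# cluster (e.g. with the `serve deploy` CLI).
```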
Could you create a feature request, and link this discussion there? We can track it and see if there’s a more direct way to support this workload.
Ah, I missed your link to Ray 2.3.0. Yes, if you pass a name to serve.run that's distinct from the other running Serve apps, you can run the new Serve app without tearing down the pre-existing ones.