Automating the serving of many different models

Hello, my team is working on integrating Ray Serve into our model serving flow, and we are trying to figure out the best practice for serving many different models concurrently on k8s. Basically, the situation is that we have many different models that need to be deployed and stopped automatically, and inference requests that should be routed to the corresponding model. We’d love to use as much of Ray as possible to handle the workload, but it seems like there are several approaches.

We can imagine three different flows here, and would love some feedback on how best to approach this problem and how much of it can be solved by Ray.

  1. Ray Serve on a Ray cluster – deploying a new model corresponds to creating a new Ray job, which contains a Serve instance. Each Serve instance will automatically be assigned a new port, which will have to be forwarded automatically. One big benefit of this approach is that each instance of Ray Serve is treated as a job – easy to visualize, start, and stop.

  2. Ray Serve on RayService – deploying a new model corresponds to some state change inside the RayService instance – each model represents a different path in the service graph.

  3. Ray Serve on RayService with an individual RayService for each model – each model, once deployed, would have a unique service, which would then be accessed by the FastAPI server. This is appealing because of its simplicity, and it retains the fault tolerance of the RayService; however, it requires automating Kubernetes service creation.


Hi @banjo, thanks for the question. Could you give more details on why you need to launch many models and stop them automatically? Is it a cost optimization (so you don’t have lots of models running when they’re not needed), or is it because you don’t know which models you’ll need to deploy up front?

One mental model for Ray Serve is that each Ray Serve application should be a collection of deployments that serve a single use case. That’s why whenever you update a single model in an application, you need to update the whole graph. I believe your use case probably warrants splitting each model into a different Serve app and deploying them independently. The Serve team is working on letting you do this in a single cluster (see this RFC), but for now you’d need to do something like you described in option 3. Out of curiosity, how hard would it be to build that service creation infrastructure?

Hi @shrekris, thanks for the follow-up. The use case we want to support is the deployment of arbitrary user-defined models, which are not known ahead of time, so while we have the broad format of the models, we don’t know which weights or which versions are used. In addition to the wide variability here, it is also costly to deploy individual services for each model. Since we can define the use cases ahead of time, it did seem reasonable to fit the Ray Serve application into a collection of deployments (per your note), in which a user flag might route to the correct model. For example, you could imagine the case of replicate.com, which hosts many different models but can route requests to most of them, so the user just needs to select which one.

To simplify that further – if we could just define a deployment which had a diffusion model loaded into it, and every new instantiation of that deployment could load a diffusion model with different weights, that would cover our example case.
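
Something along these lines is what we have in mind, as a rough sketch (load_diffusion_weights is a hypothetical helper, and the names and URIs are made up):

```python
from ray import serve

@serve.deployment
class DiffusionModel:
    def __init__(self, weights_uri: str):
        # The weights URI is user-supplied and not known ahead of time.
        self.pipeline = load_diffusion_weights(weights_uri)  # hypothetical loader

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.pipeline(prompt)

# Two "instantiations" of the same deployment class, each loading different weights.
app_a = DiffusionModel.options(name="diffusion_user_a").bind("s3://bucket/user_a/weights")
app_b = DiffusionModel.options(name="diffusion_user_b").bind("s3://bucket/user_b/weights")
```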

I am familiar with that RFC, although I understand it is not yet supported on KubeRay? A perfect solution for us would be if there were some way to add a deployment into the Serve instance (while holding the graph the same, just with another case that would replicate the inputs and outputs), or to add another Serve instance onto the cluster programmatically.
Will anything like that be possible with the RFC? Or will it still require all the models to be defined in one go, as opposed to through separate job submissions?

In terms of the service creation infrastructure, it’s definitely doable, although it would remove a lot of the abstraction that we enjoy from Ray. It would also be less than ideal to have numerous different Ray dashboards set up to track usage.

Thanks for elaborating on your use case! I think the ask is reasonable: you’d like to create some number of deployments and load a set of different models onto those deployments, perhaps by loading different weights on different replicas and routing requests to the relevant replica.

The multi-app changes probably won’t address this issue directly, although they will let you support different use cases on a single cluster. There’s no out-of-the-box way to programmatically add new deployments, but you should be able to take in Serve configs for new Serve apps, append them to an existing config, and then submit the new config to your cluster.
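
As a rough sketch of what I mean, assuming the multi-app config schema from the RFC (the field names, file name, and import paths here are illustrative and may differ from the final format), you could append an entry to the applications list and then redeploy the file, e.g. with serve deploy:

```python
import yaml

def append_app(config_path: str, name: str, import_path: str,
               route_prefix: str, model_uri: str):
    """Append a new Serve app entry to an existing multi-app config file."""
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}

    config.setdefault("applications", []).append({
        "name": name,
        "route_prefix": route_prefix,
        "import_path": import_path,
        # Per-app arguments, e.g. which weights this app should load.
        "args": {"model_uri": model_uri},
    })

    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)

append_app("serve_config.yaml", "diffusion_user_a",
           "models.diffusion:app", "/user_a", "s3://bucket/user_a/weights")
```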

Could you create a feature request, and link this discussion there? We can track it and see if there’s a more direct way to support this workload.

Created [Serve] Allow loading different weights/versions onto a replica of ray serve deployment · Issue #33107 · ray-project/ray · GitHub to track. One more follow-up question: would it be possible to just set up a KubeRay cluster and then submit new Serve instances through serve.run to the appropriate head IP? Setting a name and route through ray.serve.run — Ray 2.3.0 would allow us to have many different Serve instances running on one KubeRay cluster, correct? I guess the downside is that there needs to be a separate container that runs this command and waits for a shutdown?
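
Concretely, something like this is what we’re picturing (a minimal sketch; the address, deployment class, and route prefix are illustrative):

```python
import ray
from ray import serve

@serve.deployment
class DiffusionModel:
    def __init__(self, weights_uri: str):
        self.weights_uri = weights_uri  # load the real weights here

    async def __call__(self, request):
        return {"model": self.weights_uri}

# Connect to the running KubeRay cluster's head node via the Ray client.
ray.init(address="ray://<head-node-ip>:10001")

# Start a named Serve app with its own route prefix on that cluster.
serve.run(
    DiffusionModel.bind("s3://bucket/user_a/weights"),
    name="user_a",
    route_prefix="/user_a",
)
```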

That should work, as long as the KubeRay cluster launches a separate Ray cluster for each Serve instance. Running serve run on the same Ray cluster will replace the existing Serve application.

If we pass a name to serve.run, will that prevent it from replacing the existing Serve application? ray.serve.run — Ray 2.3.0

Ah, I missed your link to Ray 2.3.0. Yes, if you pass in a name to serve.run that’s distinct from other running Serve instances, you can run the new Serve app without tearing down the other pre-existing apps.

I ran into a similar problem. We want to serve the same type of model, but fine-tuned differently for different users.

I see the following possibilities. I wonder if my understanding is correct.

  1. A multi-app setup is possible in 2.3.0, and we can reuse the same deployment class and deploy it with a different name, model URI, and route prefix. Dynamic updates are done by calling serve.run with different params (see the sketch after this list).
  2. On the other hand, if I use a single RayService to do the setup, it seems dynamic updates would incur issues:
  • any deployment change in the graph would result in the whole graph being reloaded, which is not ideal for our case.
  • it seems to me the deployment names specified in the RayService config need to be unique? Does that imply adding a new model deployment (even if it uses the same deployment class) would need a script push?
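
For possibility 1, this is roughly what I’m picturing (a sketch only; the deployment class, names, and URIs are illustrative), where re-running with the same app name but new params would update just that app:

```python
from ray import serve

@serve.deployment
class FineTunedModel:
    def __init__(self, model_uri: str):
        self.model_uri = model_uri  # load the user-specific checkpoint here

    async def __call__(self, request):
        payload = await request.json()
        return {"model": self.model_uri, "input": payload}

def deploy_for_user(user: str, model_uri: str):
    # Each distinct app name gets its own route prefix on the shared cluster;
    # calling this again for the same user with a new model_uri updates that app.
    serve.run(
        FineTunedModel.options(name=f"model_{user}").bind(model_uri),
        name=f"app_{user}",
        route_prefix=f"/{user}",
    )

deploy_for_user("alice", "s3://bucket/alice/checkpoint")
deploy_for_user("bob", "s3://bucket/bob/checkpoint")
```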

A separate question related to this: for [RFC] Ray Serve model multiplexing support · Issue #33253 · ray-project/ray · GitHub (mentioned in banjo’s request), how does model multiplexing differ from the multi-app setup mentioned in 1?

I would appreciate hearing about any other ways to do this.
Thx!