Production best practices for Ray Serve

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hey folks, I’m looking for some guidance on production best practices. To give some context, I have a number of independent models deployed on different endpoints using `DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind(), …})`.
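For concreteness, here's a minimal sketch of that kind of setup (`Model1` and `Model2` are placeholder deployments; any `@serve.deployment` class works the same way):

```python
from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment
class Model1:
    async def __call__(self, request):
        # Placeholder for real model inference.
        return "model1 result"


@serve.deployment
class Model2:
    async def __call__(self, request):
        return "model2 result"


# DAGDriver routes each URL prefix to its bound deployment graph,
# so both models are served from one application.
app = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})
```

Since everything is bound into this one graph, a code change to either model redeploys the whole application.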

Looking at this documentation, if I make code changes to one of these models, a new cluster is started and the previous one is terminated, which requires the Kubernetes cluster to be large enough to schedule both Ray clusters simultaneously. Having to tear down and set up all model deployments when changes are made to just one seems unnecessary, and it also raises scalability concerns if the Kubernetes cluster isn’t large enough.

Is there a better way to approach this to avoid teardown + setup of all deployments?

cc: @Sihan_Wang @shrekris for ideas

Hi @smitkiri-klaviyo, welcome to the forums!

If you bind your models into a single graph, and you issue a code update to one of them, this will cause a rolling update to all the models. Graphs are deployed as a unit, and their upgrades happen as a unit.

Serve recently added experimental support for running multiple apps on a single Ray cluster. You can see the RFC here and the recent changes here. We plan to add support for this in the REST API in the coming weeks, which will let you use it with KubeRay RayServices.

If you use the nightly version of Ray, you can launch multiple apps on a single Ray cluster by running serve.run(your_graph, name=…), where name is the application name. You can split each of your models into a separate application, and when you upgrade an application, only that model will be upgraded. Please note that this API is still experimental and may change in the future.
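As a rough sketch of what that looks like (application names, route prefixes, and the deployments themselves are placeholders, and this assumes a nightly build with the experimental multi-app API):

```python
from ray import serve


@serve.deployment
class Model1:
    async def __call__(self, request):
        return "model1 result"


@serve.deployment
class Model2:
    async def __call__(self, request):
        return "model2 result"


# Each model becomes its own named application with its own route.
serve.run(Model1.bind(), name="app1", route_prefix="/model1")
serve.run(Model2.bind(), name="app2", route_prefix="/model2")

# Re-running serve.run for "app1" with updated code upgrades only that
# application; "app2" keeps serving traffic uninterrupted.
```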

1 Like

Thank you! If I understand correctly, binding all models to a single graph is the only way to deploy multiple models on a single endpoint, right? But it sounds like it’s being addressed in the RFC, really looking forward to it.

With multiple applications on a single ray cluster, are there any concerns / guidance on how many applications we can / should run on a single ray cluster?

Yeah, currently binding all the models into a single graph is the only way to deploy them onto a single Ray cluster.

After the RFC is addressed, you should be able to split each model into its own graph and deploy all the graphs onto a single Ray cluster. Each graph would be independently upgradeable.

> With multiple applications on a single ray cluster, are there any concerns / guidance on how many applications we can / should run on a single ray cluster?

Generally, for production, we’d recommend each Ray cluster support a single use case. If all your models are totally independent, it may be better to run them on different clusters for better fault tolerance and isolation. If you’d prefer, you should still be able to run them on a single cluster without hitting any scaling limits. However, if the cluster goes down, this would affect all the deployments running on it.

1 Like

Hello @shrekris

I’m also concerned about this issue. If I create a different Ray cluster for each application, each of them only runs 1 or 2 replicas, and there would be hundreds of clusters in one K8s cluster.

Is it recommended to run one small application per Ray cluster? Is there a more appropriate way to handle this case?

Thank you so much!

Hi @chaolin, at this point we’ve introduced multi-app support, so multiple Serve apps can run in a single Ray cluster. Since you seem to be running many small apps, you can use multi-app support to pack them into one cluster.