How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I have a general question on cluster and deployment architecture. My system exposes a few ML models and algorithms through a Ray REST API. I want these models to always be available as Actors, ready to compute and respond whenever a request is received. My plan is to build a Docker image containing the model code, model checkpoints, and all Python dependencies. Dependencies and models get updated frequently, so that Docker image must be rebuilt and redeployed from time to time.
I wonder what’s the best way to design the cluster config and deployment flow to minimize downtime. My plan is to deploy this in an autoscaling Kubernetes cluster, but I am unsure about the best strategy for redeploying Docker images and re-registering existing Actors while avoiding downtime. Does Ray support this natively, or are there tricks I’d need to implement on my own?
Thanks for your answer! My question concerns specifically the zero-downtime update for the Ray Kubernetes operator. I could not find in the documentation how one would update the underlying Docker image without downtime, since KubeRay does not expose a Deployment.
You can indeed update the underlying Docker image without downtime.
When you update your RayService config’s rayClusterConfig with new Docker images, the RayService operator will spin up a new Ray cluster with the new Docker images. Meanwhile, it’ll continue serving traffic using the existing Ray cluster. Once your Serve deployments are ready on the new Ray cluster, the operator will switch traffic to the new cluster and then delete the old cluster. There should be no downtime.
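To make the flow above concrete, here is a minimal sketch of a RayService manifest where the image tag is the field you would bump to trigger the rolling cluster swap. The names (`my-rayservice`, `my-registry/my-model:v2`, `my_app`, `app:deployment`) are placeholders, not values from your setup:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-rayservice
spec:
  serveConfigV2: |
    applications:
      - name: my_app
        import_path: app:deployment
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              # Changing this image (e.g. v1 -> v2) makes the operator
              # build a fresh Ray cluster alongside the old one.
              image: my-registry/my-model:v2
    workerGroupSpecs:
      - groupName: worker
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                # Keep the worker image in sync with the head image.
                image: my-registry/my-model:v2
```

Applying the updated manifest with `kubectl apply -f rayservice.yaml` is enough; the operator handles spinning up the new cluster, waiting for the Serve deployments to become ready, switching traffic, and tearing down the old cluster.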