How to recover job data when using RayService to restart the Ray cluster

AndreKuu · June 5, 2023, 1:15pm

Hi guys. I am using the RayService feature with KubeRay, hoping to get high availability and zero-downtime upgrades for Ray clusters. Here are my questions:

1. If some Ray jobs are still running when the Ray cluster begins to upgrade, is it possible for these jobs to resume and continue executing?
2. Can using an external Redis as the GCS backend make this possible? Where can I learn more about using an external storage backend for the Ray GCS?
3. I heard that some teams run the job driver on nodes other than the head node. How can I do that? And would this make resuming jobs possible?

AndreKuu · June 5, 2023, 1:40pm

I found a related document for question 2: Ray GCS Fault Tolerance - KubeRay Docs

