Preserving Job State After Cluster Restart

xiaoming12306 · October 30, 2024, 1:57am

Problem Description:

I am experiencing an issue where job state is lost after restarting a Ray cluster deployed via KubeRay on Kubernetes. This is disruptive: we cannot resume tasks where they left off and have to re-execute the entire workload, which wastes time and increases computation costs.

Steps to Reproduce:

Deploy a Ray cluster using KubeRay on a Kubernetes environment.

Submit multiple jobs to the Ray cluster.

Restart the Ray cluster (either manually or simulating a failure/recovery scenario).

Observe that the previous job states are not preserved and are lost post-restart.

Expected Behavior:

After a restart, the Ray cluster should retain or restore job state so that jobs can either resume from where they left off or be conveniently restarted from the last saved state.

Kai-Hsun_Chen · October 31, 2024, 6:38pm

Hi @xiaoming12306,

Currently, Ray doesn’t support this feature for RayCluster. You might want to take a look at the RayJob CRD instead. The RayJob CRD provides a suspend API: setting suspend to true suspends the underlying RayCluster. After that, you can update the RayJob CR spec and then set suspend back to false to resume.
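As a rough sketch of that suspend/resume flow, the snippet below builds the `kubectl patch` commands that toggle `spec.suspend` on a RayJob with a JSON merge patch. The job name `my-rayjob` and the `default` namespace are placeholders (assumptions, not from this thread), and the script only prints the commands rather than running them against a cluster.

```python
import json
import shlex


def suspend_patch(suspend: bool) -> str:
    """Return a JSON merge patch that sets spec.suspend on a RayJob."""
    return json.dumps({"spec": {"suspend": suspend}})


def kubectl_patch_command(name: str, namespace: str, suspend: bool) -> list[str]:
    """Build the kubectl argv that applies the suspend patch.

    `name` and `namespace` are whatever your RayJob uses; the values
    passed below are hypothetical placeholders.
    """
    return [
        "kubectl", "patch", "rayjob", name,
        "-n", namespace,
        "--type", "merge",
        "-p", suspend_patch(suspend),
    ]


if __name__ == "__main__":
    # First suspend the job, then (after editing the RayJob spec) resume it.
    for flag in (True, False):
        print(shlex.join(kubectl_patch_command("my-rayjob", "default", flag)))
```

Between the two commands you would edit the RayJob CR spec as needed. Check the RayJob documentation for your KubeRay version to confirm which fields can be changed while the job is suspended.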

Btw, we have a #kuberay-questions channel in the Ray Slack. KubeRay maintainers actively monitor the channel. Feel free to post questions there and interact with other KubeRay users!
