Preserving Job State After Cluster Restart

Problem Description:

Job state is lost after restarting a Ray cluster deployed via KubeRay on Kubernetes. This is disruptive because jobs cannot resume where they left off: the entire workload has to be re-executed, which wastes time and increases computation costs.

Steps to Reproduce:

1. Deploy a Ray cluster using KubeRay in a Kubernetes environment.
2. Submit multiple jobs to the Ray cluster.
3. Restart the Ray cluster (either manually or by simulating a failure/recovery scenario).
4. Observe that the previous job states are not preserved after the restart.

Expected Behavior:

After a restart, the Ray cluster should retain or restore job states so that jobs can either resume from where they left off or be conveniently restarted from the last saved state.

Hi @xiaoming12306,

Currently, Ray doesn’t support this feature for RayCluster. You might want to take a look at the RayJob CRD instead. The RayJob CRD provides a suspend API that suspends the underlying RayCluster. Once the job is suspended, you can update the RayJob CR spec and then set suspend back to false.
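For illustration, here is a minimal sketch of toggling suspend on a RayJob from Python by shelling out to kubectl. It assumes kubectl is on your PATH and configured for the cluster, and that the suspend flag lives at spec.suspend in the RayJob CR. The job name my-rayjob, the default namespace, and the helper function names are hypothetical placeholders, not part of KubeRay.

```python
import json
import subprocess


def build_suspend_patch(suspend: bool) -> str:
    """Return a JSON merge patch that sets spec.suspend on a RayJob.

    Assumption: the suspend flag is exposed as spec.suspend in the RayJob CR.
    """
    return json.dumps({"spec": {"suspend": suspend}})


def build_patch_command(name: str, namespace: str, suspend: bool) -> list[str]:
    """Build the kubectl command that applies the merge patch to a RayJob."""
    return [
        "kubectl", "patch", "rayjob", name,
        "-n", namespace,
        "--type", "merge",
        "-p", build_suspend_patch(suspend),
    ]


def set_suspend(name: str, namespace: str, suspend: bool) -> None:
    """Run the patch; raises CalledProcessError if kubectl exits non-zero."""
    subprocess.run(build_patch_command(name, namespace, suspend), check=True)


# Example usage (hypothetical job name and namespace):
#   set_suspend("my-rayjob", "default", True)    # suspend the RayJob
#   ... update the RayJob CR spec as needed ...
#   set_suspend("my-rayjob", "default", False)   # set suspend back to false
```

Note that this only toggles the flag. It does not by itself checkpoint application state, so anything the job needs to pick up again still has to be saved by the job itself.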

Btw, we have a #kuberay-questions channel in the Ray Slack, which the KubeRay maintainers actively monitor. Feel free to post questions there and interact with other KubeRay users!