Problem Description:
Job state is lost when I restart a Ray cluster deployed via KubeRay on Kubernetes. Because the jobs cannot resume where they left off, we have to re-execute the entire workload, which wastes time and increases compute costs.
Steps to Reproduce:
Deploy a Ray cluster using KubeRay on a Kubernetes environment.
Submit multiple jobs to the Ray cluster.
Restart the Ray cluster (either manually or by simulating a failure/recovery scenario).
Observe that the states of the previously submitted jobs are lost after the restart.
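The last two steps can be checked with a small script. This is only a minimal sketch: it assumes the Ray Jobs API is reachable through the dashboard address (for example via a port-forward to the head service at http://127.0.0.1:8265, which is an assumption about the setup), and the helper names `snapshot_jobs`, `lost_jobs`, and `check_after_restart` are made up for illustration.

```python
from typing import Dict, List


def snapshot_jobs(client) -> Dict[str, str]:
    """Map each job's ID to its status string, as reported by the client.

    `client` is expected to behave like ray.job_submission.JobSubmissionClient,
    i.e. expose list_jobs() returning objects with submission_id, job_id, status.
    """
    snapshot = {}
    for job in client.list_jobs():
        # Jobs not started through the Jobs API may have no submission_id.
        key = job.submission_id or job.job_id
        snapshot[key] = str(job.status)
    return snapshot


def lost_jobs(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the IDs that were known before the restart but are missing after it."""
    return sorted(set(before) - set(after))


def check_after_restart(before: Dict[str, str],
                        address: str = "http://127.0.0.1:8265") -> List[str]:
    """Re-query the cluster after the restart and report the missing jobs.

    The default address is an assumption (port-forwarded dashboard port).
    """
    # Imported lazily so the helpers above work without Ray installed.
    from ray.job_submission import JobSubmissionClient

    after = snapshot_jobs(JobSubmissionClient(address))
    return lost_jobs(before, after)
```

Take a snapshot with `snapshot_jobs(JobSubmissionClient(address))` before the restart, save it outside the cluster (for example as JSON on the local machine), then call `check_after_restart(before)` once the cluster is back. A non-empty result lists the jobs whose state was lost.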
Expected Behavior:
After a restart, the Ray cluster should retain or restore job state so that jobs can either resume where they left off or be easily restarted from the last saved state.