I’m a DevOps engineer, and I’ve set up a Ray cluster with KubeRay for our ML teams. I’m running into significant problems with KubeRay.
I’m testing the high availability of the Ray cluster. I want the Ray workers to keep running my in-progress tasks after the head pod is deleted (the operator restarts it), but that is not what happens. Instead, the worker pods restart and my running jobs stop.
Some of the jobs I need to run may take days or weeks, so I need them to complete successfully even if the head node fails.
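For anyone who wants to reproduce this, the test described above can be sketched roughly as follows. This is a minimal sketch, not the exact procedure used: it assumes `kubectl` is configured for the cluster, that the pods carry the standard KubeRay labels `ray.io/cluster` and `ray.io/node-type`, and that the RayCluster lives in the `default` namespace (the namespace and the wait time are assumptions; the cluster name comes from the operator error message). The helper names are hypothetical.

```python
import json
import subprocess
import time

CLUSTER = "raycluster-gpu"   # name taken from the operator error message
NAMESPACE = "default"        # assumption: adjust to the real namespace


def selector(cluster, node_type):
    """Label selector for the head or worker pods of one RayCluster."""
    return f"ray.io/cluster={cluster},ray.io/node-type={node_type}"


def kubectl(*args):
    """Run kubectl and return its stdout as text."""
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout


def worker_state(pod_list_json):
    """Map each worker pod UID to its total container restart count."""
    state = {}
    for pod in json.loads(pod_list_json)["items"]:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        state[pod["metadata"]["uid"]] = sum(s["restartCount"] for s in statuses)
    return state


def workers_survived(before, after):
    """True if every original worker pod still exists with no new restarts."""
    return all(after.get(uid) == count for uid, count in before.items())


def run_test(wait_seconds=120):
    get_workers = ("get", "pods", "-n", NAMESPACE,
                   "-l", selector(CLUSTER, "worker"), "-o", "json")
    before = worker_state(kubectl(*get_workers))
    # Delete the head pod; the KubeRay operator is expected to recreate it.
    kubectl("delete", "pod", "-n", NAMESPACE, "-l", selector(CLUSTER, "head"))
    time.sleep(wait_seconds)  # assumed to be long enough for the head to return
    after = worker_state(kubectl(*get_workers))
    return workers_survived(before, after)


if __name__ == "__main__":
    print("workers survived head deletion:", run_test())
```

Comparing pod UIDs and restart counts, rather than pod names alone, catches both cases: a worker pod that is deleted and recreated, and a worker container that restarts in place.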
After deleting the head pod, the operator displays the following error:
Got error when updating status "error": "Operation cannot be fulfilled on rayclusters.ray.io \"raycluster-gpu\": the object has been modified; please apply your changes to the latest version and try again"
I’m new to Ray and KubeRay. What do I need to do to support this use case?