Ray controller restarts worker pods after head pod restart

maghsood_esmaeili · November 19, 2023, 8:10am

Hello,
I’m a DevOps engineer, and I’ve set up a Ray cluster with KubeRay for our ML teams. I’m running into significant problems with KubeRay.

I’m testing the high availability of the Ray cluster. I want the Ray workers to keep running my in-progress tasks even after the head pod is deleted (the operator restarts it), but this is not happening. Instead, once the head pod restarts, all worker pods are restarted as well, and my running jobs stop.

I need to run some jobs that may run for days or weeks, and I need to ensure that these jobs run successfully even if the head node fails.

After deleting the head pod, the operator displays the following error:

Got error when updating status
"error": "Operation cannot be fulfilled on rayclusters.ray.io \"raycluster-gpu\": the object has been modified; please apply your changes to the latest version and try again"

I’m new to Ray and KubeRay. What do I need to do for this use case?
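
A note on the operator error quoted above: as far as I understand, that message is the generic conflict the Kubernetes API server returns when a client writes an object using an outdated resourceVersion, and controllers normally handle it by re-reading the object and retrying. So it may be a side effect rather than the cause of the worker restarts, but I have not confirmed that. The sketch below is only a toy model of that behavior in plain Python. Every name in it (FakeStore, Conflict, update_status_with_retry) is hypothetical, and it is not KubeRay or Kubernetes client code.

```python
# Toy model of Kubernetes-style optimistic concurrency.
# All names here are hypothetical; this is not KubeRay or Kubernetes client code.

class Conflict(Exception):
    """Stands in for the conflict returned on a write based on a stale copy."""


class FakeStore:
    """Holds one object plus its resourceVersion, like a tiny API server."""

    def __init__(self):
        self.version = 1
        self.status = "ready"

    def get(self):
        return {"resourceVersion": self.version, "status": self.status}

    def update(self, obj):
        # Reject the write if the caller's copy is older than the stored one.
        if obj["resourceVersion"] != self.version:
            raise Conflict("the object has been modified")
        self.status = obj["status"]
        self.version += 1


def update_status_with_retry(store, new_status, attempts=3, before_write=None):
    """Re-read the latest object and retry whenever the write conflicts."""
    for attempt in range(1, attempts + 1):
        obj = store.get()
        if before_write is not None:
            before_write(attempt)  # hook used below to simulate a second writer
        obj["status"] = new_status
        try:
            store.update(obj)
            return attempt
        except Conflict:
            continue
    raise Conflict("gave up after %d attempts" % attempts)


store = FakeStore()


def concurrent_writer(attempt):
    # Another client modifies the object while our first attempt is in flight.
    if attempt == 1:
        other = store.get()
        other["status"] = "head-restarting"
        store.update(other)


attempts_used = update_status_with_retry(
    store, "ready", before_write=concurrent_writer
)
print(attempts_used, store.get())
```

In this toy run the first write is rejected because another writer changed the object in between, and the second attempt succeeds after re-reading it.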
