Automatically restart head node on kubernetes

I’m running Ray (1.4.0) on Kubernetes (in GKE) and have a problem with our head-node pod being removed. I’m not sure why the pod is removed in the first place; however, with Kubernetes one can typically set restartPolicy=Always to make a pod restart after an outage.

I tried setting restartPolicy=Always in our config.yaml (similar to the kubernetes example), but when running ray.init after the head node restarts, I just get the following error:
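For reference, the change I made looks roughly like this (structure follows the Ray Kubernetes example config; exact fields may differ by Ray version):

```yaml
# Sketch of the relevant part of config.yaml -- based on the Ray
# Kubernetes example config; field layout may vary between versions.
head_node:
  apiVersion: v1
  kind: Pod
  metadata:
    generateName: ray-head-
  spec:
    restartPolicy: Always   # changed from the example's default
    containers:
      - name: ray-node
        image: rayproject/ray:1.4.0
```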

```
RuntimeError: Unable to connect to Redis at 10.43.250.14:6379 after 12 retries. Check that 10.43.250.14:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable RAY_START_REDIS_WAIT_RETRIES to increase the number of attempts to ping the Redis server.
```

Is setting restartPolicy=Always supposed to work, or is there another way to make sure the head node stays up?

Hey @simenandresen, thanks a bunch for making this issue!

@tgaddair @Dmitri do you know if this is resolved with the Kopf integration?

The K8s operator is now the recommended tool to launch Ray on K8s. The operator handles head restarts correctly.

https://docs.ray.io/en/master/cluster/kubernetes.html
https://docs.ray.io/en/master/cluster/kubernetes-advanced.html#restart-behavior

Note that a head restart restarts all Ray processes and loses cluster state – you might want to protect the head pod with a pod disruption budget / pod priority.
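For the pod disruption budget idea, a minimal sketch, assuming the head pod carries a `component: ray-head` label (check the labels your deployment actually applies):

```yaml
# Hypothetical PDB shielding the Ray head pod from voluntary evictions
# (e.g. node drains). The label selector below is an assumption --
# adjust it to match your head pod's real labels.
apiVersion: policy/v1        # use policy/v1beta1 on Kubernetes < 1.21
kind: PodDisruptionBudget
metadata:
  name: ray-head-pdb
spec:
  maxUnavailable: 0          # never voluntarily evict the head
  selector:
    matchLabels:
      component: ray-head
```

This only guards against voluntary disruptions; involuntary ones (node failure, OOM kills) can still take the head down, which is why the operator's restart handling matters too.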

Thanks @Dmitri, I’ll try setting up Ray with the Kubernetes operator.