Automatically restart head node on kubernetes

I’m running Ray (1.4.0) on Kubernetes (in GKE) and have a problem with our head-node pod being removed. I’m not sure why the pod is removed in the first place; however, with Kubernetes one can typically set restartPolicy=Always to make a pod restart after an outage.

I tried setting restartPolicy=Always in our config.yaml (similar to the kubernetes example), but when running ray.init after the head node restarts, I just get the following error:
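For reference, the change I made looks roughly like this (structure follows the Ray Kubernetes example config; exact fields may differ by Ray version):

```yaml
# Sketch of the relevant part of config.yaml -- based on the Ray
# Kubernetes example config; field layout may vary between versions.
head_node:
  apiVersion: v1
  kind: Pod
  metadata:
    generateName: ray-head-
  spec:
    restartPolicy: Always   # changed from the example's default
    containers:
      - name: ray-node
        image: rayproject/ray:1.4.0
```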

```
RuntimeError: Unable to connect to Redis at 10.43.250.14:6379 after 12 retries. Check that 10.43.250.14:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable RAY_START_REDIS_WAIT_RETRIES to increase the number of attempts to ping the Redis server.
```

Is setting restartPolicy=Always supposed to work, or is there another way to make sure the head node stays up?

Hey @simenandresen, thanks a bunch for making this issue!

@tgaddair @Dmitri do you know if this is resolved with the Kopf integration?

The K8s operator is now the recommended tool to launch Ray on K8s. The operator handles head restarts correctly.

https://docs.ray.io/en/master/cluster/kubernetes.html
https://docs.ray.io/en/master/cluster/kubernetes-advanced.html#restart-behavior

Note that a head restart restarts all Ray processes and loses cluster state – you might want to protect the head pod with a pod disruption budget / pod priority.
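For the pod disruption budget idea, a minimal sketch, assuming the head pod carries a `component: ray-head` label (check the labels your deployment actually applies):

```yaml
# Hypothetical PDB shielding the Ray head pod from voluntary evictions
# (e.g. node drains). The label selector below is an assumption --
# adjust it to match your head pod's real labels.
apiVersion: policy/v1        # use policy/v1beta1 on Kubernetes < 1.21
kind: PodDisruptionBudget
metadata:
  name: ray-head-pdb
spec:
  maxUnavailable: 0          # never voluntarily evict the head
  selector:
    matchLabels:
      component: ray-head
```

This only guards against voluntary disruptions; involuntary ones (node failure, OOM kills) can still take the head down, which is why the operator's restart handling matters too.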

Thanks @Dmitri, I’ll try setting up Ray with the Kubernetes operator.