I’m running Ray (1.4.0) on Kubernetes (GKE) and have a problem with our head-node pod being removed. I’m not sure why the pod is removed in the first place, but with Kubernetes one can typically set restartPolicy=Always to make a pod restart after an outage.
I tried setting restartPolicy=Always in our config.yaml (similar to the Kubernetes example), but when running ray.init after the head node restarts, I just get the following error:
RuntimeError: Unable to connect to Redis at 10.43.250.14:6379 after 12 retries. Check that 10.43.250.14:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable RAY_START_REDIS_WAIT_RETRIES to increase the number of attempts to ping the Redis server.
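To tell a networking problem apart from a Ray problem, one option is to check whether the Redis port accepts TCP connections at all before calling ray.init. The snippet below is a minimal sketch using only the Python standard library. The helper name wait_for_port and its retry parameters are hypothetical and not part of Ray; the address is the one from the error above.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 5,
                  delay_s: float = 1.0, timeout_s: float = 2.0) -> bool:
    """Return True as soon as a TCP connection to host:port succeeds.

    Hypothetical helper (not a Ray API). It only checks that something
    is listening on the port, not that it is a healthy Redis server.
    """
    for attempt in range(attempts):
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Sleep between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False


if __name__ == "__main__":
    # Address taken from the error message; adjust for your cluster.
    reachable = wait_for_port("10.43.250.14", 6379)
    print("Redis port reachable:", reachable)
```

If this reports False from the machine that runs ray.init, the restarted head pod is probably not reachable at the old address (or Redis has not started yet). If it reports True and ray.init still fails, the problem more likely lies on the Ray side.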
Is setting restartPolicy=Always supposed to work, or is there another way to make sure the head node stays up?