I’m running Ray (1.4.0) on Kubernetes (in GKE) and have a problem with our head-node pod being removed. I’m not sure why the pod is removed in the first place; however, with Kubernetes one can typically set restartPolicy=Always so that a pod restarts after an outage.
I tried setting restartPolicy=Always in our config.yaml (similar to the Kubernetes example), but when running ray.init after the head node restarts, I just get the following error:
RuntimeError: Unable to connect to Redis at 10.43.250.14:6379 after 12 retries. Check that 10.43.250.14:6379 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable RAY_START_REDIS_WAIT_RETRIES to increase the number of attempts to ping the Redis server.
Is restartPolicy=Always supposed to work, or is there another way to make sure the head node stays up?
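As a possible client-side stopgap, here is a minimal sketch of retrying the connection with exponential backoff while the head pod comes back. The helper name connect_with_retry and its parameters are hypothetical (not part of Ray), and it assumes the failure surfaces as the RuntimeError shown above. It would not explain why the pod is removed, and it would not help if the restarted head is no longer reachable at the same address.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (RuntimeError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `attempts` calls have failed.

    Hypothetical helper, not a Ray API. The wait doubles after each
    failed attempt: base_delay_s, 2 * base_delay_s, 4 * base_delay_s, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error propagate unchanged.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Intended usage (assumes ray is installed and the address is the one
# from the error message above):
#
#   import ray
#   connect_with_retry(lambda: ray.init(address="10.43.250.14:6379"))
```

The sleep function is passed in as a parameter only so the backoff can be exercised without actually waiting.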