Graceful Exit from Cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am running a Ray cluster with around 100 nodes, launched from a cluster.yaml file. In our environment, there is a chance that the head node goes down or crashes for some reason. We have seen that when the head node goes down while the cluster is running, all the worker nodes stop as well, and the jobs they were executing are stopped.

Is there any way to get a kind of graceful exit from the cluster, so that if the head node goes down while a worker node is running a job, the worker completes the job and then exits?

@shyampatel You need to look into GCS fault tolerance (GCS FT) and make sure the head node stays alive. Jobs need to communicate with the head node to work.

https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/experimental.html#gcs-fault-tolerance
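For the "make sure the head node is alive" part, here is a minimal sketch of an external liveness probe using only the Python standard library. It only checks that something accepts TCP connections on the head node's GCS port; it is not a Ray API and does not prove the cluster is healthy. The hostname is a placeholder, and port 6379 is assumed to be the port the head node was started with.

```python
import socket


def head_node_reachable(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is a plain socket probe, not a Ray health check: it only tells us
    that some process is listening on the given port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False


if __name__ == "__main__":
    # Hypothetical head node address; replace with your own.
    host = "head-node.example.internal"
    if head_node_reachable(host):
        print(f"{host} is accepting connections")
    else:
        print(f"{host} is NOT reachable; consider alerting or restarting it")
```

A probe like this could run from cron or a monitoring agent so that a supervisor restarts the head node quickly, which matters because workers depend on it.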

@yic Thanks for your reply. At the moment we are running on on-premise servers, so can you suggest a way to achieve the same there?

I don’t think it’s trivial. You need a Redis HA cluster. You can pass its address via RAY_REDIS_ADDRESS.

But there might be some things you need to tune a little. You can check KubeRay’s implementation for some ideas.
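As a rough sketch of the RAY_REDIS_ADDRESS suggestion for an on-premise setup, the head node could be started with that variable set in its environment. The Redis address below is a placeholder, the helper names are made up for illustration, and any further tuning (timeouts, Redis authentication, restart supervision) is not covered here.

```python
import os
import subprocess


def head_start_env(redis_address: str, base_env=None) -> dict:
    """Return a copy of the environment with RAY_REDIS_ADDRESS set.

    redis_address is "host:port" of the external Redis (placeholder value
    in the example below). base_env defaults to the current environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_REDIS_ADDRESS"] = redis_address
    return env


def start_head(redis_address: str, port: int = 6379) -> None:
    """Run `ray start --head` with the external Redis address exported."""
    cmd = ["ray", "start", "--head", f"--port={port}"]
    subprocess.run(cmd, env=head_start_env(redis_address), check=True)


if __name__ == "__main__":
    # Hypothetical address of the Redis HA endpoint; replace with your own.
    start_head("redis-ha.example.internal:6379")
```

With the GCS state kept in external Redis, a restarted head node can pick up where the old one left off instead of starting from an empty cluster state.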