Graceful Exit from Cluster

shyampatel · February 20, 2023, 6:58am

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I am running a Ray cluster with around 100 nodes, launched from a cluster.yaml file. In our environment, the head node sometimes goes down or crashes. We have seen that when the head node goes down while the cluster is running, all the worker nodes stop as well, and the jobs they were executing are stopped with them.

Is there any way to get a graceful exit from the cluster, so that if the head node goes down, a worker node that is running a job completes that job before exiting?

yic · February 21, 2023, 6:25pm

@shyampatel you need to look at GCS fault tolerance (GCS FT) and make sure the head node stays alive. Jobs need to communicate with the head node to work.

https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/experimental.html#gcs-fault-tolerance
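
Since jobs depend on reaching the head node, a small reachability probe can help a supervisor script notice when the head is gone. This is only a minimal sketch: the function name `head_is_reachable` is hypothetical, port 6379 is assumed to be the port the head was started on, and an open TCP port does not prove that GCS itself is healthy.

```python
import socket


def head_is_reachable(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Assumption: `port` is the port the head node was started with.
    This checks only that something accepts connections there.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A worker-side watchdog could call this periodically and log or alert when it returns False. It does not keep running jobs alive on its own.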

shyampatel · March 2, 2023, 11:42am

@yic Thanks for your reply. We are currently running on on-premise servers, so can you suggest a way to achieve the same thing there?

yic · March 7, 2023, 10:53pm

I don’t think it’s trivial. You need a Redis HA cluster. You can pass its address via RAY_REDIS_ADDRESS.

But there might be a few things you need to tune. You can check KubeRay’s implementation for some ideas.
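
As a rough sketch of passing the Redis address on an on-premise head node, the helper below only builds the environment and the `ray start --head` command line. It does not launch anything. The function name `head_start_command` and the host `redis-ha.internal` are hypothetical placeholders, and any further tuning (passwords, timeouts, restart policy) is left out.

```python
import os
import shlex


def head_start_command(redis_address: str, port: int = 6379):
    """Build (env, argv) for starting a head node backed by external Redis.

    Assumption: `redis_address` is the "host:port" of an already running
    Redis HA deployment that the head node can reach.
    """
    env = dict(os.environ)                    # copy; do not mutate the caller's env
    env["RAY_REDIS_ADDRESS"] = redis_address  # variable named in the post above
    argv = ["ray", "start", "--head", f"--port={port}"]
    return env, argv


if __name__ == "__main__":
    # "redis-ha.internal:6379" is a made-up address for illustration.
    env, argv = head_start_command("redis-ha.internal:6379")
    print(f"RAY_REDIS_ADDRESS={env['RAY_REDIS_ADDRESS']} {shlex.join(argv)}")
    # To actually start the head: subprocess.run(argv, env=env, check=True)
```

Keeping the command construction separate from execution makes it easy to inspect or log what would be run before wiring it into a service manager that restarts the head node.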
