How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am running a Ray cluster of around 100 nodes, launched from a cluster.yaml file. In our environment, the head node can occasionally go down or crash for various reasons. We have seen that when the head node goes down while the cluster is running, all the worker nodes stop as well. As a result, the jobs those worker nodes were executing are stopped too.
Is there any way for workers to exit the cluster gracefully? If the head node goes down, a worker node that is running a job should finish that job first and only then exit.