How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I am running a Ray cluster with around 100 nodes. In this environment, there is a high probability that the head node will go down for some reason while the cluster is running. As things stand, if the head node goes down, the entire cluster becomes unusable. I create the cluster from a cluster.yaml file, and all the nodes are on a local network.
To resolve this, I was thinking of running multiple (two or three) head nodes, so that if one goes down, another can handle incoming job requests. Is there any way to achieve this?
OR
Is there any way for the worker nodes to continue their jobs if the head node goes down, so that we can attach a head node to the cluster again in the meantime?