Start cluster with multiple head nodes

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am running a Ray cluster with around 100 nodes. In this environment, there is a high probability that the head node goes down while the cluster is running. As things stand, if the head node is down, the whole cluster is useless. I am using a cluster.yaml file to create the cluster, and all the nodes are on a local network.

To resolve this, I was thinking of having two or three head nodes, so that if one goes down, another can handle incoming job requests. Is there any way to achieve this?

OR

Is there any way that, if the head node goes down, all worker nodes continue their jobs while we attach a head node to the cluster again?

I think this is a very important feature; HA for the head node does not seem to be finished yet.
I did some searching and found a GitHub issue: [RFC] GCS High availability · Issue #20498 · ray-project/ray (github.com)
Maybe this still needs a lot of work?

GCS fault tolerance (FT) is supported in KubeRay. You can also bring your own Redis cluster and, if the head node goes down, just restart the head node.

https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html#step-2-add-redis-info-to-rayservice
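For a cluster that is not on KubeRay, the "bring your own Redis and restart the head" approach could look roughly like the sketch below. It assumes the head node is started with `ray start --head` and that the external Redis address is passed through the `RAY_REDIS_ADDRESS` environment variable; the helper names, the example address `10.0.0.5:6379`, and the port are made up for illustration, so check them against the fault-tolerance docs for your Ray version.

```python
import os
import subprocess


def head_env(redis_address, base_env=None):
    """Return a copy of the environment with the external Redis address set.

    Assumption: Ray reads RAY_REDIS_ADDRESS on the head node so that GCS
    state is stored in the external Redis instead of in head-node memory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_REDIS_ADDRESS"] = redis_address
    return env


def head_start_command(port=6379):
    """Build the command line used to (re)start the head node."""
    return ["ray", "start", "--head", "--port={}".format(port)]


def restart_head(redis_address, port=6379):
    """Hypothetical helper: restart the head against the same Redis.

    Run this on the head machine (or a replacement machine) after it
    comes back up. Raises CalledProcessError if `ray start` fails.
    """
    return subprocess.run(
        head_start_command(port),
        env=head_env(redis_address),
        check=True,
    )
```

Calling `restart_head("10.0.0.5:6379")` from a boot script or service unit on the head machine would bring the head back pointing at the same Redis. Worker nodes only wait a limited time for the head to return, so the restart has to happen within the reconnect timeout described in the Ray fault-tolerance docs.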

Btw, would you mind sharing the error logs that show why the head node crashed?

HA GCS is more complicated than FT GCS and will be a long-term project.

@yic Thanks for your reply.

It’s not about the head node crashing. In our case, the machine running the head node itself sometimes goes down due to environmental issues.

Got it. I think FT GCS is what you need then.