High availability for Ray Serve in 2022 (head node)

Hello, do you have any news on the high availability capabilities of Ray Serve, particularly regarding the head node as a single point of failure (SPOF)?
The posts I can find on this forum are more than a year old (High Availability for Head node of Ray clusters, Highly available head node?, /t/can-we-get-docker-restart-policy-set-to-always-for-head-node/4119). Is there more support now?
If the consensus is still "create multiple clusters and synchronize some state", do you have more details on what this state is? Is there an example implementation available?
The Ant Ray Serve blog post talks about multi-cluster Ray Serve, but it seems to break quite a lot of Ray Serve's features, and we would need to create custom services to manage everything across the clusters (service discovery when the clusters are not identical, deployment synchronization, orchestration, auto-scaling, …).

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it (maybe).

Here are some links describing features in Ray 2.0 (you can try it out before the release with `pip install "ray[serve,default]==2.0.0rc1"`):

Please let us know if you have questions about this!

Hello, the GCS HA link is down. I suppose it is now Ray GCS FT - KubeRay Docs (ray-project.github.io).
The change is that Ray Serve can now use an external (HA) Redis cluster for state, correct? It’s not clear what the role of the head node is. When the head node fails, do the workers reconnect to the new head node?

When the head node fails, the workers continue to serve traffic and talk to each other. When the head node comes back, the workers reconnect to it. The role of the head node in the HA scenario is to manage the creation and scaling up/down of deployments. While the head node is down, no scaling operations can be performed, but traffic still goes through the worker nodes as usual.
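
To make the split described above concrete, here is a minimal toy model in plain Python. It is only a sketch of the behaviour, not Ray code: `ToyServeCluster`, `HeadNodeDown`, `handle_request`, and `scale` are hypothetical names invented for illustration and do not correspond to any Ray Serve API.

```python
class HeadNodeDown(RuntimeError):
    """Raised when a control-plane operation is attempted without a head node."""


class ToyServeCluster:
    """Toy model (NOT Ray code) of the failure behaviour described above."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.head_up = True

    def handle_request(self, payload: str) -> str:
        # Data plane: requests are handled by worker replicas and do not
        # depend on the head node being alive.
        if self.replicas == 0:
            raise RuntimeError("no replicas available")
        return f"handled {payload!r} on one of {self.replicas} replicas"

    def scale(self, replicas: int) -> None:
        # Control plane: creating or scaling deployments needs the head node.
        if not self.head_up:
            raise HeadNodeDown("cannot scale while the head node is down")
        self.replicas = replicas


cluster = ToyServeCluster(replicas=2)

cluster.head_up = False                  # simulate a head node failure
print(cluster.handle_request("ping"))    # traffic still flows
try:
    cluster.scale(4)                     # scaling is blocked
except HeadNodeDown as exc:
    print(f"scaling rejected: {exc}")

cluster.head_up = True                   # head node comes back
cluster.scale(4)                         # scaling works again
print(cluster.replicas)
```

The point of the sketch is only that request handling and scaling fail independently: an outage of the head node costs you the ability to change the deployment, not the ability to serve with the replicas you already have.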