High availability for Ray Serve in 2022 (head node)

artemisart · August 16, 2022, 3:31pm

Hello, is there any news on the high availability capabilities of Ray Serve, particularly regarding the head node as a single point of failure (SPOF)?
The posts I find on this forum are more than a year old (High Availability for Head node of Ray clusters, Highly available head node?, /t/can-we-get-docker-restart-policy-set-to-always-for-head-node/4119); is there more support now?
If the consensus is still ‘create multiple clusters and synchronize some state’, do you have more details on what this state is, and is there an example implementation available?
The Ant Ray Serve blog post talks about multi-cluster Ray Serve, but it seems to break quite a lot of Ray Serve's features, and we would need to create custom services to manage all of this across multiple clusters (service discovery when clusters are not the same, deployment synchronization, orchestration, auto-scaling, …).

How severely does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty in completing my task, but I can work around it (maybe).

architkulkarni · August 16, 2022, 6:12pm

Here are some links describing features in Ray 2.0 (you can try it out before the release with `pip install "ray[serve, default]==2.0.0rc1"`):

Please let us know if you have questions about this!

artemisart · August 23, 2022, 5:03pm

Hello, the GCS HA link is down; I suppose it is now Ray GCS FT - KubeRay Docs (ray-project.github.io).
The change is that Ray Serve can now use an external (HA) Redis cluster for state, correct? It’s not clear what the role of the head node is. When the head node fails, do the workers reconnect to the new head node?

simon-mo · September 1, 2022, 9:58pm

When the head node fails, the workers will continue to serve traffic and talk to each other. When the head node comes back, the workers will reconnect. The role of the head node in the HA scenario is to manage the creation and scaling up/down of the deployments. While the head node is down, no scaling operations can be performed, but traffic can still go through the worker nodes as usual.
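
To make the split between serving traffic and scaling concrete, here is a toy Python model of the behavior described above. It is only a sketch for illustration: `ToyServeCluster`, `HeadNodeUnavailable`, `handle_request`, and `scale_to` are hypothetical names, not part of Ray or Ray Serve, and the model ignores everything except whether the head node is up.

```python
from dataclasses import dataclass


class HeadNodeUnavailable(RuntimeError):
    """Raised when a control-plane operation is attempted without a head node."""


@dataclass
class ToyServeCluster:
    """Toy model only; none of these names exist in Ray."""

    replicas: int = 2
    head_alive: bool = True

    def handle_request(self, payload: str) -> str:
        # Data path: replicas on worker nodes answer whether or not the head is up.
        if self.replicas < 1:
            raise RuntimeError("no replicas available")
        return f"handled {payload!r} with {self.replicas} replicas"

    def scale_to(self, replicas: int) -> None:
        # Control path: creating or resizing deployments needs the head node.
        if not self.head_alive:
            raise HeadNodeUnavailable("cannot scale while the head node is down")
        self.replicas = replicas


cluster = ToyServeCluster()
cluster.head_alive = False               # the head node fails
print(cluster.handle_request("ping"))    # traffic still goes through
try:
    cluster.scale_to(4)                  # scaling is blocked meanwhile
except HeadNodeUnavailable as exc:
    print("scaling blocked:", exc)
cluster.head_alive = True                # the head node comes back
cluster.scale_to(4)                      # scaling works again
print(cluster.replicas)
```

In this model, requests are answered with the existing replica count during the outage, and the scale-up only takes effect once the head node is back.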

