Hello, is there any news on the high-availability capabilities of Ray Serve, particularly regarding the head node as a single point of failure (SPOF)?
The posts I can find on this forum are more than a year old (High Availability for Head node of Ray clusters, Highly available head node?, /t/can-we-get-docker-restart-policy-set-to-always-for-head-node/4119). Is there more support now?
If the consensus is still ‘create multiple clusters and synchronize some state’, do you have more details on what this state is? Is there an example implementation available?
The Ant Ray Serve blog post talks about multi-cluster Ray Serve, but it seems to break quite a lot of Ray Serve’s features, and we would need to create custom services to manage all of this across multiple clusters (service discovery when clusters are not identical, deployment synchronization, orchestration, auto-scaling, …).
How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it (maybe).