High Availability for Head node of Ray clusters

Hi Ray team!
Hope you all are doing well :smiley:

I wanted to understand whether there is a way to avoid a single point of failure in Ray clusters. Since a cluster has only one head node, the whole cluster goes down if the head node disappears.


Hey @akshat-rippling sorry for the late reply!

Right now, there’s no obvious way to avoid the single point of failure, because we have the concept of a head node and the cluster’s central metadata lives on that node. We’ve been working on removing that dependency, but it hasn’t been the highest priority, and I can’t guarantee when it will be available. For now, the best workaround is to operate multiple clusters and checkpoint important state!
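
To make the checkpointing suggestion a bit more concrete, here is a minimal sketch in plain Python (no Ray APIs involved). It assumes the state is a small JSON-serializable dict and that the checkpoint path points at storage that survives the loss of the head node, such as a shared or remote filesystem. The names `save_checkpoint`, `load_checkpoint`, and `run_job` are hypothetical and only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename it over the
    # target, so a crash mid-write never leaves a truncated checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_job(checkpoint_path, total_steps, fail_at=None):
    """Toy job: sums step indices, saving progress after every step.

    `fail_at` is a hypothetical knob that simulates losing the cluster.
    """
    state = load_checkpoint(checkpoint_path, default={"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated cluster loss")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, checkpoint_path)
    return state
```

If the job dies partway through, calling `run_job` again with the same checkpoint path (for example, from a second cluster) picks up from the last saved step instead of starting over. How often to checkpoint and where to store it depends on your workload, so treat this as a starting point rather than a recommended setup.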