Ray Serve Head fault tolerance

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,

I am evaluating Ray Serve for the company I work for. I am testing a scenario where we run a Ray Serve application with 3 replicas (using a KubeRay RayService) and then kill the head node.

In this scenario, I expected the worker nodes to keep serving the Ray Serve application while the head node recovered.

However, what we actually saw was that the Ray Serve application stopped working. The head node came back, but the Ray Serve application was no longer running on it. Is that expected?

@ammarck We are a bit heads down because of Ray Summit next week, so replies will be delayed.

Here are some docs on Ray Serve fault tolerance:
https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html#add-end-to-end-fault-tolerance

@Jules_Damji Thank you for your response. Yes, I had already seen that document. It doesn’t say what happens if the head node (or pod) fails when fault tolerance is not enabled.

Hi @ammarck, welcome to the forums! We discussed this further on the Ray Slack, but for future reference: enabling GCS fault tolerance, which is backed by an external Redis instance, is required for the head node to recover and for the application to continue serving requests while the head node is down.

The expected behavior without it is that the cluster will eventually crash when the head node fails, and KubeRay will restart the cluster.
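
For anyone landing here later, below is a rough sketch of what turning on the Redis-backed fault tolerance looks like on a RayService manifest. It is written in Python only so that it is easy to run and inspect. It assumes the `ray.io/ft-enabled` annotation and the `RAY_REDIS_ADDRESS` environment variable on the head container, as described in the fault tolerance guide linked above; newer KubeRay releases may expose this differently, so please check the docs for your version. The manifest contents, the helper name `with_gcs_fault_tolerance`, and the Redis address `redis:6379` are placeholders, not values from this thread.

```python
import copy
import json


def with_gcs_fault_tolerance(rayservice, redis_address):
    """Return a copy of a RayService manifest with GCS fault tolerance settings added.

    Assumption: KubeRay reads the `ray.io/ft-enabled` annotation and the head
    container's `RAY_REDIS_ADDRESS` env var. Verify against your KubeRay version.
    """
    manifest = copy.deepcopy(rayservice)

    # Annotation on the RayService itself.
    annotations = manifest.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["ray.io/ft-enabled"] = "true"

    # Point every head container at the external Redis instance.
    head_pod = manifest["spec"]["rayClusterConfig"]["headGroupSpec"]["template"]["spec"]
    for container in head_pod["containers"]:
        # Drop any earlier value so the variable is not defined twice.
        env = [e for e in container.get("env", []) if e["name"] != "RAY_REDIS_ADDRESS"]
        env.append({"name": "RAY_REDIS_ADDRESS", "value": redis_address})
        container["env"] = env

    return manifest


# Hypothetical, heavily trimmed manifest: a real RayService also needs the
# Serve config, worker groups, and so on.
rayservice = {
    "apiVersion": "ray.io/v1",
    "kind": "RayService",
    "metadata": {"name": "rayservice-sample"},
    "spec": {
        "rayClusterConfig": {
            "headGroupSpec": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "ray-head", "image": "rayproject/ray:latest"}
                        ]
                    }
                }
            }
        }
    },
}

patched = with_gcs_fault_tolerance(rayservice, "redis:6379")
print(json.dumps(patched, indent=2))  # kubectl accepts JSON manifests as well as YAML
```

The helper works on a copy, so the original manifest is left untouched. The same two settings can of course be written directly into the YAML instead.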