Ray Serve Head fault tolerance

ammarck · September 11, 2023, 10:13pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hello,

I am evaluating Ray Serve for the company I work for. In our test scenario, we run a Ray Serve application with 3 replicas (using the KubeRay RayService) and then kill the head node.

In this scenario, I expected the Ray Serve application to keep being served by the worker node while the head node recovered.

However, what we actually saw was that the Ray Serve application stopped working. The head node came back, but the Ray Serve application was no longer on it. Is that expected?

Jules_Damji · September 12, 2023, 6:32pm

@ammarck We are a bit heads down because of Ray Summit next week, so replies and responses will be delayed.

Here are some docs on Ray Serve fault tolerance:
https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html#add-end-to-end-fault-tolerance

ammarck · September 12, 2023, 8:23pm

@Jules_Damji Thank you for your response. Yes, I had already seen that document. It doesn’t say what happens if the head node (or pod) fails when fault tolerance is not enabled.

shrekris · October 13, 2023, 5:04pm

Hi @ammarck, welcome to the forums! We discussed this further on the Ray Slack, but for future reference, enabling Redis fault tolerance is required for the head node to recover and the application to continue serving requests while the head node is down.

The expected behavior without it is that the cluster will eventually crash when the head node fails, and KubeRay will restart the cluster.
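
For future readers, here is a rough sketch of the pieces usually involved in enabling Redis-backed fault tolerance with KubeRay, written as a small Python helper that builds the relevant manifest fragments. It assumes the annotation-based setup described in the Ray Serve fault tolerance guide linked above (`ray.io/ft-enabled` on the RayService metadata, `RAY_REDIS_ADDRESS` on the head container). The helper name `gcs_ft_fragment` and the address `redis:6379` are placeholders, and newer KubeRay releases may configure this differently, so please check the docs for your version.

```python
import json


def gcs_ft_fragment(redis_address, storage_namespace=None):
    """Build the manifest pieces for Redis-backed fault tolerance.

    Hypothetical helper for illustration only. It returns plain dicts
    that you would merge into your own RayService manifest by hand.
    """
    # Annotation on the RayService metadata (assumed annotation-based setup).
    annotations = {"ray.io/ft-enabled": "true"}
    if storage_namespace is not None:
        # Optional: keeps this cluster's data separate inside a shared Redis.
        annotations["ray.io/external-storage-namespace"] = storage_namespace

    # Environment variable for the head container so it can reach Redis.
    head_env = [{"name": "RAY_REDIS_ADDRESS", "value": redis_address}]

    return {"annotations": annotations, "head_env": head_env}


# "redis:6379" is a placeholder; use the address of your own Redis service.
fragment = gcs_ft_fragment("redis:6379")
print(json.dumps(fragment, indent=2))
```

The `annotations` entries would go under `metadata.annotations` of the RayService, and the `head_env` entries under the `env` list of the head container in `headGroupSpec`. Redis itself has to be deployed separately and stay reachable while the head pod restarts.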
