Rayserve fault tolerance

Sowmith_Renumakala · October 22, 2024, 12:39pm

Hello team

We have a RayService setup in which the head node is connected to Redis for GCS fault tolerance and has “num-cpus” set to 0 so that it does not handle any traffic. The head and all workers are connected through an Istio mesh.

I have observed the following scenario: when the RayService head goes down for more than 10 minutes, say for 30 minutes, the Ray workers keep handling requests for 10 minutes (the server is configured at EveryNode), after which the workers reject requests but remain alive.

After 30 minutes, when the head service is back up, there is a huge spike in requests to the head node and to the worker nodes. During this time the error QPS increases, real-time requests are rejected, and it takes a while for errors and latency to go down and for real-time traffic to be handled again.
There are no client-side retries.

During this time I see the API below being called from the head:
http://:10002/ray.rpc.CoreWorkerService/PushTask

My questions are:

When the head node is down and the workers are down, are any requests being queued up? If so, where?
Why aren't the worker nodes down? They are still alive after losing the connection with the head node for 10 minutes, and after 30 minutes they take traffic again.
How do I prevent that? When the head node comes back up, I want it to serve real-time traffic rather than stale requests, given that our request timeout is 50 ms.
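
To make the last question concrete, here is a minimal sketch of the behaviour I am after: any request that has already waited longer than our 50 ms timeout is dropped instead of being executed. This is plain Python and not a Ray Serve API. The names `Request`, `is_stale`, and `drain` are hypothetical and only illustrate the idea, assuming each request carries the time at which it was enqueued.

```python
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

REQUEST_TIMEOUT_S = 0.050  # the 50 ms request timeout mentioned above


@dataclass
class Request:
    # Hypothetical request record; enqueued_at is a time.monotonic() reading.
    request_id: str
    enqueued_at: float


def is_stale(req: Request, now: float, timeout_s: float = REQUEST_TIMEOUT_S) -> bool:
    """Return True when the request has waited longer than the timeout."""
    return (now - req.enqueued_at) > timeout_s


def drain(
    queue: Iterable[Request], now: Optional[float] = None
) -> Tuple[List[Request], List[Request]]:
    """Split queued requests into (fresh, stale) based on their age at `now`."""
    if now is None:
        now = time.monotonic()
    fresh: List[Request] = []
    stale: List[Request] = []
    for req in queue:
        # Stale requests are set aside so they can be rejected without doing work.
        (stale if is_stale(req, now) else fresh).append(req)
    return fresh, stale
```

With a check like this, whatever backlog is replayed when the head comes back would be rejected cheaply, and only requests still within the timeout would reach the replicas.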

Thanks
