Ray Serve Outages

How severely does this issue affect your experience of using Ray?
High - intermittent outages are causing failures in our application

Hi there, we’ve been self-hosting Ray Serve on our EKS cluster and are running into an issue where Ray Serve periodically stops serving requests. I’ve attached a few charts where you can see the average node memory and CPU usage spike for a brief period while the QPS drops to near zero.

I’m trying to debug what’s going on. My current hypothesis is that the cluster is trying to scale up but is being rate-limited by Docker Hub when pulling the busybox image.
Another thing I noticed is that the HTTP proxy actor on the head node seems to go haywire when this happens (see the attached screenshot). If I manually kill the head node, all the services come back and everything runs fine afterwards (because we have Redis-based fault tolerance set up).
The head node has num-cpus set to 0, so I’m also not sure whether the HTTP proxy actor should be running on the head node at all.
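
For reference, here’s roughly how I’ve been checking what resources each node advertises, to confirm the head node really has zero CPUs. Adjust the address for your setup; the exact dictionary keys may vary slightly between Ray versions:

```python
import ray

# Attach to the running cluster instead of starting a local one.
ray.init(address="auto")

# Print the resources each live node advertises. If num-cpus=0 took effect
# on the head node, it should report no (or zero) "CPU" resource.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"])
```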

Anyone have any thoughts on what could be going wrong here? I’d really appreciate some help.

What Ray and KubeRay versions are you using?

Ray 2.4.0 and kuberay-operator:0.5.1

Are you seeing the HTTP proxy usage spike across all proxies, or just the one on the head node?

The head node has num-cpus set to 0, so I’m also not sure whether the HTTP proxy actor should be running on the head node at all.

The proxy doesn’t reserve any CPUs, so it would still run on the head node in this case. This is expected behavior and shouldn’t cause any issues.
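
If you ever do want to control where the proxies run, that’s configurable through the HTTP options passed when Serve is started. A minimal sketch, assuming you start Serve from Python (the exact fields can vary between Ray versions):

```python
from ray import serve
from ray.serve.config import HTTPOptions

# location controls proxy placement: "EveryNode" runs a proxy on each node,
# "HeadOnly" keeps a single proxy on the head node, and "NoServer" disables
# the HTTP proxies entirely.
serve.start(
    detached=True,
    http_options=HTTPOptions(host="0.0.0.0", location="EveryNode"),
)
```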

I wonder if the root cause is that too many requests are queuing up on the proxy and aren’t being cleared. If the only proxy that’s being overloaded is the one on the head node, that might explain why restarting the head node fixes the issue.
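
One way to narrow that down the next time it happens is to hit each proxy directly and see which ones still respond. A rough sketch, assuming the proxies listen on port 8000 and expose the route-table endpoint in your Ray version (the node IPs below are placeholders):

```python
import requests

# Placeholder addresses for the head node and two workers; replace with yours.
proxy_nodes = ["10.0.1.10", "10.0.2.20", "10.0.3.30"]

for ip in proxy_nodes:
    url = f"http://{ip}:8000/-/routes"  # Serve proxies expose their route table here
    try:
        resp = requests.get(url, timeout=2)
        print(ip, resp.status_code, f"{resp.elapsed.total_seconds():.2f}s")
    except requests.RequestException as exc:
        print(ip, "unreachable:", exc)
```

If only the head node’s proxy times out while the others respond, that would point pretty strongly at the head-node proxy being the bottleneck.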

Is there any chance you could retry this on the latest Ray nightly version? We’ve upgraded Ray Serve’s HTTP handling recently, so those upgrades might fix the issue.

This is our prod cluster, so I’m a little hesitant to upgrade to nightly just yet. But we just made a change that will hopefully unblock our autoscaling: we moved the busybox image to our own ECR repo. If that doesn’t work, I’ll give the nightly build a try.

Oh I see. I’d recommend not upgrading the prod cluster to nightly, since that could introduce other issues. Instead, it would be better to set up a smaller cluster just for development and try to reproduce the error there.