The pains of global HTTPOptions in ray.serve

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

So I’m wrestling with an issue where we have a FastAPI wrapper for which a 30s request_timeout_s is appropriate (or even very generous). However, we also have long-running jobs waiting on model inference where, under some circumstances, 60+ or even 120+ seconds is appropriate (think lower-end hardware running smaller LLMs with gigantic context windows, where a few thousand tokens per second is normal, so a 128k context can mean pushing a response a bit past two minutes).
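
For context, the timeout is set at the proxy level, so it applies to every application behind the same Serve instance. Something like this minimal sketch, assuming HTTPOptions exposes request_timeout_s the way the Serve config docs describe it:

```python
from ray import serve
from ray.serve.config import HTTPOptions

# request_timeout_s lives in the proxy-level HTTP options, so it is applied
# uniformly to every application served by this Ray instance; as far as I can
# tell there is no per-deployment or per-route override.
serve.start(
    http_options=HTTPOptions(
        host="0.0.0.0",
        port=8000,
        request_timeout_s=30,  # fine for the FastAPI wrapper, too short for inference
    )
)
```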

This already caused me a bit of angst, because for various reasons I’d really have preferred to set different listen ports for different Serve deployments. That one I solved by mapping a load balancer port to a serve path, but the timeout issue may be a worse problem.
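
For reference, the port-to-path workaround looks roughly like this. It is only a sketch: FastWrapper, SlowInference, and the route prefixes are made-up names, and the load balancer that maps each external port onto these paths is not shown.

```python
from fastapi import FastAPI
from ray import serve

fast_api = FastAPI()
slow_api = FastAPI()

@serve.deployment
@serve.ingress(fast_api)
class FastWrapper:
    @fast_api.get("/ping")
    async def ping(self) -> dict:
        return {"ok": True}

@serve.deployment
@serve.ingress(slow_api)
class SlowInference:
    @slow_api.post("/generate")
    async def generate(self) -> dict:
        # Placeholder for a model call that can take 120s+ on low-end hardware.
        return {"text": "..."}

# Both apps share one Serve instance (and therefore one global
# request_timeout_s); the external load balancer maps, e.g.,
# port 8001 -> /fast and port 8002 -> /slow.
serve.run(FastWrapper.bind(), name="fast", route_prefix="/fast")
serve.run(SlowInference.bind(), name="slow", route_prefix="/slow")
```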

I suspect I’ll have to get around this by modifying all the FastAPI calls, but I’m wondering if anyone has insight into how others deal with this. For resource management and operational reasons, we do not want two Ray instances (and thus two Serve instances), which would be another way to address it.
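
In case it helps frame the question, the direction I’m considering for “modifying the FastAPI calls” is to decouple the long inference from the HTTP request entirely: the endpoint starts the work and returns a job id, and the client polls for the result, so every individual request finishes well inside the 30s global timeout. A rough sketch, where the in-memory job store, the endpoint paths, and _run_inference are all hypothetical:

```python
import asyncio
import uuid

from fastapi import FastAPI, HTTPException
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class InferenceGateway:
    def __init__(self):
        # In-memory job store; fine for a single replica, but it would need
        # to live somewhere shared (Redis, a Ray actor, ...) with replicas > 1.
        self._jobs: dict[str, asyncio.Task] = {}

    async def _run_inference(self, prompt: str) -> str:
        # Placeholder for the real model call that can take 120s+.
        await asyncio.sleep(120)
        return f"result for: {prompt}"

    @app.post("/jobs")
    async def submit(self, prompt: str) -> dict:
        # Kick off the work and return immediately, well under 30s.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = asyncio.create_task(self._run_inference(prompt))
        return {"job_id": job_id}

    @app.get("/jobs/{job_id}")
    async def poll(self, job_id: str) -> dict:
        task = self._jobs.get(job_id)
        if task is None:
            raise HTTPException(status_code=404, detail="unknown job")
        if not task.done():
            return {"status": "running"}
        return {"status": "done", "result": task.result()}

serve.run(InferenceGateway.bind(), route_prefix="/")
```

It adds a round trip and some bookkeeping, which is why I’d rather hear how others handle mixed-latency workloads under one Serve instance first.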