How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I’m wrestling with an issue where we have a FastAPI wrapper for which a 30s request_timeout_s is appropriate (or even very generous). However, we also have long-running jobs waiting on model inference where, under some circumstances, 60+ or even 120+ seconds is appropriate. Think lower-end hardware running smaller LLMs with gigantic context windows, where a few thousand tokens per second is normal, so a 128k context can mean a response taking a bit over two minutes.
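For context, the way I’m setting this today is roughly the sketch below (whether it’s passed at startup like this or via the Serve config file’s http_options, the value is proxy-wide as far as I can tell, which is the crux of the problem):

```python
from ray import serve

# request_timeout_s lives in Serve's HTTP options, so the shared HTTP proxy
# applies it to every incoming request regardless of which deployment ends up
# handling it; I haven't found a per-deployment knob for it.
serve.start(http_options={"request_timeout_s": 30})
```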
This already caused me some angst, because for various reasons I’d have preferred to set different listen ports for different Serve deployments. I worked around that by mapping a load balancer port onto a Serve route path, but the timeout issue may be a harder problem.
I suspect I’ll have to work around this by modifying all the FastAPI calls (see the sketch below), but I’m wondering if anyone has insight into how others deal with this. For resource-management and operational reasons I don’t want to run two Ray instances, and thus two Serve instances, which would be another way to address it.
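If I do end up pushing this into the app layer, the rough shape I have in mind is something like the following. Names like Wrapper, run_inference, and the 30s budget are just placeholders for my setup: keep the global request_timeout_s at the largest value any deployment needs (120s+), and bound the endpoints that should be fast with asyncio.wait_for inside the FastAPI handlers.

```python
import asyncio

from fastapi import FastAPI, HTTPException
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class Wrapper:
    async def run_inference(self, prompt: str) -> str:
        # Placeholder for the real model call; in our case this can take
        # anywhere from a few seconds to 2+ minutes depending on context size.
        await asyncio.sleep(1)
        return "..."

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        try:
            # Enforce a per-endpoint budget here instead of relying on the
            # global request_timeout_s, which has to stay at whatever the
            # slowest deployment needs.
            return await asyncio.wait_for(self.run_inference(prompt), timeout=30)
        except asyncio.TimeoutError:
            raise HTTPException(status_code=504, detail="inference timed out")
```

The obvious downside is that every endpoint that should time out early needs this wrapping, which is exactly the churn I was hoping to avoid.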