1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.44
- Python version: 3.10
- OS: Linux (from the original Ray Docker image)
- Cloud/Infrastructure: on-premise Kubernetes cluster (single node)
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: When I call the Serve endpoint, I get a 200 (or 202) back immediately when there are free workers (max_ongoing_requests * num_replicas) and when the queue is not full (max_queued_requests), and a 503 immediately when the queue is full.
- Actual: I’m getting a 200 immediately when workers are free; when a call lands in the queue, it hangs until a worker becomes free (around 60 s) and only then returns a 200. When the queue is full, I get a 503 immediately.
I’m using asyncio.create_task to try to make the processing asynchronous.
Hi @alexxover, welcome to the community! If all replicas are busy handling requests, the request is queued (up to max_queued_requests requests will be queued). The request needs to be dequeued and executed by a replica before a 200 is returned. Depending on your app’s throughput, this could mean the request takes ~60 s to finish. Please let me know if I’m misunderstanding your expected or actual behavior.
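For reference, all three knobs are set on the deployment itself. A minimal sketch (the class name and handler body are placeholders, not your app):

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas=4,          # replicas that can each execute requests
    max_ongoing_requests=1,  # requests a single replica handles concurrently
    max_queued_requests=2,   # beyond this, Serve rejects new requests with a 503
)
class MyService:
    async def __call__(self, request: Request) -> dict:
        # a request beyond the ongoing slots waits in the queue until a
        # replica frees up; only then does this body run and a 200 go back
        ...
```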
Dear Alex,
Thank you for the welcome and for your help.
Yes, at the moment it works exactly as you carefully described. During a peak of requests, a call can take up to 20 minutes depending on my current config.
What I expect instead is to receive a 200 immediately, without hanging until the request is dequeued, because I expect the job to be processed asynchronously (indeed, I forgot to say I have an endpoint to poll for job completion).
The BackPressureError behaviour (immediate 503 when the queue is full) should be kept as is.
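To make the flow I have in mind concrete, here is a minimal sketch (an in-process asyncio.Queue standing in for Serve’s internal queue; JOB_QUEUE and JOB_STATUS are just illustrative names, not part of my app):

```python
import asyncio

JOB_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=2)  # plays the role of max_queued_requests
JOB_STATUS: dict[str, str] = {}                      # job_id -> "QUEUED" / "IN_PROGRESS" / "DONE"


async def submit(job_id: str) -> tuple[int, dict]:
    try:
        JOB_QUEUE.put_nowait(job_id)           # accept or reject immediately, never block
    except asyncio.QueueFull:
        return 503, {"error": "queue full"}    # same spirit as BackPressureError
    JOB_STATUS[job_id] = "QUEUED"
    return 200, {"job_id": job_id}             # respond right away; workers drain the queue later


async def status(job_id: str) -> tuple[int, dict]:
    return 200, {"job_id": job_id, "status": JOB_STATUS.get(job_id, "UNKNOWN")}
```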
Hi @alexxover, so your application is just polling whether a job is complete, e.g. you expect something like (200, “DONE”) or (200, “IN_PROGRESS”) immediately? In that case, the long request duration could be due to how the application decides the status of the requested job. Could you also share your average latency without queueing, along with your max_ongoing_requests, num_replicas, and max_queued_requests numbers?
Hey @alexyang!
Yes, I’m expecting my application to give me back a 200 or a 503 immediately, according to the queue status (503 when the queue is exhausted, 200 otherwise).
Separately, I have an API to poll the job status, so the submission shouldn’t wait for the job to finish.
The average response time when a job is not queued is around 0.05 ms; when a request is queued, the response only comes once the job is assigned to a worker.
My actual numbers are max_ongoing_requests = 1, num_replicas = 4, and max_queued_requests = 2. In this demo setup, before scaling kicks in, a queued request gets a response after about 70 s (the time to complete one job assigned to a worker). In other words, the deployment accepts at most 4 × 1 = 4 in-flight requests plus 2 queued, and the seventh concurrent call is the first to get an immediate 503.
This is my serve deployment:
```python
import asyncio
from typing import Optional

from ray import serve
from starlette.requests import Request


@serve.deployment  # options (num_replicas=4, max_ongoing_requests=1, max_queued_requests=2) per the numbers above
class ServiceHandler:
    def __init__(self):
        ...  # initialization elided

    async def start_processing(self, job_id: str, prev_id: Optional[str] = None):
        await self.process_data(job_id, prev_id)

    async def __call__(self, http_request: Request) -> dict:
        req_data: dict = await http_request.json()
        job_id = req_data['job_id']
        self.job_id = job_id
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            loop = None
        if loop and loop.is_running():
            # fire-and-forget on the replica's event loop
            asyncio.create_task(self.start_processing(job_id))
        else:
            # unreachable inside an async def: a loop is always running here
            asyncio.run(self.start_processing(job_id))
        return {"job_id": self.job_id}
```
Hey @alexxover, I would first double-check that process_data is not blocking the event loop, as that could prevent the expected concurrency from happening even if you’re using async.
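For example (a minimal sketch, assuming the heavy part of process_data is synchronous; _sync_process_data and schedule are hypothetical names, not Serve APIs): move the blocking call onto a thread and keep a reference to the fired task, since bare create_task handles can be garbage-collected before they finish.

```python
import asyncio


class ServiceHandler:
    def __init__(self):
        self._background_tasks: set[asyncio.Task] = set()

    def _sync_process_data(self, job_id: str) -> None:
        ...  # stand-in for the blocking part of process_data (CPU work, sync I/O, ...)

    async def start_processing(self, job_id: str) -> None:
        # to_thread runs the blocking call on a worker thread, so the
        # replica's event loop stays free to accept and answer requests
        await asyncio.to_thread(self._sync_process_data, job_id)

    def schedule(self, job_id: str) -> None:
        task = asyncio.create_task(self.start_processing(job_id))
        # keep a strong reference: a bare create_task handle can be
        # garbage-collected before the task finishes
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
```

Keep in mind that once __call__ returns right away, max_ongoing_requests no longer bounds the work actually running on a replica, so something else (the thread pool size or an internal queue) has to provide the back pressure.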