1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.44
- Python version: 3.10
- OS: Linux (from the original Ray Docker image)
- Cloud/Infrastructure: on-premise Kubernetes cluster (single node)
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: When I call the Serve endpoint, I get a 200 (or 202) back immediately when there are free workers (max_ongoing_requests * num_replicas) and when the queue is not full (max_queued_requests), and a 503 immediately when the queue is full.
- Actual: I’m getting a 200 immediately when workers are free; when a call lands in the queue, it hangs until a worker becomes free (around 60 s) and only then returns a 200. When the queue is full, I get a 503 immediately.
I’m using asyncio.create_task to try to make the processing asynchronous.
Hi @alexxover, welcome to the community! If all replicas are busy handling requests, the request is queued (up to max_queued_requests requests will be queued). The request needs to be dequeued and executed by a replica before a 200 is returned. Depending on your app’s throughput, this could mean the request takes ~60 s to finish. Please let me know if I’m misunderstanding your expected or actual behavior.
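For reference, all three knobs are set on the deployment itself. A minimal sketch (the class name and handler body are placeholders, not your app):

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas=4,          # replicas that can each execute requests
    max_ongoing_requests=1,  # requests a single replica handles concurrently
    max_queued_requests=2,   # beyond this, Serve rejects new requests with a 503
)
class MyService:
    async def __call__(self, request: Request) -> dict:
        # a request beyond the ongoing slots waits in the queue until a
        # replica frees up; only then does this body run and a 200 go back
        ...
```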
Dear Alex,
Thank you for the welcome and for your help.
Yes, at the moment it works exactly as you carefully described. During a peak of requests, a call can take up to 20 minutes depending on my current config.
What I expect instead is to receive a 200 immediately, without hanging until the request is dequeued, because I expect the job to be processed asynchronously (indeed, I forgot to say I have an endpoint to poll for job completion).
The BackPressureError behaviour (immediate 503 when the queue is full) should be kept as is.
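To make the flow I have in mind concrete, here is a minimal sketch (an in-process asyncio.Queue standing in for Serve’s internal queue; JOB_QUEUE and JOB_STATUS are just illustrative names, not part of my app):

```python
import asyncio

JOB_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=2)  # plays the role of max_queued_requests
JOB_STATUS: dict[str, str] = {}                      # job_id -> "QUEUED" / "IN_PROGRESS" / "DONE"


async def submit(job_id: str) -> tuple[int, dict]:
    try:
        JOB_QUEUE.put_nowait(job_id)           # accept or reject immediately, never block
    except asyncio.QueueFull:
        return 503, {"error": "queue full"}    # same spirit as BackPressureError
    JOB_STATUS[job_id] = "QUEUED"
    return 200, {"job_id": job_id}             # respond right away; workers drain the queue later


async def status(job_id: str) -> tuple[int, dict]:
    return 200, {"job_id": job_id, "status": JOB_STATUS.get(job_id, "UNKNOWN")}
```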
Hi @alexxover, so your application is just polling whether a job is complete, e.g. you expect something like (200, “DONE”) or (200, “IN_PROGRESS”) immediately? In that case, the long request duration could be due to how the application decides the status of the requested job. Could you also share your average latency without queueing, along with your max_ongoing_requests, num_replicas, and max_queued_requests numbers?
Hey @alexyang!
Yes, I’m expecting my application to give me back a 200 or a 503 immediately, according to the queue status (503 when the queue is exhausted, 200 otherwise).
Separately, I have an API to poll the job status, so the submission shouldn’t wait for the job to finish.
The average response time when a job is not queued is around 0.05 ms; when a request is queued, the response only comes once the job is assigned to a worker.
My actual numbers are max_ongoing_requests = 1, num_replicas = 4, and max_queued_requests = 2. In this demo setup, before scaling kicks in, a queued request gets a response after about 70 s (the time to complete one job assigned to a worker). In other words, the deployment accepts at most 4 × 1 = 4 in-flight requests plus 2 queued, and the seventh concurrent call is the first to get an immediate 503.
This is my serve deployment:
```python
import asyncio
from typing import Optional

from ray import serve
from starlette.requests import Request


@serve.deployment  # options (num_replicas=4, max_ongoing_requests=1, max_queued_requests=2) per the numbers above
class ServiceHandler:
    def __init__(self):
        ...  # initialization elided

    async def start_processing(self, job_id: str, prev_id: Optional[str] = None):
        await self.process_data(job_id, prev_id)

    async def __call__(self, http_request: Request) -> dict:
        req_data: dict = await http_request.json()
        job_id = req_data['job_id']
        self.job_id = job_id
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            loop = None
        if loop and loop.is_running():
            # fire-and-forget on the replica's event loop
            asyncio.create_task(self.start_processing(job_id))
        else:
            # unreachable inside an async def: a loop is always running here
            asyncio.run(self.start_processing(job_id))
        return {"job_id": self.job_id}
```
Hey @alexxover, I would first double-check that process_data is not blocking the event loop, as that could prevent the expected concurrency from happening even if you’re using async.
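For example (a minimal sketch, assuming the heavy part of process_data is synchronous; _sync_process_data and schedule are hypothetical names, not Serve APIs): move the blocking call onto a thread and keep a reference to the fired task, since bare create_task handles can be garbage-collected before they finish.

```python
import asyncio


class ServiceHandler:
    def __init__(self):
        self._background_tasks: set[asyncio.Task] = set()

    def _sync_process_data(self, job_id: str) -> None:
        ...  # stand-in for the blocking part of process_data (CPU work, sync I/O, ...)

    async def start_processing(self, job_id: str) -> None:
        # to_thread runs the blocking call on a worker thread, so the
        # replica's event loop stays free to accept and answer requests
        await asyncio.to_thread(self._sync_process_data, job_id)

    def schedule(self, job_id: str) -> None:
        task = asyncio.create_task(self.start_processing(job_id))
        # keep a strong reference: a bare create_task handle can be
        # garbage-collected before the task finishes
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
```

Keep in mind that once __call__ returns right away, max_ongoing_requests no longer bounds the work actually running on a replica, so something else (the thread pool size or an internal queue) has to provide the back pressure.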