Can we customize the behavior of Ray Serve when max_concurrent_requests is reached?

RJ_Lucas · December 29, 2023, 3:44pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am using Ray Serve to deploy Llama 2 13B Chat with Hugging Face Transformers. I have found that when traffic spikes and many inferences run concurrently, the GPU becomes overloaded. For example, running 2 inferences concurrently and then 2 more concurrently is faster than running all 4 concurrently.

To alleviate this problem, I am using max_concurrent_requests to limit how many inferences can be in progress at once. However, it seems that once this limit is reached, the Serve Proxy continues to accept new requests and simply waits for a replica to become available. This causes timeouts on the client.

What I’d like to do is reject a request immediately if the number of concurrent requests is maxed out, so that the client making the request can retry after a short wait. I’m hoping that will help spread the load out and lead to higher throughput.
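To illustrate the client side of this idea, here is a minimal sketch of retrying after a short wait when the server rejects a request. The names `Rejected`, `retry_with_backoff`, and `send` are hypothetical and not part of Ray Serve; `send` stands in for whatever function issues the request and raises `Rejected` when the server reports it is at capacity. The exponential backoff with jitter is one common choice, not something prescribed by Ray.

```python
import random
import time


class Rejected(Exception):
    """Hypothetical error raised by send() when the server sheds the request."""


def retry_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it stops raising Rejected or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Rejected:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the rejection to the caller.
                raise
            # Exponential backoff with full jitter, so that many clients
            # rejected at the same moment do not all retry at once.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)
```

The random jitter matters for the goal stated above: without it, all rejected clients would come back together and recreate the spike.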

Is this possible out of the box? My idea is to make an Ingress class that contains no model instance and is only responsible for receiving requests. It would forward a request to the Llama deployment only if the number of concurrent requests is not maxed out. Is it possible for Deployment A to see the usage of Deployment B to facilitate this behavior?

Sihan_Wang · December 29, 2023, 5:55pm

Hi @RJ_Lucas, load shedding is not supported yet. Could you file a ticket for this feature?

Deployment A does see Deployment B's usage, but for now the request will be queued in Deployment A. In the current Ray Serve version, you have to add your own cancellation logic in your code.
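As a rough illustration of what such custom logic could look like, here is a minimal sketch of an in-process gate that rejects work immediately instead of queueing it. It uses plain asyncio rather than any Ray Serve API; `LoadSheddingGate`, `Overloaded`, and `fake_inference` are hypothetical names, and `fake_inference` stands in for the call from the ingress to the model deployment. In a real ingress you would presumably catch `Overloaded` and return an HTTP 503 to the client. Note that the counter only tracks calls made through this one gate instance, so with several ingress replicas each would enforce its own limit.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the gate is full and the request is shed."""


class LoadSheddingGate:
    """Reject work immediately once max_in_flight calls are running."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def run(self, make_call):
        # The check and the increment have no await between them, so they
        # cannot interleave with other coroutines on the same event loop.
        if self.in_flight >= self.max_in_flight:
            raise Overloaded("too many requests in flight")
        self.in_flight += 1
        try:
            return await make_call()
        finally:
            # Always release the slot, even if the call fails or is cancelled.
            self.in_flight -= 1


async def fake_inference():
    # Stand-in for the call to the model deployment (assumption).
    await asyncio.sleep(0.05)
    return "ok"


async def demo():
    gate = LoadSheddingGate(max_in_flight=2)
    # Fire 4 requests at once; only 2 slots are available.
    return await asyncio.gather(
        *(gate.run(fake_inference) for _ in range(4)),
        return_exceptions=True,
    )
```

Running `asyncio.run(demo())` lets the first two calls through and sheds the other two with `Overloaded` rather than leaving them waiting in a queue.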
