How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I am using Ray Serve to deploy LLAMA 2 13B Chat with Hugging Face Transformers. I have found that when traffic spikes and many inferences run concurrently, the GPU becomes overloaded. For example, running 2 inferences concurrently and then 2 more concurrently is more performant than running all 4 concurrently.
To help alleviate this problem, I am using max_concurrent_requests to limit how many inferences can be in progress at once. However, it seems that once this limit is reached, the Serve proxy continues to accept new requests and simply waits for a replica to become available, which causes timeouts on the client.
What I'd like is to reject a request immediately when concurrent requests are maxed out, so that the client making the request can retry after a short wait. I'm hoping that will spread the load out and lead to higher throughput.
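On the client side, the retry behavior I have in mind would look roughly like this (a self-contained sketch; call_with_retry and the send callable are placeholders I made up, not part of any Ray or HTTP client API):

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=0.1):
    """Retry a request that the server may reject immediately with a 503.

    `send` is a stand-in for the actual HTTP call and should return
    a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        # Exponential backoff with jitter, to spread retries out
        # instead of all rejected clients hammering the server at once.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    return status, body
```

The jitter matters here: if every rejected client retried after the same fixed delay, the spike would just repeat.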
Is this possible out of the box? My idea is to make an Ingress class that contains no model instance and is responsible for receiving requests. It would then only forward a request to the LLAMA deployment if concurrent requests are not maxed out. Is it possible for Deployment A to see the usage of Deployment B to facilitate this behavior?
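To be concrete, the admission-control logic I'm imagining in the ingress is something like the following (a plain-Python sketch of the counting/rejection idea only; AdmissionController and handle_request are my own hypothetical names, and none of this uses Ray Serve APIs):

```python
import asyncio


class AdmissionController:
    """Tracks in-flight requests and rejects new ones past a limit."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def try_acquire(self) -> bool:
        # Reject immediately instead of queueing when at capacity.
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


async def handle_request(ctrl: AdmissionController) -> str:
    if not ctrl.try_acquire():
        # Tell the client to back off and retry, rather than timing out.
        return "503 Service Unavailable"
    try:
        await asyncio.sleep(0)  # stand-in for forwarding to the model deployment
        return "200 OK"
    finally:
        ctrl.release()
```

The open question is where the counter should live: for this to work, the ingress deployment would need an accurate view of how many requests are currently in flight on the LLAMA deployment's replicas.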