We have a Ray Serve application that contains multiple ML models, deployed primarily for computer vision applications. The individual latencies of the actors are quite low (~100 ms).
Each image is processed sequentially through some models and in parallel across others, and the results are combined. The web interface sends API requests in batches of 3 images. Each image is first sent to an "analysis" actor, which then dispatches calls to the ML model actors.

Initially, latency per request is low. However, after processing around 500–600 images, we observe a consistent pattern:
- Each ML actor shows increasing queue wait times.
- End-to-end latency grows significantly.
- Eventually, the Serve application stops processing altogether and hangs.

We want to identify what could be causing the increasing queue delays and the eventual system stall.
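For context, here is a minimal sketch of how the pipeline is wired. The deployment names and bodies below (`ModelA`, `ModelB`, `Analysis`) are illustrative placeholders, not our exact code:

```python
from ray import serve


@serve.deployment(num_replicas=2)
class ModelA:
    async def __call__(self, image):
        # placeholder for the actual model inference (~100 ms per call)
        return {"a": "result"}


@serve.deployment(num_replicas=2)
class ModelB:
    async def __call__(self, image):
        return {"b": "result"}


@serve.deployment
class Analysis:
    def __init__(self, model_a, model_b):
        # Bound deployments are injected as DeploymentHandles at runtime.
        self._model_a = model_a
        self._model_b = model_b

    async def __call__(self, image):
        # In the real app, the HTTP ingress decodes each batch of 3 images
        # before this point. Dispatch to both models; the calls run in
        # parallel on their replicas.
        resp_a = self._model_a.remote(image)
        resp_b = self._model_b.remote(image)
        # Both requests are already in flight; awaiting only collects results.
        return {"model_a": await resp_a, "model_b": await resp_b}


app = Analysis.bind(ModelA.bind(), ModelB.bind())
# serve.run(app)
```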
One key piece of information is that the `listen_for_change` tasks keep building up even after the deployment is stable. (There is no autoscaling; we keep the replica counts fixed on our development cluster.) The pattern is that when latency grows significantly, the number of `listen_for_change` tasks also grows exponentially, and then the cluster stops responding.
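For reference, this is roughly how the buildup can be watched over time. This is only a sketch assuming Ray's state API in 2.x; the same counts are visible in the dashboard, and `ray list tasks --filter "name=listen_for_change"` should give the equivalent from the CLI:

```python
# Small watcher that logs how many listen_for_change tasks exist over time.
# Run on the head node (or with RAY_ADDRESS pointing at the cluster).
import time

from ray.util.state import list_tasks

while True:
    tasks = list_tasks(
        filters=[("name", "=", "listen_for_change")],
        limit=10_000,  # the default limit (100) is far below the counts we see
    )
    print(f"{time.strftime('%H:%M:%S')} listen_for_change tasks: {len(tasks)}")
    time.sleep(30)
```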
1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.46.0
- Python version: 3.10
- OS:
- Cloud/Infrastructure:
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Requests should not be held in a queue, since memory usage is not hitting the maximum on either the CPU or the GPU.
- Actual: Requests keep waiting in the queue, and after a point the cluster stops responding.
We can see below that 7350 `listen_for_change` tasks have been created.
This issue inevitably appears after we process around 500–600 frames.