We have a Ray Serve application that contains multiple ML models, deployed primarily for computer vision applications. The individual latencies of the actors are quite low (~100 ms).
Each image is processed sequentially through some models and in parallel across others, and the results are combined. The web interface sends API requests in batches of 3 images. Each image is first sent to an "analysis" actor, which then dispatches calls to the ML model actors.

Initially, latency per request is low. However, after processing around 500–600 images, we observe a consistent pattern:
- Each ML actor shows increasing queue wait times.
- End-to-end latency grows significantly.
- Eventually, the Serve application stops processing altogether and hangs.

We want to identify what could be causing the increasing queue delays and the eventual system stall.
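For context, here is a minimal sketch of how the pipeline is wired. The deployment names and bodies below (`ModelA`, `ModelB`, `Analysis`) are illustrative placeholders, not our exact code:

```python
from ray import serve


@serve.deployment(num_replicas=2)
class ModelA:
    async def __call__(self, image):
        # placeholder for the actual model inference (~100 ms per call)
        return {"a": "result"}


@serve.deployment(num_replicas=2)
class ModelB:
    async def __call__(self, image):
        return {"b": "result"}


@serve.deployment
class Analysis:
    def __init__(self, model_a, model_b):
        # Bound deployments are injected as DeploymentHandles at runtime.
        self._model_a = model_a
        self._model_b = model_b

    async def __call__(self, image):
        # In the real app, the HTTP ingress decodes each batch of 3 images
        # before this point. Dispatch to both models; the calls run in
        # parallel on their replicas.
        resp_a = self._model_a.remote(image)
        resp_b = self._model_b.remote(image)
        # Both requests are already in flight; awaiting only collects results.
        return {"model_a": await resp_a, "model_b": await resp_b}


app = Analysis.bind(ModelA.bind(), ModelB.bind())
# serve.run(app)
```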
One key piece of information is that the `listen_for_change` tasks keep building up even after the deployment is stable. (There is no autoscaling; we keep the replica counts fixed on our development cluster.) The pattern is that when latency grows significantly, the number of `listen_for_change` tasks also grows exponentially, and then the cluster stops responding.
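For reference, this is roughly how the buildup can be watched over time. This is only a sketch assuming Ray's state API in 2.x; the same counts are visible in the dashboard, and `ray list tasks --filter "name=listen_for_change"` should give the equivalent from the CLI:

```python
# Small watcher that logs how many listen_for_change tasks exist over time.
# Run on the head node (or with RAY_ADDRESS pointing at the cluster).
import time

from ray.util.state import list_tasks

while True:
    tasks = list_tasks(
        filters=[("name", "=", "listen_for_change")],
        limit=10_000,  # the default limit (100) is far below the counts we see
    )
    print(f"{time.strftime('%H:%M:%S')} listen_for_change tasks: {len(tasks)}")
    time.sleep(30)
```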
1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.46.0
- Python version: 3.10
- OS:
- Cloud/Infrastructure:
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Requests should not be held in a queue, since memory usage is not hitting the maximum on either the CPU or the GPU.
- Actual: Requests keep waiting in the queue, and after a point the cluster stops responding.
We can see below that 7350 `listen_for_change` tasks have been created.
This issue inevitably appears after we process around 500–600 frames.