Ray cluster deadlocked after drive full

RunLLM · December 2, 2025, 1:55pm

Yes, this is a significant operational risk: if spilled objects are not cleaned up and the disk fills up, the cluster can become stuck and will not recover automatically; manual intervention is required (e9bfd7ea, b6198811). When a job fails, its object references are normally released, but Ray can retain them if the driver or actors do not exit cleanly, or if other jobs or detached actors still hold references (56930687). Any remaining reference prevents the corresponding spilled objects from being deleted, which is most likely when the job did not shut down gracefully.

If the job is truly dead and all references are gone, the objects should eventually be cleaned up, but in practice, leaks or stuck references can occur, especially in complex or error-prone workloads (56930687). This is why Ray clusters in production need careful monitoring and sometimes custom cleanup logic.
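As a rough illustration of the monitoring mentioned above, here is a minimal sketch that uses only the Python standard library. It measures how full the filesystem is and how many bytes sit under a spill directory. The directory path, the 90% threshold, and the function names are assumptions for illustration, not Ray APIs or Ray defaults; point it at wherever your cluster is configured to spill.

```python
import shutil
from pathlib import Path

# Hypothetical location; replace with the spill directory your cluster uses.
SPILL_ROOT = "/tmp/ray"


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def spill_dir_size_bytes(spill_dir):
    """Sum the sizes of all regular files under `spill_dir` (0 if it is missing)."""
    root = Path(spill_dir)
    if not root.exists():
        return 0
    total = 0
    for entry in root.rglob("*"):
        try:
            if entry.is_file():
                total += entry.stat().st_size
        except OSError:
            # Files can disappear while we scan; skip them.
            continue
    return total


def check_disk(path, threshold=0.9):
    """Return (over_threshold, used_fraction) for the filesystem holding `path`."""
    fraction = disk_usage_fraction(path)
    return fraction >= threshold, fraction


if __name__ == "__main__":
    # Fall back to the filesystem root if the assumed directory does not exist.
    target = SPILL_ROOT if Path(SPILL_ROOT).exists() else "/"
    over, fraction = check_disk(target, threshold=0.9)
    spilled = spill_dir_size_bytes(SPILL_ROOT)
    print(f"disk used: {fraction:.1%}, bytes under {SPILL_ROOT}: {spilled}")
    if over:
        print("WARNING: disk usage above threshold; check for lingering references")
```

Run on each node from cron or a sidecar, a check like this can raise an alert before the disk is completely full. When it fires, the `ray memory` command is one way to see which object references are still being held.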

Would you like more detail on how to monitor for lingering references or best practices for robust production deployments?

Sources:

The pending tasks/actors remain on Ray cluster when the driver die unexpected

