Yes, this is a significant operational risk: if spilled objects are not cleaned up and the disk fills, the cluster can get stuck and will not recover automatically; manual intervention is required (e9bfd7ea, b6198811). After a job fails, its references should normally be released, but Ray can retain them if the driver or actors do not exit cleanly, or if other jobs or detached actors still hold references (56930687). Either case can prevent spilled objects from being deleted.
If the job is truly dead and all references are gone, the objects should eventually be cleaned up. In practice, leaked or stuck references do occur, especially in complex or error-prone workloads (56930687), which is why production Ray clusters need careful monitoring and sometimes custom cleanup logic.
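As a rough illustration of the monitoring side, here is a minimal sketch that uses only the Python standard library to report how much data sits in a spill directory and whether the disk holding it is close to full. The function names, the `warn_fraction` threshold, and the directory you pass in are all assumptions for illustration; point it at whatever spill location your cluster is actually configured to use. It does not call any Ray API and does not delete anything.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of regular files under path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been removed between listing and stat.
                pass
    return total


def spill_report(spill_dir, warn_fraction=0.8):
    """Summarize spill usage. spill_dir and warn_fraction are hypothetical inputs."""
    # shutil.disk_usage needs an existing path, so walk up to the nearest one.
    probe = os.path.abspath(spill_dir)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent
    usage = shutil.disk_usage(probe)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "spilled_bytes": dir_size_bytes(spill_dir),
        "disk_used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,
    }
```

You could run something like `spill_report("/path/to/your/spill/dir")` periodically on each node and alert when `warn` is true. This only tells you that disk is filling up; to see which references are keeping objects alive, the `ray memory` CLI is the usual starting point.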
Would you like more detail on how to monitor for lingering references or best practices for robust production deployments?