If a worker is not killed even when there is no work left, the cause is often that Ray's autoscaler or resource manager does not see the node as idle. This can happen when there are lingering references, stuck processes, or disk pressure on the node. When a node's disk is completely full, Ray may be unable to write logs, update its state, or clean up spill files, which can cause both the autoscaler and the dashboard to malfunction or hang (e9bfd7ea, b6198811). Workers themselves can get stuck if they are waiting for resources, blocked on I/O, or running on a node left in a broken state by the full disk.
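If you are not sure whether disk pressure is actually the culprit, a quick check on the affected node can confirm it before you dig into autoscaler behavior. This is only a minimal sketch: it assumes the default Ray temp directory (`/tmp/ray`) and uses an illustrative free-space threshold, so adjust both for your deployment.

```python
# Minimal sketch: confirm whether a node is under disk pressure before
# assuming an autoscaler/idle-detection problem. Assumes the default Ray
# temp directory (/tmp/ray); adjust if your cluster overrides it
# (e.g. via `ray start --temp-dir=...`).
import shutil

RAY_TMP = "/tmp/ray"  # assumption: default temp dir

usage = shutil.disk_usage(RAY_TMP)
pct_used = usage.used / usage.total * 100
print(f"{RAY_TMP}: {pct_used:.1f}% used, {usage.free / 1e9:.1f} GB free")

# A nearly full disk here is consistent with Ray being unable to write
# logs, update state, or spill objects, which can stall worker cleanup.
if usage.free < 5e9:  # threshold is illustrative, not a Ray default
    print("Disk pressure likely; free space before debugging the autoscaler.")
```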
In this situation, the full disk likely prevents Ray from cleaning up spill files, updating its state, or terminating workers properly. Manual intervention is required: free up disk space by deleting old spill files or logs, and consider restarting the affected node(s) to restore normal operation (e9bfd7ea).
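For the cleanup itself, a sketch along the following lines can remove stale session data. It assumes the default `/tmp/ray` layout, where `session_latest` points at the live session and older `session_*` directories hold stale logs and spilled objects; verify the paths on your nodes before deleting anything.

```python
# Minimal cleanup sketch, assuming the default temp dir layout
# (/tmp/ray/session_*). Removes only *old* session directories and
# never touches the active session.
import shutil
from pathlib import Path

ray_tmp = Path("/tmp/ray")  # assumption: default --temp-dir
live = (ray_tmp / "session_latest").resolve()  # currently active session

for session in sorted(ray_tmp.glob("session_*")):
    if session.is_symlink() or session.resolve() == live:
        continue  # skip the symlink and the live session
    print(f"Removing old session dir: {session}")
    shutil.rmtree(session, ignore_errors=True)
```

After freeing space, restarting Ray on the affected node (`ray stop`, then your usual `ray start` command) typically lets the autoscaler and dashboard recover.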
Would you like step-by-step instructions for cleaning up disk space and recovering the cluster?