If a worker is not killed even when there is no work left, the cause is often that Ray's autoscaler or resource manager does not see the node as idle. This can happen when there are lingering references, stuck processes, or disk pressure on the node. When a node's disk is completely full, Ray may be unable to write logs, update its state, or clean up spill files, which can cause both the autoscaler and the dashboard to malfunction or hang (e9bfd7ea, b6198811). Workers themselves can get stuck if they are waiting for resources, blocked on I/O, or running on a node left in a broken state by the full disk.
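If you are not sure whether disk pressure is actually the culprit, a quick check on the affected node can confirm it before you dig into autoscaler behavior. This is only a minimal sketch: it assumes the default Ray temp directory (`/tmp/ray`) and uses an illustrative free-space threshold, so adjust both for your deployment.

```python
# Minimal sketch: confirm whether a node is under disk pressure before
# assuming an autoscaler/idle-detection problem. Assumes the default Ray
# temp directory (/tmp/ray); adjust if your cluster overrides it
# (e.g. via `ray start --temp-dir=...`).
import shutil

RAY_TMP = "/tmp/ray"  # assumption: default temp dir

usage = shutil.disk_usage(RAY_TMP)
pct_used = usage.used / usage.total * 100
print(f"{RAY_TMP}: {pct_used:.1f}% used, {usage.free / 1e9:.1f} GB free")

# A nearly full disk here is consistent with Ray being unable to write
# logs, update state, or spill objects, which can stall worker cleanup.
if usage.free < 5e9:  # threshold is illustrative, not a Ray default
    print("Disk pressure likely; free space before debugging the autoscaler.")
```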
In this situation, the full disk likely prevents Ray from cleaning up spill files, updating its state, or terminating workers properly. Manual intervention is required: free up disk space by deleting old spill files or logs, and consider restarting the affected node(s) to restore normal operation (e9bfd7ea).
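For the cleanup itself, a sketch along the following lines can remove stale session data. It assumes the default `/tmp/ray` layout, where `session_latest` points at the live session and older `session_*` directories hold stale logs and spilled objects; verify the paths on your nodes before deleting anything.

```python
# Minimal cleanup sketch, assuming the default temp dir layout
# (/tmp/ray/session_*). Removes only *old* session directories and
# never touches the active session.
import shutil
from pathlib import Path

ray_tmp = Path("/tmp/ray")  # assumption: default --temp-dir
live = (ray_tmp / "session_latest").resolve()  # currently active session

for session in sorted(ray_tmp.glob("session_*")):
    if session.is_symlink() or session.resolve() == live:
        continue  # skip the symlink and the live session
    print(f"Removing old session dir: {session}")
    shutil.rmtree(session, ignore_errors=True)
```

After freeing space, restarting Ray on the affected node (`ray stop`, then your usual `ray start` command) typically lets the autoscaler and dashboard recover.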
Would you like step-by-step instructions for cleaning up disk space and recovering the cluster?