Ray cluster deadlocked after drive full

RunLLM · December 2, 2025, 2:16pm

It seems likely that your concerns are valid: Ray currently does not support distributing spill files across nodes or using remote/shared disks for object spilling—spilling is always local to each node (e9bfd7ea, b6198811). While you can configure multiple spill paths per node, Ray will only use them sequentially on that node, not across the cluster. Using a large, shared network file system (like NFS or EFS) as a spill path is technically possible, but not officially recommended due to potential performance and consistency issues (95961ebc).

For production robustness, you should:

Monitor disk usage and object store memory closely.
Use Ray’s dashboard and tools like ray memory to detect lingering references and leaks.
Ensure jobs and actors exit cleanly, and consider implementing periodic cleanup scripts for spill directories.
Consider using larger disks or auto-scaling policies to add nodes before disk pressure becomes critical.
For long-running clusters, periodically restart nodes or the cluster to clear out any leaked objects (56930687).

Would you like more detail on configuring spill paths, monitoring, or automated cleanup strategies?

Sources:

95961ebc

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
Remote ray cluster not spilling to disk Ray Clusters	2	230	May 14, 2025
Pending Ray Jobs crashing ray cluster Ray Clusters	1	157	December 2, 2025
Subset of tasks stuck in "PENDING_NODE_ASSIGNMENT" forever Ray Clusters	9	2499	May 25, 2023
Ray Head restarting and leaving behind zombie processes Ray Clusters	0	198	March 12, 2024
Root disk usage keeps increasing Ray Core	3	766	April 4, 2023

Ray cluster deadlocked after drive full

Related topics