It seems likely that your concerns are valid: Ray currently does not support distributing spill files across nodes or using remote/shared disks for object spilling—spilling is always local to each node (e9bfd7ea, b6198811). While you can configure multiple spill paths per node, Ray will only use them sequentially on that node, not across the cluster. Using a large, shared network file system (like NFS or EFS) as a spill path is technically possible, but not officially recommended due to potential performance and consistency issues (95961ebc).
For production robustness, you should:
- Monitor disk usage and object store memory closely.
- Use Ray’s dashboard and tools like
ray memoryto detect lingering references and leaks. - Ensure jobs and actors exit cleanly, and consider implementing periodic cleanup scripts for spill directories.
- Consider using larger disks or auto-scaling policies to add nodes before disk pressure becomes critical.
- For long-running clusters, periodically restart nodes or the cluster to clear out any leaked objects (56930687).
Would you like more detail on configuring spill paths, monitoring, or automated cleanup strategies?
Sources:
Hint: Mention @RunLLM in the post for followups.