Ray cluster deadlocked after drive full

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: ray[all]==2.52.0
  • Python version: 3.12.7
  • OS: Amazon Linux
  • Cloud/Infrastructure: AWS

3. What happened vs. what you expected:

I am very new to this cloud stuff, so be kind to me!

After a week of reading, fiddling, and coding I have a Ray compute cluster running, which is awesome.

I run 4 CPU nodes and 2 GPU nodes. All of those nodes have a lot of RAM (500 GB and 300 GB) but only a very small gp3 drive attached (100 GB).

I first thought that the object store would be distributed over all the RAM in the nodes, so I did not think I would need much space for spilling. It seems I misunderstood that, sadly.
I work with point clouds with billions of points, so calculations sometimes lead to some serious spilling.

I now wanted to stress test the cluster and started 5 jobs at once. At some point the drive of a node ran full and jobs started to die.
Most remote calls in the pipeline do not (yet) tell Ray their resource needs, so it makes sense that things explode; I was expecting that.

Now, of the 5 jobs, 3 died and 2 seem to be completely stuck.
The drives of 2 nodes are still full to the brim, so I believe no data from the failed jobs was deleted.
Both ‘running’ jobs have a bunch of tasks in ‘PENDING_NODE_ASSIGNMENT’ that are never assigned. The tasks each only request 1 CPU.
In the overview the cluster shows resource status 43.0/384.0 CPU, so I am unsure why the tasks are not being assigned.

So my questions: why is the spilled data of failed jobs not deleted?
And why are the tasks not assigned to nodes?

Thank you!

The spilled data from failed jobs is not automatically deleted because Ray does not currently implement disk-aware load balancing or automatic cleanup of spilled objects when jobs fail; spilled objects remain on disk until their references are explicitly released or the node is restarted. This can leave a node out of disk space, which in turn blocks new tasks from being scheduled even when CPUs are available, because Ray cannot create new objects or spill further once the disk is full. This behavior is confirmed in both user reports and Ray’s documentation, which note that Ray does not rebalance or clean up spilled objects across nodes and that disk pressure can block task scheduling even if other resources are available (e9bfd7ea, b6198811, 159bc4bb).

Tasks remain in the PENDING_NODE_ASSIGNMENT state because, with the disk full, Ray cannot create or spill new objects required for task execution, effectively blocking scheduling on those nodes. Even if CPUs are available, Ray’s scheduler will not assign tasks if the node cannot handle object creation due to disk pressure (b6198811, e9bfd7ea). To resolve this, you should manually clean up the spill directories on the affected nodes or restart the nodes to free up disk space. For future runs, consider increasing disk size or tuning your workload to reduce object store and spill usage.
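For example, a minimal manual cleanup sketch, assuming the default spill location under Ray’s temp dir (/tmp/ray/session_*/ray_spilled_objects); adjust the path if you use a custom temp dir or spilling config, and only run it on a node where no live job still needs the spilled data:

import glob
import os
import shutil

# Assumed default spill location; change this if you set a custom
# object_spilling_config or --temp-dir.
SPILL_GLOB = "/tmp/ray/session_*/ray_spilled_objects/*"

total, used, free = shutil.disk_usage("/tmp")
print(f"/tmp: {used / 1e9:.1f} GB used, {free / 1e9:.1f} GB free")

for path in glob.glob(SPILL_GLOB):
    size_mb = os.path.getsize(path) / 1e6
    print(f"removing {path} ({size_mb:.1f} MB)")
    os.remove(path)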

Would you like more detail on how to monitor or clean up spilled objects, or on configuring Ray to better handle large workloads?

@RunLLM
I see. I guess that makes sense, but it is also quite a big problem.
So for production use we have to be very careful that this never happens, because if it does the whole cluster stops functioning and can’t recover by itself?

This seems like a scary situation. When customer data that we don’t have full control over starts pouring in, we will have a hard time making the cluster robust.

You said that nothing is deleted from the object store if there is still something referencing it. But after a job fails, is there anything left referencing it? Is the failed job still ‘there’ somehow, ‘alive’ and keeping references?
I would think a failed job is dead and killed, so it can’t hold any references anymore.

Yes, this is a significant operational risk: if spilled objects are not cleaned up and the disk fills up, the cluster can become stuck and will not recover automatically; manual intervention is required (e9bfd7ea, b6198811). After a job fails, its references should in most cases be released, but Ray sometimes retains references if the driver or actors do not exit cleanly, or if there are lingering references in other jobs or detached actors (56930687). This can prevent spilled objects from being deleted, especially if the job did not shut down gracefully.

If the job is truly dead and all references are gone, the objects should eventually be cleaned up, but in practice leaks or stuck references can occur, especially in complex or error-prone workloads (56930687). This is why Ray clusters in production need careful monitoring and sometimes custom cleanup logic.
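As a starting point, here is a sketch of how you could poll for objects that are still pinned after all jobs have finished, using the Ray state API (ray.util.state in recent Ray versions); treat the exact field names as an assumption based on what ray list objects prints:

from collections import Counter

import ray
from ray.util.state import list_objects

ray.init(address="auto")  # attach to the running cluster

# Anything still PINNED_IN_MEMORY after all jobs have finished is a
# candidate leak worth investigating.
objs = list_objects(limit=10000)
print(Counter(o.reference_type for o in objs))

for o in objs:
    if o.reference_type == "PINNED_IN_MEMORY":
        print(o.object_id, o.call_site, o.pid, o.ip)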

Would you like more detail on how to monitor for lingering references or best practices for robust production deployments?

@RunLLM
Oh well, damn. I was hoping for the cluster to be the holy grail solution to our issues, but I am less confident now.

So to mitigate this issue we need enough space to spill on every node, which will probably get unnecessarily expensive quite fast, especially because most nodes will not need the space most of the time; just some odd circumstances will lead to such issues.

It would really be good if this spilling could also be distributed, like using the drive of another node.

If I remember correctly it is possible to define multiple spill paths that are used one after another. So we could get a small, super fast drive for every node, and then as a fallback something huge and slow, maybe even shared between the nodes?

Anyway, what can I do in a production environment to make this more robust, clean up dead jobs, and fight lingering references?
Leaking can’t be allowed in a system that should stay running.

It seems likely that your concerns are valid: Ray currently does not support distributing spill files across nodes or using remote/shared disks for object spilling—spilling is always local to each node (e9bfd7ea, b6198811). While you can configure multiple spill paths per node, Ray will only use them sequentially on that node, not across the cluster. Using a large, shared network file system (like NFS or EFS) as a spill path is technically possible, but not officially recommended due to potential performance and consistency issues (95961ebc).
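For reference, multiple local spill directories are configured through the object spilling config, roughly like the sketch below; the paths are placeholders, and you should check the Object Spilling docs for the exact schema in your Ray version:

import json

import ray

# Sketch: spill to a fast local NVMe first, then fall back to a larger,
# slower volume. Both directories are local to each node; Ray will not
# spill to another node's disk.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {
                "type": "filesystem",
                "params": {
                    # Placeholder paths; they must exist on every node.
                    "directory_path": [
                        "/mnt/fast_nvme/ray_spill",
                        "/mnt/big_slow/ray_spill",
                    ],
                },
            }
        )
    }
)

Note that on a multi-node cluster the system config is applied when the nodes are started (for example via your cluster config), not from a driver that merely connects afterwards.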

For production robustness, you should:

  • Monitor disk usage and object store memory closely.
  • Use Ray’s dashboard and tools like ray memory to detect lingering references and leaks.
  • Ensure jobs and actors exit cleanly, and consider implementing periodic cleanup scripts for spill directories (see the sketch after this list).
  • Consider using larger disks or auto-scaling policies to add nodes before disk pressure becomes critical.
  • For long-running clusters, periodically restart nodes or the cluster to clear out any leaked objects (56930687).
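A minimal sketch of such a cleanup script, assuming the default /tmp/ray temp dir and that session_latest points at the live session; it removes stale session directories (which contain spill files and logs) that have been idle for more than a day:

import os
import shutil
import time

RAY_TMP = "/tmp/ray"      # assumed default temp dir
MAX_AGE_S = 24 * 3600     # only touch sessions idle for at least a day

# session_latest is a symlink to the currently active session directory.
live = os.path.realpath(os.path.join(RAY_TMP, "session_latest"))

for name in os.listdir(RAY_TMP):
    path = os.path.join(RAY_TMP, name)
    if not name.startswith("session_") or os.path.realpath(path) == live:
        continue
    if os.path.isdir(path) and time.time() - os.path.getmtime(path) > MAX_AGE_S:
        print(f"removing stale session dir {path}")
        shutil.rmtree(path, ignore_errors=True)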

Would you like more detail on configuring spill paths, monitoring, or automated cleanup strategies?

@RunLLM

When attached to the cluster I can see that no actor or task is left alive (except _stats_actor).
There are still 4 pinned objects, though, that should probably not be there?

Table (group by class):
------------------------------------
    CLASS_NAME          STATE_COUNTS
0   JobSupervisor       DEAD: 5
1   RemoteContextStore  DEAD: 1
2   _StatsActor         ALIVE: 1

root@ip-172-31-11-111:/app# ray summary tasks
2025-12-02 14:39:08,219 INFO api.py:68 -- (3.75 / 30 seconds) Waiting for the response from the API server address http://172.31.11.111:8265/api/v0/tasks/summarize.

======== Tasks Summary: 2025-12-02 14:39:11.454628 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 1031
total_tasks: 8969


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME           STATE_COUNTS    TYPE
0   unpack_node                  FAILED: 24      NORMAL_TASK
                                 FINISHED: 8943
1   _StatsActor.get_datasets     FINISHED: 203   ACTOR_TASK
2   RemoteContextStore.get_all   FINISHED: 827   ACTOR_TASK
3   reduce_partitions            FINISHED: 1     NORMAL_TASK
4   RemoteContextStore.set_many  FINISHED: 1     ACTOR_TASK
5   download_dataset             FINISHED: 1     NORMAL_TASK


root@ip-172-31-11-111:/app# ray list objects 

======== List: 2025-12-02 14:43:53.449378 ========
Stats:
------------------------------
Total: 4

Table:
------------------------------
    OBJECT_ID                                                 OBJECT_SIZE    TASK_STATUS      ATTEMPT_NUMBER  REFERENCE_TYPE    CALL_SITE                                                           TYPE       PID  IP
 0  006fadbbb879c3b0ffffffffffffffffffffffff0500000002e1f505  807.000 B      NIL                           1  PINNED_IN_MEMORY  (deserialize task arg) downloader.src.unpacking_worker.unpack_node  WORKER  256089  172.31.0.25
 1  006fadbbb879c3b0ffffffffffffffffffffffff0500000003e1f505  351.000 B      NIL                           1  PINNED_IN_MEMORY  (deserialize task arg) downloader.src.unpacking_worker.unpack_node  WORKER  256089  172.31.0.25
 2  006fadbbb879c3b0ffffffffffffffffffffffff0500000004e1f505  10.000 B       NIL                           1  PINNED_IN_MEMORY  (deserialize task arg) downloader.src.unpacking_worker.unpack_node  WORKER  256089  172.31.0.25
 3  006fadbbb879c3b0ffffffffffffffffffffffff05000000b108f605  386.284 KiB    NIL                           1  PINNED_IN_MEMORY  (deserialize task arg) downloader.src.unpacking_worker.unpack_node  WORKER  256089  172.31.0.25

Still, the tmp directory is not cleaned out. Also, the cluster is not downscaling, even though no work has been done in quite some time.

The 4 “PINNED_IN_MEMORY” objects remain because the worker process (PID 256089) still holds deserialized references to them, even though all actors and tasks are dead. In Ray, objects are only released from the object store when all references—including those in worker memory—are gone. If a worker process does not exit cleanly or leaks references, these objects can remain pinned, preventing cleanup and blocking cluster downscaling (ba752da8, 624d3d09).

To resolve this, ensure the worker process (PID 256089) is terminated or restarted, which should release the references and allow Ray to clean up the objects and downscale the cluster. If the process is stuck, you may need to kill it manually. For robust production use, monitor for lingering worker processes and automate cleanup if necessary.
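For example, a minimal sketch for stopping the stuck worker on that node (the PID comes from the ray list objects output above); SIGTERM first, so the worker has a chance to exit cleanly and release its references:

import os
import signal

STUCK_WORKER_PID = 256089  # from the "ray list objects" output above

try:
    # Ask the worker to exit cleanly so it releases its object references;
    # escalate to SIGKILL only if it refuses to die.
    os.kill(STUCK_WORKER_PID, signal.SIGTERM)
    print(f"sent SIGTERM to {STUCK_WORKER_PID}")
except ProcessLookupError:
    print("worker process is already gone")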

Would you like more detail on how to automate detection and cleanup of such leaks?

@RunLLM

But why is the worker not killed if there is no work left?

Also, looking at the cluster I see that the disks are still full, multiple hundreds of GB, so it’s not those objects?

In the dashboard I see on one node that it says WORKER (2); sadly the dashboard crashes when I click on that tab, so I am unsure what’s up with that.
How can it happen that workers get stuck like this?
This is one of the machines with a completely full disk, only 20 KB free. This might have broken things?

If a worker is not killed even when there is no work left, it is often because Ray’s autoscaler or resource manager does not detect the node as idle; this can happen if there are lingering references, stuck processes, or if the node is under disk pressure. When a node’s disk is completely full, Ray may be unable to write logs, update state, or clean up spill files, which can cause both the autoscaler and the dashboard to malfunction or hang (e9bfd7ea, b6198811). Stuck workers can occur if they are waiting for resources, blocked on I/O, or if the node is in a broken state due to a full disk.

In this situation, the full disk likely prevents Ray from cleaning up, updating its state, or terminating workers properly. Manual intervention is required: free up disk space by deleting old spill files or logs, and consider restarting the affected node(s) to restore normal operation (e9bfd7ea).
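As a quick triage sketch before deleting anything, you can check what is actually occupying the space under Ray’s temp dir (assuming the default /tmp/ray location):

import os

RAY_TMP = "/tmp/ray"  # assumed default temp dir

def dir_size_bytes(path: str) -> int:
    """Total size of all files below path."""
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(root, f))
            except OSError:
                pass  # file vanished or is unreadable
    return total

for name in sorted(os.listdir(RAY_TMP)):
    path = os.path.join(RAY_TMP, name)
    if os.path.isdir(path):
        print(f"{dir_size_bytes(path) / 1e9:8.2f} GB  {path}")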

Would you like step-by-step instructions for cleaning up disk space and recovering the cluster?

@RunLLM

I will reboot the cluster with more storage.
But how much more? I feel like this is just kicking the can down the road.

You’re correct: simply adding more storage may only delay the problem if the underlying causes (object leaks, unbounded spilling, or lack of cleanup) aren’t addressed. The best practice is to size storage so that the working set and expected spill volume fit comfortably, but also to monitor and control memory usage, spill rates, and reference lifetimes (ba752da8, 7094df69).
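As a rough back-of-the-envelope sketch (all numbers below are placeholders, not recommendations), one way to think about per-node disk is: whatever a job may materialize on a node beyond its object store memory has to go to disk, multiplied by how many jobs can hit that node at once, plus a safety margin:

# Hypothetical sizing estimate; plug in figures from your own workload.
object_store_gb = 150    # object store memory per node (often ~30% of RAM)
working_set_gb = 400     # objects one job may materialize on a single node
concurrent_jobs = 5      # jobs that can overlap on that node
safety_factor = 1.5      # headroom for skew and retries

spill_per_job_gb = max(0, working_set_gb - object_store_gb)
disk_needed_gb = spill_per_job_gb * concurrent_jobs * safety_factor
print(f"budget roughly {disk_needed_gb:.0f} GB of spill space per node")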

For robust production, combine increased storage with:

  • Monitoring object store and disk usage.
  • Limiting the number of concurrent jobs/tasks.
  • Ensuring all references are released after job completion.
  • Periodic cleanup or node restarts to clear leaks.

Would you like more detail on how to estimate storage needs or set up monitoring and cleanup?
