Ray spilled objects even though much of /dev/shm is empty

Ray spilled objects even though I have plenty of free space in /dev/shm and in RAM. I have shared the /dev/shm size and RAM size as well. Could you please tell me what the reason might be? Thanks in advance.

Can you share the output of ray status? Ray spills when the object store memory is full, and the default object store memory is 30% of the RAM available when Ray starts. It is not determined by the size of /dev/shm.

Hi @sangcho, thank you for your reply! My system’s memory is 792 GB, and I have set ‘object_store_memory’ in ray.init() to 350 GB. It still uses more than 50% of /dev/shm, which is 734 GB. The batch size I am using is 1024 now. I also see that it does not use all the CPUs.

That means that if your workload requires more than 350 GB of object store memory, it will start spilling (/dev/shm doesn’t matter here). Are you saying it starts spilling objects even though your object store usage is below 350 GB? (You can also see spilling info with ray memory --stats-only, and see which objects are leaking with ray summary objects.)

In both cases below, I see that the object store is not fully occupied (more than 40% remains unused).

  • If I explicitly specify object_store_memory=350, it does not spill but uses /dev/shm.

  • If I do not explicitly specify object_store_memory, it spills.

My questions are:

  1. Why does it use /dev/shm when the object store is not fully utilized?
  2. Do I need to explicitly specify the object store memory to avoid spilling?

Thank you very much @sangcho

  1. Why does it use /dev/shm when the object store is not fully utilized?

/dev/shm is shared memory. The Ray object store is shared memory storage, which is why it uses /dev/shm. So the Ray object store always writes data to /dev/shm, and when the object store reaches its limit, it starts spilling to files on disk.
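To see how much of /dev/shm is actually in use while a job runs, something like the following might help. This is only a rough sketch using the Python standard library: the helper name shm_usage is made up for illustration, and it reports usage of the whole mount, not only what Ray has written there.

```python
import os
import shutil


def shm_usage(path: str = "/dev/shm") -> dict:
    """Return total/used/free space (in GB) and percent used for a mount point.

    Hypothetical helper for illustration; it measures the whole filesystem
    at `path`, so other programs using shared memory are counted too.
    """
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_gb": usage.total / 1e9,
        "used_gb": usage.used / 1e9,
        "free_gb": usage.free / 1e9,
        "used_pct": used_pct,
    }


# Only query /dev/shm if it exists (it normally does on Linux).
if os.path.isdir("/dev/shm"):
    print(shm_usage())
```

Seeing /dev/shm fill up is expected with Ray, since the object store lives there. It does not by itself mean that objects are being spilled.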

  2. Do I need to explicitly specify the object store memory to avoid spilling?

It depends on your workload. You can either (1) reduce the concurrent memory usage of your workload (this is how Ray Data works: it reduces object store memory usage by using streaming execution by default), or (2) increase the object store memory capacity. Note that in the second case, you will have less memory for your workers.
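For option 2, a minimal sketch might look like the one below. The object_store_memory argument of ray.init() is given in bytes, so a value such as 350 would be interpreted as 350 bytes, not 350 GB. The helper names gib_to_bytes and start_ray_with_object_store are made up for illustration, and the sketch assumes that "350 GB" is meant as GiB.

```python
GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


def start_ray_with_object_store(gib: float):
    """Start Ray with an explicit object store size (hypothetical wrapper).

    Ray is imported lazily so this sketch can be loaded without Ray installed.
    """
    import ray

    # object_store_memory is specified in bytes.
    return ray.init(object_store_memory=gib_to_bytes(gib))


print(gib_to_bytes(350))  # the byte value to pass for a 350 GiB object store
```

Calling start_ray_with_object_store(350) would then request a 350 GiB object store, which leaves correspondingly less memory for the workers.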

One more piece of context: by default, Ray allocates 30% of the memory available at startup to the object store (using /dev/shm). So you can think of it this way: the workers and the core system processes share 70% of the RAM, and the remaining 30% is used for the object store (which stores the output of Ray tasks when it is larger than 100 KB).
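As a rough worked example of that split, assuming all 792 GB from the question counts as available memory when Ray starts: the sketch below only applies the 30% figure quoted in this thread, and it ignores any caps or platform-specific adjustments Ray may apply. The function name split_memory is made up for illustration.

```python
# Fraction quoted in this thread; treat it as an assumption, not a guarantee.
OBJECT_STORE_FRACTION = 0.30


def split_memory(available_gb: float, fraction: float = OBJECT_STORE_FRACTION) -> tuple:
    """Return (object_store_gb, remaining_gb) for a given amount of memory."""
    object_store_gb = available_gb * fraction
    remaining_gb = available_gb - object_store_gb
    return object_store_gb, remaining_gb


object_store_gb, remaining_gb = split_memory(792)
print(f"object store: {object_store_gb:.1f} GB")
print(f"workers + system processes: {remaining_gb:.1f} GB")
```

Under these assumptions the default object store comes out at roughly 238 GB, well below the 350 GB set explicitly, which may be part of why the default configuration spills while the explicit one does not.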