Raylet space running out, despite having plenty of RAM

Hi, I’m new to ray, but loving it so far!
I’m a bit confused by the following issue. I’m running 3 jobs in parallel (i.e. 3 ray.remote() calls simultaneously), and getting the following warning:

(raylet) [2023-03-24 11:06:38,311 E 821713 821740] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-24_11-03-06_248168_820779 is over 95% full, available space: 5726846976; capacity: 1006449913856. Object creation will fail if spilling is required.

When these warnings arise, I still have ~120GB of RAM left unused, which is more than enough for the jobs I’m running.

I’m running on an ubuntu machine with 80 cpu cores available, and defining the remote jobs with
ray.remote(num_cpus=24)

I’m assuming it’s because I’m not defining the RAM allocation for each remote call? Is Ray trying to write data to my home directory? How might I go about fixing this? Any insight would be greatly appreciated!

Update: I am now getting this warning immediately after running ray.init(), before executing any actual work. So it must be something about how I’m initializing Ray, but any help would still be greatly appreciated.

@CSmyth Can you free up any crud you might have in /tmp/? Ray wants to spill objects onto disk when the object store is full. You can specify in ray.init() where you want objects to be spilled to.
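As a minimal sketch of redirecting spilling, assuming a single-node setup (the `/mnt/data/ray_spill` path is just a placeholder for a directory on a disk with plenty of free space):

```python
import json

# Hypothetical spill directory -- substitute a path on a large disk.
SPILL_DIR = "/mnt/data/ray_spill"

# Ray expects the spilling config as a JSON string.
spilling_config = json.dumps(
    {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
)

if __name__ == "__main__":
    import ray

    # Spill objects to SPILL_DIR instead of the default under /tmp/ray.
    ray.init(_system_config={"object_spilling_config": spilling_config})
```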

See if that helps.

Hi @Jules_Damji thanks for the input. It’s good to know I can explicitly set the filepath for spill-over. I will do that in the future. I restarted the server and cleared crud from the /tmp directory.

However, how can I keep object store completely in RAM, so that it doesn’t write objects to disk? Or is that the default?

One of the core benefits of Ray is its shared object store. It’s based on Apache Plasma and implemented in shared memory.

Small objects that a task or an actor creates (<= 100 KB) are stored in in-process memory. Anything larger is put into the object store’s shared memory. When the object store reaches capacity, it needs to spill its least-recently-used objects onto disk.

Any tasks/actors on the same node can access data in shared memory via zero-copy: worker processes on the same node map the same memory and read it via pointers, so there is no need to copy between processes.
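As a sketch of this on a single machine (array shape and size are arbitrary; it just needs to be well over the small-object threshold so it lands in the shared-memory store):

```python
import numpy as np

# ~80 MB array -- far above the ~100 KB small-object threshold, so ray.put()
# places it in the shared-memory object store, not in-process memory.
arr = np.zeros((10_000, 1_000), dtype=np.float64)

def main():
    import ray

    ray.init()
    ref = ray.put(arr)  # stored once in this node's object store

    @ray.remote
    def column_mean(a):
        # On the same node, `a` is a read-only view backed by shared
        # memory -- no per-worker copy is made.
        return float(a.mean())

    print(ray.get(column_mean.remote(ref)))
    ray.shutdown()

if __name__ == "__main__":
    main()
```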

If a task/actor on node B needs data stored on node A, the raylet/object store thread copies the object over into node B’s object store.

Here is an object store tutorial you can try on your machine to understand the concepts behind Ray’s object store, what it’s used for, and its benefits.

You can specify how much object store memory to reserve in your ray.init() call.
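For example (a sketch; the 2 GiB figure is arbitrary — size it for your workload and available RAM):

```python
# The object_store_memory value is in bytes; 2 GiB here is an example size.
OBJECT_STORE_BYTES = 2 * 1024**3

if __name__ == "__main__":
    import ray

    ray.init(object_store_memory=OBJECT_STORE_BYTES)
```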


Wonderful, thank you so much!

You’re welcome. Marking this as resolved.

@Jules_Damji Sorry, one last question. Shared memory is typically stored in RAM, correct?

Shared memory on Linux is stored on tmpfs, which is mounted at /dev/shm.
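Since tmpfs is RAM-backed, you can check how much space that mount has (and whether the object store fits in it) with:

```shell
# Show the size and current usage of the RAM-backed tmpfs mount
# that Ray's object store lives on.
df -h /dev/shm
```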


To be clear, objects are created in the plasma store, and it uses /dev/shm. The plasma store memory is set to “20% of the available memory when ray cluster is created (e.g., ray start)”.

You can increase the object store memory size using the --object-store-memory flag.
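For example, when starting the cluster from the command line (the flag takes bytes; 8 GiB here is an arbitrary size):

```shell
# Start a head node with an explicit 8 GiB object store (value in bytes).
ray start --head --object-store-memory=8589934592
```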
