Raylet space running out, despite having plenty of RAM

Hi, I’m new to ray, but loving it so far!
I’m a bit confused by the following issue. I’m running 3 jobs in parallel (i.e. 3 ray.remote() calls simultaneously), and getting the following warning:

(raylet) [2023-03-24 11:06:38,311 E 821713 821740] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-24_11-03-06_248168_820779 is over 95% full, available space: 5726846976; capacity: 1006449913856. Object creation will fail if spilling is required.

When these warnings arise, I still have ~120GB of RAM left unused, which is more than enough for the jobs I’m running.

I’m running on an ubuntu machine with 80 cpu cores available, and defining the remote jobs with
ray.remote(num_cpus=24)

I’m assuming it’s because I’m not defining the RAM allocation for each remote call? Is Ray trying to write data to my home directory? How might I go about fixing this? Any insight would be greatly appreciated!

Update: I am now getting this warning immediately after running ray.init(), before executing any actual work. So it must be something about how I’m initializing Ray, but any help would still be greatly appreciated.

@CSmyth Can you free up any crud you might have in /tmp/? Ray wants to spill objects onto disk when the object store is full. You can specify in ray.init() where you want objects to be spilled to.
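As a minimal sketch of redirecting spilling, assuming a single-node setup (the `/mnt/data/ray_spill` path is just a placeholder for a directory on a disk with plenty of free space):

```python
import json

# Hypothetical spill directory -- substitute a path on a large disk.
SPILL_DIR = "/mnt/data/ray_spill"

# Ray expects the spilling config as a JSON string.
spilling_config = json.dumps(
    {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
)

if __name__ == "__main__":
    import ray

    # Spill objects to SPILL_DIR instead of the default under /tmp/ray.
    ray.init(_system_config={"object_spilling_config": spilling_config})
```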

See if that helps.

Hi @Jules_Damji thanks for the input. It’s good to know I can explicitly set the filepath for spill-over. I will do that in the future. I restarted the server and cleared crud from the /tmp directory.

However, how can I keep object store completely in RAM, so that it doesn’t write objects to disk? Or is that the default?

One of the core benefits of Ray is its shared object store. It’s based on Apache Plasma and implemented in shared memory.

Small objects that a task or an actor creates (<= 100 KB) are stored in in-process memory. Anything larger is put into the object store’s shared memory. When the object store reaches capacity, it needs to spill its least-recently-used objects onto disk.

Any tasks/actors on the same node can access data in shared memory via zero-copy: worker processes on the same node map the same memory and read it via pointers, so there is no need to copy between processes.
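As a sketch of this on a single machine (array shape and size are arbitrary; it just needs to be well over the small-object threshold so it lands in the shared-memory store):

```python
import numpy as np

# ~80 MB array -- far above the ~100 KB small-object threshold, so ray.put()
# places it in the shared-memory object store, not in-process memory.
arr = np.zeros((10_000, 1_000), dtype=np.float64)

def main():
    import ray

    ray.init()
    ref = ray.put(arr)  # stored once in this node's object store

    @ray.remote
    def column_mean(a):
        # On the same node, `a` is a read-only view backed by shared
        # memory -- no per-worker copy is made.
        return float(a.mean())

    print(ray.get(column_mean.remote(ref)))
    ray.shutdown()

if __name__ == "__main__":
    main()
```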

If a task/actor on node B needs data stored on node A, the raylet/object store thread copies the object over into node B’s object store.

Here is an object store tutorial you can try on your machine to understand the concepts behind Ray’s object store, what it’s used for, and its benefits.

You can specify how much object store memory to reserve in your ray.init() call.
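For example (a sketch; the 2 GiB figure is arbitrary — size it for your workload and available RAM):

```python
# The object_store_memory value is in bytes; 2 GiB here is an example size.
OBJECT_STORE_BYTES = 2 * 1024**3

if __name__ == "__main__":
    import ray

    ray.init(object_store_memory=OBJECT_STORE_BYTES)
```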


Wonderful, thank you so much!

You’re welcome. Marking this as resolved.

@Jules_Damji Sorry, one last question. Shared memory is typically stored in RAM, correct?

Shared memory on Linux is stored on tmpfs, which is mounted at /dev/shm.
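Since tmpfs is RAM-backed, you can check how much space that mount has (and whether the object store fits in it) with:

```shell
# Show the size and current usage of the RAM-backed tmpfs mount
# that Ray's object store lives on.
df -h /dev/shm
```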


To be clear, objects are created in the plasma store, and it uses /dev/shm. The plasma store memory is set to “20% of the available memory when ray cluster is created (e.g., ray start)”.

You can increase the object store memory size using the --object-store-memory flag.
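For example, when starting the cluster from the command line (the flag takes bytes; 8 GiB here is an arbitrary size):

```shell
# Start a head node with an explicit 8 GiB object store (value in bytes).
ray start --head --object-store-memory=8589934592
```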
