Raylet.err - Basic question

How severely does this issue affect your experience of using Ray?

  • Normal: Trying to understand the Ray components.

Hi there, one of my tasks died and I was looking into the raylet.err file and saw the following:

[2023-03-16 11:19:12,486 I 27 27] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-03-16 11:19:12,487 I 27 27] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 50.4927GB of memory.
[2023-03-16 11:19:12,487 I 27 27] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2023-03-16 11:19:12,487 W 27 27] (raylet) store_runner.cc:65: System memory request exceeds memory available in /dev/shm. The request is for 50492709273 bytes, and the amount available is 48318382080 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
[2023-03-16 11:19:12,487 I 27 92] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(48318382088, /dev/shm/plasmaXXXXXX)
[2023-03-16 11:19:12,487 I 27 92] (raylet) store.cc:554: ========== Plasma store: =================
Current usage: 0 / 48.3184 GB

I ran df -kh inside the container and saw the following:

(base) ray@ed-car-raycluster-kuberay-worker-r-2xlarge-4-spot-w9rds:~/app$ df -kh
Filesystem      Size  Used Avail Use% Mounted on
overlay         500G   78G  423G  16% /
tmpfs            64M     0   64M   0% /dev
tmpfs            30G     0   30G   0% /sys/fs/cgroup
/dev/xvda1      500G   78G  423G  16% /tmp/ray
tmpfs            50G   32M   50G   1% /dev/shm
tmpfs            59G   12K   59G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            59G  4.0K   59G   1% /run/secrets/eks.amazonaws.com/serviceaccount
tmpfs            30G     0   30G   0% /proc/acpi
tmpfs            30G     0   30G   0% /proc/scsi
tmpfs            30G     0   30G   0% /sys/firmware

The underlying instance type is r4.2xlarge with 61GB of RAM; EKS shows 59.86GB of RAM available.

My questions are:

  1. How is the raylet assigned 50GB of RAM?
  2. Can the warning lead to any issue that causes a task to fail?
  3. If yes, what can one do about it?

@ckapoor can you share the error message of your failed task? The warning alone won't cause a task to fail, but if the raylet is configured to use too much memory for the object store, the task might fail due to out-of-memory errors.

By default, the Ray object store memory is configured to be 30% of the system memory. In your case it seems you are configuring far more than that. May I know how you start the Ray cluster? You can manually override the Ray object store memory by adding --object-store-memory= (a size in bytes) to your ray start command.
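For example, if the cluster is started from Python rather than with ray start, the same override is available on ray.init (a minimal sketch; the 20 GB figure is purely illustrative, not a recommendation):

import ray

# Cap the plasma object store explicitly instead of letting a cluster
# template size it to nearly all of the pod's memory.
# The value is in bytes; 20 GB here is only an illustrative number.
ray.init(object_store_memory=20 * 1024**3)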

@Chen_Shen Our Ray clusters are started using a templated script. I dug deep into it, and it seems that the object store memory assigned is equal to the memory available on the Kubernetes pod, which aligns closely with what is shown above for /dev/shm.

Can you please elaborate on the following?

if the raylet is configured to use too much memory for the object store, the task might fail due to out-of-memory errors.

Does it hold true for Ray Datasets as well?

Yeah, I'd suggest adjusting that value to 30% of the physical memory.
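For instance, one rough way to derive that value for --object-store-memory is to take 30% of the physical memory reported by the OS (a sketch, assuming psutil is available in the environment):

import psutil

# Roughly 30% of the node's physical memory, in bytes, suitable for
# passing to: ray start --object-store-memory=<value>
object_store_bytes = int(psutil.virtual_memory().total * 0.30)
print(object_store_bytes)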

Can you please elaborate on the following?

if the raylet is configured to use too much memory for the object store, the task might fail due to out-of-memory errors.

During Ray task execution, both heap memory and object store memory are used. For example, if you have the following function:

import ray

@ray.remote
def foo(arg):
    # val is computed in heap memory, then serialized into the object
    # store when the task returns.
    val = _your_python_compute(arg)
    return val

The val generated by _your_python_compute is allocated in heap memory and then stored in the object store before the task returns.
As for the arg, depending on whether it is zero-copy serializable, it may or may not need to be deserialized into heap memory (you can follow the link mentioned above for more information).
Since heap memory usage is unavoidable, it's suggested to let the plasma store use only 30% of the available memory.
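As a rough illustration of the arg side (a minimal sketch, not from the thread: numpy arrays of primitive dtypes are the typical zero-copy case, since the worker can read them straight out of the shared-memory object store, while generic Python objects must be deserialized onto the worker's heap):

import numpy as np
import ray

ray.init()

@ray.remote
def mean_of(arr):
    # A large numpy array argument is read directly from the shared-memory
    # object store (zero-copy, read-only) rather than copied onto the heap.
    return float(arr.mean())

data = np.random.rand(10_000_000)  # ~80 MB; placed in the object store when passed
print(ray.get(mean_of.remote(data)))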