Question about confusing object spilling mechanism

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.47.1
  • Python version: 3.12.7+gc
  • OS: Ubuntu 24.04.1 LTS
  • Cloud/Infrastructure: kubernetes
  • Other libs/tools (if relevant): none

3. What happened vs. what you expected:

  • Expected: object spilling happens after object memory exceeds limit
  • Actual:

I have a cluster of 128 nodes, each with 1.8 TiB of memory. I set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1, then set the object store memory limit to 1.5 TiB.

The start command is: ray start --num-cpus=128 --num-gpus=8 --head --temp-dir=/tmp/ray --port=6379 --system-config='{"object_spilling_config": "{\"type\": \"filesystem\", \"params\": {\"directory_path\": \"/nvme/tmp/ray\"}}"}' --object-store-memory=1649267441664
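(As a side note for anyone copying this: the object_spilling_config value is a JSON string embedded inside JSON, which makes the shell quoting easy to get wrong. A minimal sketch, using only the Python standard library, that builds the --system-config value with json.dumps so the nested escaping is handled automatically:)

```python
import json

# Inner config: the spilling backend and its parameters.
spilling_config = {
    "type": "filesystem",
    "params": {"directory_path": "/nvme/tmp/ray"},
}

# object_spilling_config must itself be a JSON *string*, so serialize twice:
# once for the inner config, once for the outer system config.
system_config = json.dumps({"object_spilling_config": json.dumps(spilling_config)})

print(system_config)
# Paste the printed value inside --system-config='...' when running ray start.
```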

The total object_store_memory capacity is as expected, shown below by ray status, since 1.5 TiB * 128 = 192 TiB:

Total Usage:
1025.0/16384.0 CPU (1024.0 used of 1024.0 reserved in placement groups)
1024.0/1024.0 GPU (1024.0 used of 1024.0 reserved in placement groups)
0B/37.95TiB memory
79.01GiB/192.00TiB object_store_memory
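(The capacity arithmetic above checks out: the byte value passed to --object-store-memory is exactly 1.5 TiB, and 128 nodes give 192 TiB. A quick sanity check:)

```python
TiB = 2**40  # binary tebibyte, the unit ray status reports

per_node_limit = int(1.5 * TiB)  # value passed via --object-store-memory
num_nodes = 128

print(per_node_limit)                     # 1649267441664
print(per_node_limit * num_nodes / TiB)   # 192.0 (TiB across the cluster)
```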

When running the workload, I found that worker-0, where I started the Ray head node, triggers object spilling when memory usage is only about ~300 GB:

# free -h
               total        used        free      shared  buff/cache   available
Mem:           1.8Ti       338Gi       1.0Ti       238Gi       688Gi       1.5Ti
Swap:             0B          0B          0B

and

# du -sh /nvme/tmp/ray/
812G    /nvme/tmp/ray/

I set a large object store memory limit to try to avoid object spilling, because the total amount of data collected by the head node is only about 1.1 TB. But in reality I cannot avoid it.

So is this expected? Thanks!

Hi xial,

Welcome to the Ray community! Just wondering, what do you see if you run ray memory to debug the memory issues? You can read more about it here: Memory Management — Ray 2.48.0