1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.
2. Environment:
- Ray version: 2.47.1
- Python version: 3.12.7+gc
- OS: Ubuntu 24.04.1 LTS
- Cloud/Infrastructure: Kubernetes
- Other libs/tools (if relevant): none
3. What happened vs. what you expected:
- Expected: object spilling is triggered only after object store memory usage exceeds the configured limit
- Actual:
I have a cluster of 128 nodes, each with 1.8 TiB of memory. I set RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE=1 and set the object store memory limit to 1.5 TiB per node.
The start command is:
ray start --num-cpus=128 --num-gpus=8 --head --temp-dir=/tmp/ray --port=6379 --system-config='{"object_spilling_config": "{\"type\": \"filesystem\", \"params\": {\"directory_path\": \"/nvme/tmp/ray\"}}"}' --object-store-memory=1649267441664
The total object_store_memory capacity reported by ray status is as expected, since 1.5 TiB × 128 nodes = 192 TiB:
Total Usage:
1025.0/16384.0 CPU (1024.0 used of 1024.0 reserved in placement groups)
1024.0/1024.0 GPU (1024.0 used of 1024.0 reserved in placement groups)
0B/37.95TiB memory
79.01GiB/192.00TiB object_store_memory
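As a sanity check on the capacity math (using the 1649267441664-byte per-node value passed to --object-store-memory and the 128-node count from above):

```python
# Sanity check: per-node object store limit times node count
# should match the 192.00TiB total reported by `ray status`.
PER_NODE_BYTES = 1_649_267_441_664  # value passed to --object-store-memory
NUM_NODES = 128

per_node_tib = PER_NODE_BYTES / 2**40
total_tib = per_node_tib * NUM_NODES

print(f"per node: {per_node_tib:.2f} TiB")  # 1.50 TiB
print(f"cluster:  {total_tib:.2f} TiB")     # 192.00 TiB
```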
When running the workload, I found that worker-0, the node where I started the Ray head, triggers object spilling when memory usage is only about ~300 GiB:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           1.8Ti       338Gi       1.0Ti       238Gi       688Gi       1.5Ti
Swap:             0B          0B          0B
and
# du -sh /nvme/tmp/ray/
812G /nvme/tmp/ray/
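Putting the numbers above together (the used and spilled figures come from the free and du output; the per-node limit is the --object-store-memory value):

```python
# Rough ratio: spilling had already started while this node's memory
# usage was far below the configured 1.5 TiB per-node limit.
LIMIT_GIB = 1_649_267_441_664 / 2**30  # per-node limit, ~1536 GiB
USED_GIB = 338                          # "used" column from `free -h`
SPILLED_GIB = 812                       # size of /nvme/tmp/ray from `du -sh`

print(f"usage when spilling: {USED_GIB / LIMIT_GIB:.0%} of the limit")
print(f"already spilled:     {SPILLED_GIB} GiB")
```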
I set a large object store memory limit to try to avoid object spilling, because the total amount of data collected by the head node is only about 1.1 TB. But in reality I cannot avoid it.
So is this expected? Thanks!