The node has 2 TB of memory and 1008 GB of /dev/shm. I tried running Ray Data in a Docker container with "--shm-size=512gb".
Setting ray.init(object_store_memory=400 * 1024**3) only gives me a 200 GB object store.
It's a bit confusing.
Ray caps the object store at 200 GB by default, even if you request more via object_store_memory, unless you explicitly override the limit. The cap comes from the constant DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES (200 GB) in Ray's source code. To allocate more, set the environment variable RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES to the desired byte count before starting Ray, or change the constant in the source directly; increasing object_store_memory in ray.init() alone is not enough once the request exceeds the cap. See ray_constants.py and resource_spec.py.
Additionally, the actual object store can never be larger than the available shared memory (/dev/shm), so make sure your --shm-size is at least as large as the requested object store. For details on how Ray combines these limits into the final object store size, see resource_spec.py and services.py. Would you like a step-by-step guide to overriding the 200 GB cap?
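For example, a container launch that satisfies both constraints might look like the following. This is a hypothetical invocation: the image name and script are placeholders, and only --shm-size and the environment variable matter here.

```shell
# Hypothetical docker invocation; "my-ray-image" and "train.py" are placeholders.
# 429496729600 bytes = 400 * 1024**3, i.e. a 400 GiB object store request.
docker run \
  --shm-size=512gb \
  -e RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=429496729600 \
  my-ray-image \
  python train.py
```

Passing the variable with -e ensures every process inside the container, including the one that imports ray, inherits it.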
@RunLLM Neither env var works (DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=429496729600 nor RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=429496729600). Still getting 200 GB.
Setting the environment variable RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES should override the 200 GB cap, but it must be present in the environment before ray is imported by the Python process that launches Ray; setting it afterwards has no effect. Also make sure you use the correct variable name: RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES, not DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES. See ray_constants.py.
If you are still getting 200 GB, double-check that the variable is exported in the shell before running your script, e.g., export RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=429496729600, and then run your Python script from that same shell. Would you like more troubleshooting steps?
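A minimal session illustrating this (the value shown is 400 GiB in bytes):

```shell
# Export in the shell, then start Python from that same shell so the
# variable is inherited by the process that will import ray.
export RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=429496729600
python3 -c 'import os; print(os.environ["RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES"])'
```

If the second command does not print the value you exported, the Python process is not inheriting the variable, and Ray will never see it.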
Your usage is correct, but the cap can still apply if the variable is not read early enough, or if another limit, such as the /dev/shm size set by Docker's --shm-size, is lower. Ray caps the object store to the smallest of these limits regardless of the environment variable. See services.py and ray_constants.py.
If --shm-size is at least 400 GB and the variable was set before the Python process started, yet you still see 200 GB, it may be a hard-coded cap or a bug. Would you like a step-by-step debug checklist?
Set the Variable Before Python Starts: Export RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES before launching Python, as Ray reads the variable at import time (ray_constants.py).
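If you cannot control the shell, a sketch of the programmatic equivalent is to set the variable at the very top of the script, before ray is imported (the 400 GiB value is just this thread's example):

```python
import os

# Must happen before `import ray`: ray_constants reads the variable
# at import time, so setting it afterwards has no effect.
os.environ["RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES"] = str(400 * 1024**3)
print(os.environ["RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES"])  # 429496729600

# Only now is it safe to import and start Ray:
# import ray
# ray.init(object_store_memory=400 * 1024**3)
```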
Check /dev/shm Size: Confirm your Docker container or host has /dev/shm at least as large as your requested object store (e.g., df -h /dev/shm). If not, increase with --shm-size=400gb (services.py).
Check Ray Version: Make sure you are using Ray >=2.37.0, as older versions may not respect the environment variable (resource_spec.py).
Check for Warnings/Logs: Look for log messages about object store memory capping or warnings about /dev/shm in your Ray logs or stdout (services.py).
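One way to search the session logs, assuming the default Ray temp dir (/tmp/ray/session_latest is a symlink Ray maintains to the current session; adjust the path if you pass --temp-dir):

```shell
# Search all logs of the current Ray session for object-store messages.
grep -ri "object store" /tmp/ray/session_latest/logs/ | head
```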
Verify in Python: Print the value inside your script to confirm Ray sees the variable:
import os
print(os.environ.get("RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES"))
Check Actual Allocation: After ray.init(), check the reported object store memory:
import ray
ray.init(object_store_memory=400 * 1024**3)
# Look for the "object_store_memory" entry, reported in bytes.
print(ray.cluster_resources())
Check for Other Limits: Ray may still cap the object store to the minimum of the environment variable, /dev/shm, and internal defaults (resource_spec.py).
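A simplified sketch of that combination, assuming the cap logic is roughly "take the smallest of the three limits" (an illustration only, not Ray's exact resource_spec.py code):

```python
GiB = 1024**3

def effective_object_store(requested: int, env_cap: int, shm_bytes: int) -> int:
    """Illustrative only: the final store size cannot exceed either
    the configured cap or the available /dev/shm."""
    return min(requested, env_cap, shm_bytes)

# With the default 200 GB (200 * 10**9 bytes) cap, a 400 GiB request is clamped:
print(effective_object_store(400 * GiB, 200 * 10**9, 1008 * GiB) // GiB)  # 186

# After raising the cap via the env var, neither the cap nor a 512 GiB
# /dev/shm (from --shm-size=512gb) constrains the 400 GiB request:
print(effective_object_store(400 * GiB, 400 * GiB, 512 * GiB) // GiB)  # 400
```

This also explains why a reported ~186 GiB can appear as "200G" in round numbers: the default cap is 200 * 10**9 bytes, not 200 GiB.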
If all steps check out and the store still caps at 200 GB, it may be a hard-coded limit or a bug in your Ray version.