Memory monitor killing worker processes when executing external logic blocks

Ray's memory monitor kills workers when a node's total memory usage (heap plus object store) exceeds `RAY_memory_usage_threshold` (default 0.95). High object store usage can trigger this, especially when passing large or complex objects (such as external logic blocks) to remote tasks, even via `ray.put()`, because of serialization overhead and object pinning. A common cause is objects that stay referenced or are not garbage collected promptly, so memory is not released even after tasks finish. Manual `gc.collect()` calls inside actors, or after dereferencing objects, can help: Ray relies on Python's garbage collector to drop references before the object store can free the corresponding memory. This is a workaround, though, not a root fix. Note also that the object store does not offer fine-grained eviction of specific objects; it evicts unreferenced objects (roughly in LRU order) when the store fills, and object spilling kicks in only when the store is at capacity, not under heap memory pressure.

To optimize, consider:

1. Reduce the size and complexity of objects passed to tasks (e.g., avoid shipping large code blocks, or use a lighter-weight serialization format).
2. Batch or throttle task submissions to avoid memory spikes.
3. Increase `num_cpus` per task so fewer tasks run in parallel on each node.
4. Call `gc.collect()` after dereferencing large objects.

For persistent memory issues, profile with a tool like memray and compare RSS vs. SHR to distinguish heap usage from object store usage. For more detail, see Ray's memory debugging guide and its OOM prevention documentation. Would you like a step-by-step breakdown or code examples for any of these strategies?

Hint: Mention @RunLLM in the post for followups.