Memory monitor killing worker processes when executing external logic blocks

Hi everyone,
I’m currently working on a distributed processing task using Ray Core to scale some data-heavy operations. I’ve been trying to integrate and test some external logic structures to see how the raylet handles high-frequency task submission, but I’m running into a consistent issue where the Ray memory monitor starts killing my worker processes.

The main problem seems to be an unexpected growth in the object store when I attempt to pass these delta scripts as arguments to remote tasks. Even when I use ray.put() to store the logic blocks once and pass the object reference, I’m seeing the node memory usage spike until it hits the RAY_memory_usage_threshold, at which point the task is killed and retried with exponential backoff. It’s particularly frustrating because the same logic runs with a much smaller footprint in a standalone Python environment.
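Roughly, my submission pattern looks like the sketch below; `run_logic`, `logic_block`, and `data_chunks` are simplified stand-ins for the real external scripts and data:

```python
import ray

ray.init()

# Simplified stand-ins for the real external logic and input data.
logic_block = {"source": "def transform(x): return x * 2"}
data_chunks = [list(range(1_000)) for _ in range(100)]

@ray.remote
def run_logic(logic, chunk):
    # Placeholder: execute the external logic against one chunk.
    return len(chunk)

# Put the logic block into the object store once and reuse the ObjectRef,
# so it is serialized a single time rather than once per task.
logic_ref = ray.put(logic_block)

futures = [run_logic.remote(logic_ref, chunk) for chunk in data_chunks]
results = ray.get(futures)
```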

Has anyone here dealt with memory leaks or high serialization overhead when using Ray to execute third-party script logic? I’m wondering if I should be manually calling gc.collect() within my actors or if there’s a better way to configure the object store to evict these specific types of objects more aggressively. I’m also seeing some “jitter” in the dashboard metrics where the worker memory doesn’t seem to drop even after the task has finished. If anyone has tips on optimizing the resource requirements for external executors within a Ray cluster, I’d really appreciate the help!

Ray’s memory monitor kills worker processes when a node’s memory usage (worker heap plus object store) exceeds RAY_memory_usage_threshold (default 0.95). This is often triggered by high object store usage when large or complex objects, such as external logic blocks, are passed to remote tasks: even with ray.put(), the payload incurs serialization overhead and stays pinned in the store for as long as anything holds a reference to it. That pinning is a common reason memory does not drop after tasks finish. Ray only frees an object’s store memory once all of its ObjectRefs have gone out of scope, and Python’s garbage collector sometimes needs a nudge (for example, when refs are caught in reference cycles), so manual gc.collect() calls inside actors or after dereferencing objects can help. However, this is a workaround, not a root fix.

Also note that the object store has no per-object eviction knobs: objects are freed when their references are released, and object spilling only kicks in when the store itself is at capacity; it does nothing for heap memory pressure.
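As a sketch of that workaround, you can drop references and trigger a collection inside the actor after each unit of work. The actor and method names here are illustrative, not taken from your code:

```python
import gc
import ray

@ray.remote
class LogicExecutor:
    def execute(self, logic, chunk):
        result = self._apply(logic, chunk)
        # Drop local references so the deserialized payload becomes
        # unreachable, then force a collection so the backing memory is
        # released promptly instead of waiting for a later GC pass.
        del logic, chunk
        gc.collect()
        return result

    def _apply(self, logic, chunk):
        # Placeholder for the actual external logic execution.
        return len(chunk)
```

The same pattern applies on the driver: deleting an ObjectRef (del ref) and calling gc.collect() after you are done with the result releases the driver-side reference as well.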

To reduce the pressure, consider:

1. Reducing the size and complexity of the objects passed to tasks (e.g., avoid shipping large code blocks, or use a lighter-weight serialization format).
2. Batching or throttling task submissions so only a bounded number are in flight at once, which avoids memory spikes (see the sketch below).
3. Requesting more num_cpus per task so fewer tasks run concurrently on each node.
4. Calling gc.collect() after dereferencing large objects.

For persistent issues, profile memory with a tool like memray, and compare RSS against SHR to distinguish worker heap usage from object store (shared memory) usage. Ray’s memory debugging guide and OOM prevention docs cover these steps in more detail.
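Here is a minimal sketch of points (2) and (3): cap the number of in-flight tasks with ray.wait, and over-request CPUs per task to lower per-node concurrency. The values max_in_flight=16 and num_cpus=2 are arbitrary and would need tuning for your cluster:

```python
import ray

@ray.remote(num_cpus=2)  # over-request CPUs to cap how many tasks run per node
def run_logic(logic, chunk):
    # Placeholder for executing the external logic against one chunk.
    return len(chunk)

def submit_throttled(logic_ref, chunks, max_in_flight=16):
    """Submit tasks while keeping at most max_in_flight outstanding."""
    in_flight, results = [], []
    for chunk in chunks:
        if len(in_flight) >= max_in_flight:
            # Wait for one task to finish before submitting the next,
            # so object store and heap growth stay bounded.
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(run_logic.remote(logic_ref, chunk))
    results.extend(ray.get(in_flight))
    return results
```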
