Raylet's excessive anon-rss usage led to the worker pod being killed by the system OOM killer

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.40.0
  • Python version: 3.10
  • OS: Ubuntu

3. What happened vs. what you expected:
We ran into a problem while using Ray. A job suddenly failed, and the corresponding node turned out to have gone OOM. Because the memory monitor had been turned off, we only identified the cause through the machine's /var/log/messages, which showed that the worker processes on that node had been killed by the kernel oom-killer.
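For context, a minimal sketch of how the memory monitor was presumably disabled (the exact method used is an assumption; what matters is that Ray's own worker-killing monitor was off, leaving OOM handling to the kernel):

```python
# Sketch (assumption about how the monitor was turned off): setting
# RAY_memory_monitor_refresh_ms=0 before the raylet starts disables Ray's
# memory monitor, so only the kernel oom-killer handles memory pressure.
import os

os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # must be set before the raylet starts
                                                   # (i.e. before `ray start` on each node,
                                                   # or before ray.init() for a local cluster)
import ray

ray.init()
```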

Looking at the logs around that time, the oom-killer had been firing repeatedly. It first killed the user's job processes, such as readParquet task workers. Because those processes were using the object store, killing them did not release the shmem-rss, so the oom-killer kept going until it finally killed the raylet.
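A hypothetical sketch of the kind of workload described above (the dataset path and the exact pipeline are assumptions, not the user's actual code); the point is that a Ray Data read_parquet job can pin many blocks in the object store faster than they can be spilled:

```python
# Hypothetical reproduction sketch: a Ray Data read_parquet pipeline that
# fills the object store faster than spilling can drain it.
import ray

ray.init()

# Assumed input path; any sufficiently large Parquet dataset would do.
ds = ray.data.read_parquet("s3://some-bucket/large-dataset/")

# Materializing keeps all blocks pinned in the object store at once, so
# shared-memory usage grows while spilling lags behind.
ds = ds.materialize()
```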

From the log, the raylet's anon-rss was as high as 37 GB, and the raylet on another node was also as high as 32 GB. Since our pod is only allocated 47 GB of memory and the default shared-memory (object-store) fraction is 30%, the object store should only occupy about 14 GB.
(attached screenshots: oom-killer log entries showing the raylet's anon-rss on the two nodes)
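A quick back-of-the-envelope check of the expected object-store footprint, using the 47 GB pod limit and the 30% default fraction mentioned above:

```python
# Expected object-store size under the default 30% fraction,
# using the 47 GiB pod memory limit from this report.
pod_memory_gib = 47
object_store_fraction = 0.30                      # Ray's default object-store fraction
expected_shm_gib = pod_memory_gib * object_store_fraction
print(f"expected object store: ~{expected_shm_gib:.1f} GiB")  # ~14.1 GiB
```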

These jobs share two characteristics:

1. They use the object store heavily and trigger spilling, but the spill rate is slower than the growth rate, so the raylet log briefly shows shared-memory usage above 14 GB (e.g. 17-18 GB).
2. The raylets on these problematic nodes all consume over 30 GB of anon-rss, while the user's tasks or actors each only consume 1-3 GB of memory.

The end result is that the node goes OOM. Why does the raylet occupy such a large amount of anon-rss?
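For anyone trying to reproduce the measurement, here is a sketch of how the raylet's anonymous memory can be separated from the object-store shared memory it maps, by reading the kernel's /proc/&lt;pid&gt;/status fields (the PID lookup by process name is an assumption about how the raylet is located on the node):

```python
# Sketch for inspecting the raylet's memory breakdown on a node:
# RssAnon is the raylet's own anonymous memory (the 30+ GB reported above),
# while RssShmem covers the /dev/shm object-store pages it has mapped.
import subprocess

def rss_breakdown(pid: int) -> dict:
    """Parse VmRSS/RssAnon/RssFile/RssShmem (in kB) from /proc/<pid>/status."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("VmRSS", "RssAnon", "RssFile", "RssShmem"):
                fields[key] = int(rest.split()[0])  # value reported in kB
    return fields

# PID lookup by name is an assumption; adjust to however the raylet is found.
raylet_pid = int(subprocess.check_output(["pgrep", "-o", "raylet"]).split()[0])
print(rss_breakdown(raylet_pid))
```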