Raylet's excessive anon-rss usage led to the worker pod being killed by the system OOM killer

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.40.0
  • Python version: 3.10
  • OS: Ubuntu

3. What happened vs. what you expected:
We ran into a problem while using Ray. A job suddenly failed, and the corresponding node turned out to have gone OOM. Because the memory monitor had been turned off, we only identified the cause through the machine's /var/log/messages, which showed that the worker processes on that node had been killed by the kernel oom-killer.
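For context, a minimal sketch of how the memory monitor was presumably disabled (the exact method used is an assumption; what matters is that Ray's own worker-killing monitor was off, leaving OOM handling to the kernel):

```python
# Sketch (assumption about how the monitor was turned off): setting
# RAY_memory_monitor_refresh_ms=0 before the raylet starts disables Ray's
# memory monitor, so only the kernel oom-killer handles memory pressure.
import os

os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # must be set before the raylet starts
                                                   # (i.e. before `ray start` on each node,
                                                   # or before ray.init() for a local cluster)
import ray

ray.init()
```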

Looking at the logs around that time, the oom-killer had been firing repeatedly. It first killed the user's job processes, such as readParquet task workers. Because those processes were using the object store, killing them did not release the shmem-rss, so the oom-killer kept going until it finally killed the raylet.
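A hypothetical sketch of the kind of workload described above (the dataset path and the exact pipeline are assumptions, not the user's actual code); the point is that a Ray Data read_parquet job can pin many blocks in the object store faster than they can be spilled:

```python
# Hypothetical reproduction sketch: a Ray Data read_parquet pipeline that
# fills the object store faster than spilling can drain it.
import ray

ray.init()

# Assumed input path; any sufficiently large Parquet dataset would do.
ds = ray.data.read_parquet("s3://some-bucket/large-dataset/")

# Materializing keeps all blocks pinned in the object store at once, so
# shared-memory usage grows while spilling lags behind.
ds = ds.materialize()
```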

From the log, the raylet's anon-rss was as high as 37 GB, and the raylet on another node was also as high as 32 GB. Since our pod is only allocated 47 GB of memory and the default shared-memory (object-store) fraction is 30%, the object store should only occupy about 14 GB.
(attached screenshots: oom-killer log entries showing the raylet's anon-rss on the two nodes)
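A quick back-of-the-envelope check of the expected object-store footprint, using the 47 GB pod limit and the 30% default fraction mentioned above:

```python
# Expected object-store size under the default 30% fraction,
# using the 47 GiB pod memory limit from this report.
pod_memory_gib = 47
object_store_fraction = 0.30                      # Ray's default object-store fraction
expected_shm_gib = pod_memory_gib * object_store_fraction
print(f"expected object store: ~{expected_shm_gib:.1f} GiB")  # ~14.1 GiB
```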

These jobs share two characteristics:

1. They use the object store heavily and trigger spilling, but the spill rate is slower than the growth rate, so the raylet log briefly shows shared-memory usage above 14 GB (e.g. 17-18 GB).
2. The raylets on these problematic nodes all consume over 30 GB of anon-rss, while the user's tasks or actors each only consume 1-3 GB of memory.

The end result is that the node goes OOM. Why does the raylet occupy such a large amount of anon-rss?
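For anyone trying to reproduce the measurement, here is a sketch of how the raylet's anonymous memory can be separated from the object-store shared memory it maps, by reading the kernel's /proc/&lt;pid&gt;/status fields (the PID lookup by process name is an assumption about how the raylet is located on the node):

```python
# Sketch for inspecting the raylet's memory breakdown on a node:
# RssAnon is the raylet's own anonymous memory (the 30+ GB reported above),
# while RssShmem covers the /dev/shm object-store pages it has mapped.
import subprocess

def rss_breakdown(pid: int) -> dict:
    """Parse VmRSS/RssAnon/RssFile/RssShmem (in kB) from /proc/<pid>/status."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("VmRSS", "RssAnon", "RssFile", "RssShmem"):
                fields[key] = int(rest.split()[0])  # value reported in kB
    return fields

# PID lookup by name is an assumption; adjust to however the raylet is found.
raylet_pid = int(subprocess.check_output(["pgrep", "-o", "raylet"]).split()[0])
print(rss_breakdown(raylet_pid))
```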