1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.40.0
- Python version: 3.10
- OS: Ubuntu
3. What happened vs. what you expected:
I ran into a problem while using Ray. A job suddenly failed, and it turned out that the corresponding node had hit an OOM. Because the user had turned off the memory monitor, this was only identified through the machine's /var/log/messages, which showed that the worker processes on that node had been killed by the oom-killer.
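For reference, a minimal sketch of how the monitor was turned off and how the kill was located, assuming the standard `RAY_memory_monitor_refresh_ms` switch; the grep pattern is just an illustration:

```python
import os
import subprocess

# The memory monitor was disabled roughly like this; the env var must be set
# before the raylet starts (a refresh interval of 0 turns the monitor off).
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

import ray
ray.init()

# With the monitor off, the only trace of the OOM is the kernel log.
# Illustrative grep of the node's /var/log/messages for oom-killer activity:
out = subprocess.run(
    ["grep", "-iE", "oom-killer|Killed process", "/var/log/messages"],
    capture_output=True, text=True,
)
print(out.stdout)
```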
Looking at the logs around those timestamps, the oom-killer was being invoked repeatedly. It first killed the user's job processes, i.e. the workers running tasks such as read_parquet. Because the object store was in use, killing those processes did not release their shmem-rss, so the oom-killer kept firing until it finally killed the raylet.
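This is not our actual job, but a rough sketch of the workload shape that produces this pattern (sizes and counts are made up, chosen only so that object store allocation outpaces spilling):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def produce_block(i: int) -> bytes:
    # Each task returns a large object; results land in the plasma object store
    # and accumulate faster than spilling can drain them.
    return np.random.bytes(512 * 1024 * 1024)  # 512 MiB per block

# Hold many large objects alive at once so the store must spill under pressure.
refs = [produce_block.remote(i) for i in range(200)]
ray.get(refs[:10])
```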
From the logs, the raylet's anon-rss was as high as 37 GB, and the raylet on another node was as high as 32 GB. Our pod is allocated only 47 GB of memory, and with the default object store size of 30%, the object store should only occupy about 14 GB.
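For context, the 14 GB expectation is just 30% of the pod's 47 GB; a quick check, with the explicit object_store_memory cap shown only as an illustration of how it could be pinned (not our current configuration):

```python
import ray

# Ray sizes the object store at 30% of available memory by default.
pod_memory_bytes = 47 * 1024**3
expected_object_store = int(pod_memory_bytes * 0.30)
print(f"expected object store: {expected_object_store / 1024**3:.1f} GiB")  # ~14.1 GiB

# Pinning the cap explicitly would look like this (illustrative only):
ray.init(object_store_memory=expected_object_store)
```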
These workloads have two characteristics:
1. They use the object store heavily and trigger spilling, but spilling is slower than the allocation rate, so the raylet log briefly reports shared memory usage above 14 GB, e.g. 17-18 GB.
2. The raylets on the problematic nodes all consume 30+ GB of anon-rss, while the user's tasks/actors only consume 1-3 GB of memory each (the snippet below shows how we read these per-process numbers).

The end result is that the node hits OOM. Why does the raylet occupy such a high amount of anon-rss?
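For reference, this is roughly how the per-process numbers above were read; RssAnon/RssShmem come straight from /proc/<pid>/status on Linux, and the pgrep lookup is just one way to find the raylet PID:

```python
import subprocess

def rss_breakdown(pid: int) -> dict:
    """Read the RSS breakdown (anon vs. shmem) from /proc/<pid>/status (Linux only)."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("VmRSS", "RssAnon", "RssFile", "RssShmem"):
                fields[key] = value.strip()
    return fields

# Locate the raylet process on the node (any process lookup works).
raylet_pid = int(subprocess.check_output(["pgrep", "-o", "raylet"]).split()[0])
print(rss_breakdown(raylet_pid))
# On the problematic nodes this showed RssAnon in the 30+ GB range for the raylet,
# while each user task/actor worker stayed around 1-3 GB.
```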