How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello. Recently I was using Ray and Modin for some complex computations, and I ran into an undesirable memory issue. On the Ray dashboard I can see that Ray's memory usage goes up and is never released after I run a complex computation. Later computations may then fail because a worker runs out of memory and is killed.
I investigated this issue myself in two ways; neither helped. On a Ray node, using the `ray memory` CLI command, I did not find any objects using a lot of memory.
Then I checked the Ray dashboard and ran `htop` on one Ray node, following troubleshooting-out-of-memory-how-to-detect.
One node has 50+ workers. After the complex computation, every worker occupies several hundred MB or several GB of memory.
I then looked at the detailed memory information of one worker. Shared memory accounts for most of it; compared to the initial state right after a reboot, shared memory is much higher. The actual private memory the worker uses is rss - shared = 888 - 615 = 273 MB, which is not very much.
**After complex computation**

```
rss:    888.07 MB
vms:     64.55 GB
shared: 615.03 MB
text:     1.89 MB
lib:      0 B
data:     1.30 GB
dirty:    0 B
```

**After reboot**

```
rss:    148.10 MB
vms:     60.06 GB
shared:  67.58 MB
text:     1.89 MB
lib:      0 B
data:    547.86 MB
dirty:    0 B
```
So here are my questions:
1. How can I investigate this issue further?
2. Does shared memory stay this high even when nothing is being computed because Ray uses an LRU cache, so memory is not released even when variables are no longer referenced?
3. troubleshooting-out-of-memory-how-to-detect says: "However, they are not using 8 * SHR memory (there's only 1 copy in the shared memory)." However, the Ray dashboard seems to sum up the SHR of every worker, which leads to a very large reported memory consumption?
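To make question 3 concrete, here is a small arithmetic sketch using the numbers from this post, under the assumption (from the docs quoted above) that all 50+ workers map the same single object-store segment, so its ~615 MB exists only once physically:

```python
# Numbers taken from the memory dump above (one worker).
workers = 50
shared_mb = 615.03                 # SHR per worker: one shm copy mapped by all
private_mb = 888.07 - 615.03       # rss - shared per worker (~273 MB)

# What naively summing each worker's RSS (including SHR) reports:
naive_sum = workers * (private_mb + shared_mb)

# Physical usage if the shared segment is counted only once:
actual = workers * private_mb + shared_mb

print(f"naive sum of worker RSS: {naive_sum:.0f} MB")
print(f"counting SHR once:       {actual:.0f} MB")
```

If the dashboard aggregates per-worker RSS (or SHR) this way, it would overstate physical memory use by roughly `(workers - 1) * shared`, which would explain the large reported consumption.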
Thanks all!