Why does Ray leak memory after complex tasks with Modin?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello. Recently I was using Ray and Modin for some complex computation, and I ran into an unexpected memory issue. On the Ray dashboard, Ray's memory usage goes up after I run a complex computation and is never released. After that, later computations may fail because a worker runs out of memory and is killed.

I investigated this issue in two ways, but neither helped. First, on a Ray node, I ran the “ray memory” CLI command and did not find any object using a lot of memory.

Second, I checked the Ray dashboard and ran the htop command on one Ray node, following troubleshooting-out-of-memory-how-to-detect.

One node has 50+ workers. After the complex computation, every worker occupies several hundred MB to several GB of memory.

Then I looked at the detailed memory information for one worker. Shared memory accounts for most of its usage. By comparison, in the initial Ray state after a reboot, shared memory is much smaller. The memory the worker actually uses on its own is rss - shared = 888 - 615 = 273 MB, which is not very much.
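To make the rss - shared arithmetic explicit, here is a minimal sketch that applies it to both snapshots in this post. The figures are copied from the dashboard output listed in this post; the names SNAPSHOTS and non_shared_mb are only illustrative, not part of Ray.

```python
# Figures in MB, copied from the two dashboard snapshots in this post.
SNAPSHOTS = {
    "after_computation": {"rss": 888.07, "shared": 615.03},
    "after_reboot": {"rss": 148.10, "shared": 67.58},
}


def non_shared_mb(snapshot):
    """Return rss minus shared, i.e. the part of RSS not backed by shared pages."""
    return snapshot["rss"] - snapshot["shared"]


for label, snap in SNAPSHOTS.items():
    print(f"{label}: rss - shared = {non_shared_mb(snap):.2f} MB")
```

By this measure the non-shared part of the worker grew from about 80.5 MB to about 273 MB, while the shared part grew from about 68 MB to about 615 MB.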

After Complex Computation

rss:888.07MB

vms:64.55GB

shared:615.03MB

text:1.89MB

lib:0.0000B

data:1.30GB

dirty:0.0000B

After Reboot

rss:148.10MB

vms:60.06GB

shared:67.58MB

text:1.89MB

lib:0.0000B

data:547.86MB

dirty:0.0000B

So here are my questions.

1. How can I investigate this issue further?

2. Does shared memory stay this high while nothing is being computed because Ray uses an LRU cache, so the memory is not released even when the variables are no longer referenced?

3. I found “However, they are not using 8 * SHR memory (there’s only 1 copy in the shared memory).” in troubleshooting-out-of-memory-how-to-detect. However, the Ray dashboard seems to sum the SHR of every worker, counting each copy separately. Is that what leads to the large reported memory consumption?
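To illustrate what I mean in question 3, here is a rough sketch comparing a naive sum of per-worker RSS with an estimate that counts the shared pages only once. Everything in it is an assumption for illustration: the helper names, the idea that all workers map the same shared region, and the 50 identical workers built from the single-worker figures above. The collector relies on the third-party psutil package and on worker processes showing "ray::" in their command line, which is what I see in htop; the "shared" field of psutil's memory_info() is only available on Linux.

```python
from typing import Dict, Iterable, List


def summarize(workers: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Compare a naive RSS sum with an estimate that counts shared pages once."""
    rows = list(workers)
    naive = sum(w["rss"] for w in rows)
    private = sum(w["rss"] - w["shared"] for w in rows)
    # Assumption: every worker maps the same shared region, so the largest
    # per-worker "shared" value stands in for the single real copy.
    shared_once = max((w["shared"] for w in rows), default=0.0)
    return {
        "naive_rss_sum": naive,
        "private_sum": private,
        "shared_counted_once": shared_once,
        "estimate": private + shared_once,
    }


def collect_ray_workers() -> List[Dict[str, float]]:
    """Read rss/shared (in MB) for processes that look like Ray workers."""
    import psutil  # third-party; imported here so summarize() works without it

    rows = []
    for proc in psutil.process_iter(["cmdline"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "ray::" not in cmdline:  # assumption: how workers are titled
                continue
            mem = proc.memory_info()
            rows.append({
                "rss": mem.rss / 2**20,
                "shared": getattr(mem, "shared", 0) / 2**20,  # Linux only
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return rows


# Hypothetical node: 50 workers, each with the single-worker figures above.
hypothetical = [{"rss": 888.07, "shared": 615.03} for _ in range(50)]
result = summarize(hypothetical)
for key, value in result.items():
    print(f"{key}: {value:.2f} MB")

# On a real node (Linux, psutil installed): summarize(collect_ray_workers())
```

If the dashboard number is closer to naive_rss_sum than to estimate, that would suggest the shared pages are being counted once per worker.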

Thanks all!