(raylet) node_manager.cc Workers (tasks / actors) killed due to memory pressure (OOM)

Hello,
I recently updated Ray to the latest version (2.9.3), and ever since I have been getting this OOM error. My tuning process runs fine for a few trials, but then it fails with the error above. It looks as though memory is not being freed between trials on the same worker node. With v2.5.1 I didn’t have this issue. I doubled the amount of memory available on the worker nodes, to no effect. Any ideas as to why this is happening?

@denmarc Do you have a simple reproducible script for your workload? It’s hard to know what’s going on without it.

I ended up noticing that it was the head node, not the workers, that needed more memory. Giving it more memory solved the issue.
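For anyone hitting the same thing, here is a rough sketch of how one might check whether a given node (for example the head node) is the one running short on memory. It is not part of Ray: it only reads the Linux `/proc/meminfo` file with the standard library, and the helper names (`parse_meminfo`, `used_fraction`, `near_threshold`) are made up for illustration. The 0.95 default is an assumption based on the documented default of Ray's `RAY_memory_usage_threshold`; Ray's own memory monitor may compute usage differently, so treat the number as an approximation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def used_fraction(info):
    """Fraction of memory in use, estimated as (MemTotal - MemAvailable) / MemTotal."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total


def near_threshold(info, threshold=0.95):
    """True if estimated usage is at or above the threshold (0.95 is an assumed default)."""
    return used_fraction(info) >= threshold


if __name__ == "__main__":
    # Run this on the node you suspect, e.g. the head node (Linux only).
    with open("/proc/meminfo") as f:
        node_info = parse_meminfo(f.read())
    print(f"memory used: {used_fraction(node_info):.1%}")
    print(f"near OOM threshold: {near_threshold(node_info)}")
```

Running it on the head node and on a worker while the tuning job is in progress should show which of them is actually close to the limit.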
