I am running some RLlib experiments on a distributed cluster of ~20 machines, using the nightly build.
I have some trainers that never manage to start rolling out (I do some data prefetch/loading during env init).
I was looking through session_latest.logs.raylet.out and noticed that roughly every 500 ms the node_manager logs a GC request to some of my workers. The log line says:
sending local GC request to N workers. it is due to local memory pressure on the local worker.
If I check htop on my machines and the dashboard, I see that memory usage is < 50% everywhere.
A) Is this normal?
B) Any recommendations on how to debug further?
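To put a number on how often the message shows up, one option is to count matching lines in the raylet log. This is only a minimal sketch: the helper name `count_gc_log_lines` and the `GC_MARKERS` tuple are made up for illustration, and the marker substring is taken from the messages quoted in this thread, so it may need adjusting for other Ray builds.

```python
from typing import Iterable

# Assumption: substring shared by the GC log messages quoted in this thread.
GC_MARKERS = ("GC request to",)


def count_gc_log_lines(lines: Iterable[str]) -> int:
    """Count lines that look like raylet GC-request log messages."""
    count = 0
    for line in lines:
        lowered = line.lower()
        # Case-insensitive match, since the quoted messages differ in casing.
        if any(marker.lower() in lowered for marker in GC_MARKERS):
            count += 1
    return count
```

It accepts any iterable of strings, so an open file handle for raylet.out can be passed directly, e.g. `count_gc_log_lines(open(path, errors="replace"))`.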
Oh, we don’t actually trigger GC even though that log line is printed. We always throttle the number of global gc (I think to once per minute at most), so it is just log spam. We will remove that log in https://github.com/ray-project/ray/pull/12773/files
"Sending Python GC request to " << all_workers.size()
<< " workers. It is due to memory pressure on the local node.";
<< " local workers to clean up Python cyclic references.";
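To illustrate why a message can be logged every 500 ms while the actual GC runs far less often, here is a rough Python sketch of throttling. It is not Ray's implementation: the class `ThrottledAction`, its parameters, and the 60-second default (based on the "once per minute" guess above) are all assumptions for illustration.

```python
import time
from typing import Callable, Optional


class ThrottledAction:
    """Run `action` at most once per `min_interval_s`; extra requests are dropped."""

    def __init__(
        self,
        action: Callable[[], object],
        min_interval_s: float = 60.0,  # assumption: "once per minute at most"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._action = action
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_run: Optional[float] = None

    def request(self) -> bool:
        """Return True if the action actually ran, False if it was throttled."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._min_interval_s:
            return False  # a request was made (and could be logged), but nothing ran
        self._last_run = now
        self._action()
        return True
```

For example, `ThrottledAction(gc.collect).request()` can be called every half second, yet `gc.collect` would run at most once a minute. If the log statement sits before the throttle check, it fires on every request, which is how spam like the above can appear without any real GC pressure.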