[Core] many GC requests from the node manager

raoul-khour-ts · December 17, 2020, 7:03pm

I am running some RLlib experiments on a distributed cluster of ~20 machines, using the nightly.

I have some trainers that never manage to start rolling out (I do some data prefetch/loading during env init).

I was looking through session_latest/logs/raylet.out and noticed that roughly every 500ms the node_manager was issuing a GC request to some of my workers. The log lines say:
sending local GC request to N workers. it is due to local memory pressure on the local worker.

If I check htop on my machines and the dashboard, I see that memory usage is < 50% everywhere.

A) Is this normal?
B) Any recommendations on how to debug further?
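One way to pin down how often these lines really appear is to measure the gaps between their timestamps. The following is only a minimal sketch: it assumes each raylet log line starts with a timestamp of the form "[YYYY-MM-DD HH:MM:SS,mmm", and the function name `gc_request_intervals` and the search string are hypothetical choices for illustration, not part of Ray.

```python
import re
from datetime import datetime

# Assumption: every raylet log line begins with "[YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def gc_request_intervals(lines, needle="GC request"):
    """Return the gaps (in seconds) between consecutive lines containing `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:  # silently skip lines without a parseable timestamp
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]
```

To use it, pass the lines of the raylet log, for example `gc_request_intervals(open(path_to_raylet_out))`, where `path_to_raylet_out` is wherever that file lives on the node.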

rliaw · December 17, 2020, 7:07pm

You’re using the nightly wheels right, @raoul-khour-ts?

cc @sangcho @ericl

raoul-khour-ts · December 17, 2020, 7:19pm

Yeah, I am using the nightly from yesterday.

sangcho · December 17, 2020, 7:20pm

Oh, we don’t actually trigger GC even though that line gets logged. We always throttle the number of global GCs (I think to once per minute at maximum), so it is just a spam log. We will remove that log in https://github.com/ray-project/ray/pull/12773/files
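The throttling idea can be sketched roughly as follows. This is not Ray's implementation: the class name `ThrottledGC`, the injectable `clock` parameter, and the 60-second default (taken from the "once per minute" guess above) are all assumptions for illustration.

```python
import gc
import time


class ThrottledGC:
    """Run gc.collect() at most once per `min_interval_s` seconds (hypothetical helper)."""

    def __init__(self, min_interval_s=60.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the behaviour can be tested
        self._last_run = None    # time of the last collection that actually ran

    def request(self):
        """Return True if a collection ran, False if the request was throttled."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self.min_interval_s:
            return False  # a request arrived (and could be logged), but no GC happens
        self._last_run = now
        gc.collect()
        return True
```

With a wrapper like this, a request arriving every 500ms would still be visible in a log written before the check, while the collection itself runs far less often.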

raoul-khour-ts · December 17, 2020, 7:22pm

That makes sense. But I am also curious why it thinks there is memory pressure. It should not be doing any GC, since my memory usage should be relatively low.

I might be wrong, but it seems like this is causing my dataloader to load forever.

sangcho · December 17, 2020, 7:35pm

Are you seeing messages like this every 500ms?

Sending Python GC request to " << all_workers.size()
                   << " workers. It is due to memory pressure on the local node.";
                   << " local workers to clean up Python cyclic references.";

If so, that’s actually pretty weird.

raoul-khour-ts · December 17, 2020, 7:47pm

It is actually saying:
"node_manager.cc:530: Sending local GC request to n workers. It is due to memory pressure on the local node."

and it might be closer to 750ms

sangcho · December 18, 2020, 10:35am

Hmm, actually I cannot find those log messages in the latest master. Are you really using the nightly? Can you check:

import ray
print(ray.__commit__)

And lmk what’s the commit of ray?

raoul-khour-ts · December 18, 2020, 4:55pm

I have not been using the nightly…

I was still using the 1.1.0.dev wheels (installed with pip install -U).

So yeah, a bit outdated. I'll try the new nightly to see if this is still happening there.

Thanks @sangcho

rliaw · December 20, 2020, 6:13pm

Yeah, try 1.2.0.dev0

raoul-khour-ts · December 21, 2020, 6:41pm

It seems to work there, thanks @rliaw.
