Expected RAM usage for PPOTrainer (debugging memory leaks)

mjlbach · March 16, 2022, 4:19pm

I’m using the latest Ray (1.11) and noticed that the PPO trainer uses a large amount of RAM, roughly 4 GB per worker, with an apparent memory leak: training always crashes with an OOM error about 500-600k environment steps in. Does anyone have suggestions for debugging this? My rollout buffer size is 4096, and my observations are an RGBD image and a 2D bird’s-eye map with two channels (128 x 128 x 3: float32, 128 x 128 x 1: float32, 128 x 128 x 2: float32).
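For scale, here is a rough back-of-the-envelope estimate of how much memory the observations alone occupy for one rollout buffer of this size. It is only a sketch: the dictionary keys, the function names, and the `copies` parameter are made up for illustration, and it assumes each observation is stored once as raw float32 with no compression. Actual RLlib usage can be higher, since batches may be copied between workers and the driver.

```python
import math

# Observation components from the post: (height, width, channels), all float32.
# The key names are illustrative labels, not RLlib identifiers.
OBS_SHAPES = {
    "rgb": (128, 128, 3),
    "depth": (128, 128, 1),
    "birds_eye_map": (128, 128, 2),
}
BYTES_PER_FLOAT32 = 4


def obs_bytes(shapes=OBS_SHAPES):
    """Bytes needed to store one full observation as float32."""
    return sum(math.prod(shape) for shape in shapes.values()) * BYTES_PER_FLOAT32


def buffer_bytes(num_steps, copies=1):
    """Bytes for num_steps observations; `copies` is a hypothetical multiplier
    for extra copies held elsewhere (for example worker plus driver)."""
    return num_steps * obs_bytes() * copies


per_obs = obs_bytes()
per_buffer = buffer_bytes(4096)
print(f"one observation: {per_obs / 2**10:.0f} KiB")
print(f"4096-step buffer: {per_buffer / 2**30:.2f} GiB")
```

Under these assumptions a single observation is 384 KiB and a 4096-step buffer holds about 1.5 GiB of observations, before counting any additional copies.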

avnishn · May 17, 2022, 12:48am

Can you share the script that you use to launch your experiment? This would be a good starting place in helping you out.

If you don’t want to share your environment, you can substitute the RLlib random env for it:

mjlbach · May 19, 2022, 9:25pm

If you are ok downloading our assets (please message me on slack if you have an issue), I made this reproduction:

I’ll try to reproduce on random env.

mjlbach · May 19, 2022, 10:12pm

This takes up 18 gb roughly for me, does that sound about right?

avnishn · May 19, 2022, 10:15pm

Thanks for sharing, I’ll take a look.

mjlbach · August 12, 2022, 1:53pm

Hi @avnishn! Any updates?

arturn · September 14, 2022, 9:00am

Hi @mjlbach ,

Sorry this took so long.
I’ve tried to reproduce this on Anyscale on master. I can’t reproduce with the provided script (thanks for providing a proper repro script!).
Here’s a screenshot of the resource consumption of roughly 10 hours of training:

Can you confirm that this error still occurs on your side on master, or at least on 1.13, before I try to investigate why it would occur in your environment?

Cheers

mjlbach · September 14, 2022, 9:05pm

Hi @arturn,

I still managed to reproduce this with ray 2.0 (14 gb of memory, same conda environment.yaml modified to use ray 2.0). I also tried with the latest nightly from today, same memory usage.

arturn · September 14, 2022, 9:35pm

The 3GB of memory that you are reading are from after the experiment was run. During the experiment it was very close to 14GB, just as with your run.
Can you reproduce the OOM with the nightlies?

mjlbach · September 14, 2022, 9:50pm

There’s no OOM, just the 14gb of memory usage. Is this memory usage to be expected?

arturn · September 15, 2022, 8:08am

14GB of memory used on a node that processes batches of roughly 800MB does not seem too high to me.
Your post says you are OOMing and the title includes the words “memory leak”, so I assumed that you were facing an OOM.
