[RLlib][Tune] Major memory leak 80GB (!) in 3 days (!)

holinov · April 27, 2021, 11:15am

I have a single-node, multi-GPU Ray cluster (v1.2.0).
I'm training a PPO agent and I'm seeing major memory leaks.
The trials in the screenshot have nearly the same configuration (only the reward-function hyperparameters differ).
As you can see, some trials consume much more memory than others, and their memory consumption keeps growing.
A freshly started trial consumes only 1.2 GB.

I could share some parts of my configuration if needed.
Does anyone know why this could happen and how to overcome it?

sven1977 · June 3, 2021, 2:15pm

Hey @holinov, thanks for the question. We recently fixed a leak in the SampleCollector (the fix should be in 1.3 or in current master), thanks to @Bam4d.
Could you post your config here as well? Or, even better, a small reproduction script so we can debug this?

Thanks!
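
For anyone putting together such a reproduction script, here is a minimal sketch of one way to check whether memory grows from iteration to iteration. It uses only the standard-library `tracemalloc` module. The names `measure_growth` and `leaky_step` are hypothetical and not part of RLlib; `leaky_step` is just a stand-in for whatever performs one training iteration in your setup. Note that `tracemalloc` only sees allocations made through Python's allocator, so a leak in native code (for example inside a deep-learning framework) would not show up here and would need a process-level measurement instead.

```python
import tracemalloc


def measure_growth(step, iterations=5):
    """Call step() repeatedly; return the traced Python heap size in bytes after each call."""
    tracemalloc.start()
    sizes = []
    try:
        for _ in range(iterations):
            step()
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
    finally:
        tracemalloc.stop()
    return sizes


# Hypothetical stand-in for one training iteration.
# It deliberately keeps a reference to every buffer so memory never gets freed.
_leaked = []


def leaky_step():
    _leaked.append(bytearray(100_000))


if __name__ == "__main__":
    for i, size in enumerate(measure_growth(leaky_step), start=1):
        print(f"iteration {i}: {size / 1e6:.2f} MB traced")
```

If the reported size climbs steadily while the workload per iteration stays the same, that points at objects being retained on the Python side. Posting those numbers together with the config should make the report easier to debug.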
