System information
- OS Platform and Distribution : Linux CentOS 7.9.2009
- Ray installed from (source or binary) : source
- Ray version : 1.2.0
- Python version : Python 3.6.12
I have around 500 GiB of memory available for training PPO on my server. I'm using 12 CPUs in total (3 trials, each with 3 rollout workers and 1 trainer).
From the moment it starts running, the whole process takes up around 340 GiB, and the usage gradually climbs. Right now, after only two and a half hours, it is already at 375 GiB. I do have train_batch_size set to 5000, 15000, and 25000 for the three trials respectively, so that should account for some memory. Nevertheless, the amount of memory it requires, and the speed at which it grows, are incomprehensible to me. The training usually halts after a few hours with a low-memory error.
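For reference, this is roughly how the trials are launched (a simplified sketch; the environment name and framework below are placeholders, and my actual config has more settings):

```python
import ray
from ray import tune

ray.init()

# Three trials, one per train_batch_size value; each trial uses
# 1 trainer (driver) CPU plus 3 rollout workers = 4 CPUs, 12 in total.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",   # placeholder for my actual environment
        "framework": "torch",   # placeholder
        "num_workers": 3,       # 3 rollout workers per trial
        "train_batch_size": tune.grid_search([5000, 15000, 25000]),
    },
)
```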
The interesting thing is that when I ran 10 trials (each also with 3 rollout workers and 1 trainer, so 40 CPUs in total) on my other server, which has around 250 GiB of memory, it initially took up only 32 GiB. It still eventually failed after 8 hours due to low memory, though.
Can someone explain why the behavior is so different on each server? Most of all, what should I do to limit memory usage so that training continues without stopping abruptly?
Thanks in advance.