Hi,
When I run training with PPO or APPO on my machine (Ubuntu, one 15 GB GPU, 16 CPU cores, 65 GB of CPU RAM), I hit out-of-memory errors late in the training process. I am using the standard API with num_gpus=1 and num_workers=10. GPU memory stays stable the whole time, but after some debugging I can watch each rollout worker bloat from ~2 GB to well over 5 GB of CPU RAM, at which point the server chokes and kills the process.
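For reference, the setup described above roughly corresponds to a config like this. The env id and framework are placeholders/assumptions (I'm guessing tf2, since tf1 didn't show the problem); only num_gpus and num_workers are the actual values from the run:

```python
# Sketch of the relevant trainer config. "MyCustomEnv-v0" is a hypothetical
# stand-in for the custom environment; "framework" is an assumption.
config = {
    "env": "MyCustomEnv-v0",  # hypothetical custom env id
    "framework": "tf2",       # assumption: tf2 (leak not seen under tf1)
    "num_gpus": 1,            # single 15 GB GPU for the learner
    "num_workers": 10,        # 10 rollout workers, ~2 GB CPU RAM each at start
}
```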
I have not been able to figure out the cause yet. I am using a custom environment, but I have not seen this issue when training other agents on it, and I did not have this issue with tf1. Any thoughts?
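In case anyone wants to reproduce the measurement: the per-worker growth I mentioned can be tracked with a stdlib-only sketch like the one below (Linux-only, since it reads /proc; psutil would be the portable alternative). The PID would come from whatever the worker processes report:

```python
import os

def rss_mb(pid: int) -> float:
    """Current resident set size of process `pid` in MB, via Linux /proc.

    /proc/<pid>/statm's second field is the number of resident pages.
    """
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 ** 2)

# Example: check our own process; for the leak, poll each worker PID
# periodically and log the trend.
print(f"current RSS: {rss_mb(os.getpid()):.1f} MB")
```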