Memory leak in CPU RAM with TF2 eager execution


When I run training with PPO or APPO on my machine (Ubuntu, one 15 GB GPU, and 16 CPU cores with 65 GB of CPU RAM), I run into out-of-memory errors late in the training process. I am using the standard API with num_gpus=1 and num_workers=10. GPU memory remains stable during this time, but after some debugging I can watch each of the rollout workers bloat from ~2 GB of CPU RAM to well over 5 GB, causing my server to choke and kill the process.
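For context, the setup looks roughly like this (a minimal sketch; "MyCustomEnv-v0" is a placeholder name for my custom environment, and only num_gpus, num_workers, and the framework setting come from the description above):

```python
# Minimal sketch of the RLlib trainer config described above.
# "MyCustomEnv-v0" is a placeholder for the registered custom environment.
config = {
    "env": "MyCustomEnv-v0",  # placeholder: my custom environment
    "num_gpus": 1,            # single GPU for the learner process
    "num_workers": 10,        # rollout workers; these are what bloat in CPU RAM
    "framework": "tf2",       # TF2 eager execution, where the leak shows up
}

# The config would then be passed to a trainer and run in a loop, e.g.:
#   from ray.rllib.agents.ppo import PPOTrainer
#   trainer = PPOTrainer(config=config)
#   while not done:
#       trainer.train()
```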

I have not been able to figure out the cause of this yet. I am using a custom environment, but I have not seen this issue when training other agents on it. I also did not have this issue with TF1. Any thoughts?

EDIT: It looks like this is a relatively long-standing issue. I was able to get RLlib to work fine with torch, so this is no longer an issue for me.
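For anyone hitting the same leak: the workaround was just switching the framework key in the trainer config to the PyTorch backend (a sketch; "MyCustomEnv-v0" is a placeholder for the custom environment):

```python
# Same training setup, but on the PyTorch backend instead of TF2 eager.
config = {
    "env": "MyCustomEnv-v0",  # placeholder custom env
    "num_gpus": 1,
    "num_workers": 10,
    "framework": "torch",     # PyTorch backend; the CPU-RAM bloat went away here
}
```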


Hey @Samuel_Showalter , sorry for stalling on this. I agree, we do have an open flank on tf-eager/tf2. It’s on our TODO list to make RLlib with tf-eager more performant and stable.