Ray PPO :: Memory keeps increasing

System information

  • OS Platform and Distribution : Linux CentOS 7.9.2009
  • Ray installed from (source or binary) : source
  • Ray version : 1.2.0
  • Python version : Python 3.6.12

I have around 500 GiB of memory available for training PPO on my server. I'm using around 12 CPUs in total (3 trials, each with 3 rollout workers and 1 trainer); a rough sketch of the setup is below.
From the moment it starts running, the whole process takes up around 340 GiB and gradually increases. Right now, after only two and a half hours, it is already using 375 GiB. I do have train_batch_size set to 5000, 15000, and 25000 for the three trials respectively, so that should account for some memory. Nevertheless, I can't make sense of how much memory it requires or how quickly it grows. Training usually halts after a few hours with an out-of-memory error.
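
For reference, my setup looks roughly like this (the env name here is just a placeholder for my custom env; only the config values relevant to the question are shown):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",      # placeholder for my custom env
        "framework": "torch",
        "num_workers": 3,          # 3 rollout workers per trial
        # one trial per batch size -> 3 trials, 4 CPUs each = 12 CPUs total
        "train_batch_size": tune.grid_search([5000, 15000, 25000]),
    },
    stop={"training_iteration": 1000},
)
```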

The interesting thing is that when I ran 10 trials (each also with 3 rollout workers and 1 trainer, so 40 CPUs in total) on my other server, which has around 250 GiB, it initially took up only 32 GiB. It still eventually failed after 8 hours with an out-of-memory error, though.

Can someone explain why the behavior is so different on each server? Most of all, what should I do to limit memory usage so that training continues without stopping abruptly?

Thanks in advance.

Sorry, but it’s impossible to tell from this distance.
We are currently not aware of any memory leaks in PPO (neither tf nor torch). Could it possibly be your env that's leaking? I'm currently running VizDoom trials with PPO and attention nets on a 4-GPU machine and am not observing any memory leaks; it has been running very stably for several days.
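
If you want to rule out the env quickly, a rough sketch like the one below steps a single env instance in a loop outside of RLlib and prints the process RSS (assuming gym and psutil are installed; CartPole is just a stand-in for your custom env constructor). If memory climbs steadily here, the leak is in the env rather than in PPO:

```python
import os

import gym
import psutil


def check_env_for_leak(env_maker, num_steps=200_000, report_every=10_000):
    """Step a single env instance and watch this process's resident memory."""
    proc = psutil.Process(os.getpid())
    env = env_maker()
    obs = env.reset()
    for i in range(1, num_steps + 1):
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
        if i % report_every == 0:
            rss_mb = proc.memory_info().rss / 1024 ** 2
            print(f"step {i}: RSS = {rss_mb:.1f} MiB")


# Swap in your own env constructor here.
check_env_for_leak(lambda: gym.make("CartPole-v0"))
```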