System information
- OS Platform and Distribution : Linux CentOS 7.9.2009
- Ray installed from (source or binary) : source
- Ray version : 1.2.0
- Python version : Python 3.6.12
I have around 500 GiB of memory available for training PPO on my server. I'm using 12 CPUs in total (3 trials, each with 3 rollout workers and 1 trainer).
From the moment it starts running, the whole process takes up around 340 GiB, and the usage gradually climbs. Right now, after only two and a half hours, it is already at 375 GiB. I do have train_batch_size set to 5000, 15000, and 25000 for the three trials respectively, so that should account for some memory. Nevertheless, the amount of memory it requires, and the speed at which it grows, are incomprehensible to me. The training usually halts after a few hours with a low-memory error.
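For reference, this is roughly how the trials are launched (a simplified sketch; the environment name and framework below are placeholders, and my actual config has more settings):

```python
import ray
from ray import tune

ray.init()

# Three trials, one per train_batch_size value; each trial uses
# 1 trainer (driver) CPU plus 3 rollout workers = 4 CPUs, 12 in total.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",   # placeholder for my actual environment
        "framework": "torch",   # placeholder
        "num_workers": 3,       # 3 rollout workers per trial
        "train_batch_size": tune.grid_search([5000, 15000, 25000]),
    },
)
```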
The interesting thing is that when I ran 10 trials (each also with 3 rollout workers and 1 trainer, so 40 CPUs in total) on my other server, which has around 250 GiB of memory, it initially took up only 32 GiB. It still eventually failed after 8 hours due to low memory, though.
Can someone explain why the behavior is so different on each server? Most of all, what should I do to limit memory usage so that training continues without stopping abruptly?
Thanks in advance.