I’m using the latest Ray (1.11) and noticed that the PPO trainer takes a fairly large amount of RAM, roughly 4 GB per worker, with an apparent memory leak (training inevitably crashes due to OOM about 500-600k environment steps in). Does anyone have suggestions for debugging this? My rollout buffer size is 4096, and my observations are an RGBD image and a 2D bird's-eye map with two channels (128 x 128 x 3: float32, 128 x 128 x 1: float32, 128 x 128 x 2: float32).
Can you share the script that you use to launch your experiment? This would be a good starting place in helping you out.
If you don’t want to share your environment, you can substitute the RLlib random env for it:
If you are OK with downloading our assets (please message me on Slack if you have an issue), I made this reproduction:
I’ll try to reproduce on random env.
This takes up roughly 18 GB for me. Does that sound about right?
Thanks for sharing, I’ll take a look.
Hi @avnishn! Any updates?
Hi @mjlbach,
Sorry this took so long.
I’ve tried to reproduce this on Anyscale on master. I can’t reproduce with the provided script (thanks for providing a proper repro script!).
Here’s a screenshot of the resource consumption of roughly 10 hours of training:
Can you confirm that this error still occurs on your side on master, or at least on 1.13, before I try to investigate why it would occur in your environment?
I still managed to reproduce this with Ray 2.0 (14 GB of memory, same conda environment.yaml modified to use Ray 2.0). I also tried with the latest nightly from today and saw the same memory usage.
The 3 GB of memory that you are reading is from after the experiment was run. During the experiment it was very close to 14 GB, just as with your run.
Can you reproduce the OOM with the nightlies?
There’s no OOM, just the 14 GB of memory usage. Is this memory usage to be expected?
14GB of memory used on a node that processes batches of roughly 800MB each does not seem too high to me.
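As a rough sanity check on this kind of batch-size reasoning, here is a back-of-the-envelope sketch in Python. It uses the observation shapes and the 4096-step rollout buffer quoted in the original post. The helper names `obs_bytes` and `batch_bytes` are made up for illustration, and the estimate counts only one raw copy of the observations, not RLlib's actual bookkeeping or any other overhead.

```python
from math import prod

# Observation shapes quoted in the original post (all float32).
OBS_SHAPES = [(128, 128, 3), (128, 128, 1), (128, 128, 2)]
FLOAT32_BYTES = 4
ROLLOUT_STEPS = 4096  # rollout buffer size from the original post


def obs_bytes(shapes, itemsize=FLOAT32_BYTES):
    """Raw bytes needed to store one observation made of several arrays."""
    return sum(prod(shape) for shape in shapes) * itemsize


def batch_bytes(shapes, steps, itemsize=FLOAT32_BYTES):
    """Raw bytes for a single copy of `steps` observations."""
    return obs_bytes(shapes, itemsize) * steps


per_obs = obs_bytes(OBS_SHAPES)
per_batch = batch_bytes(OBS_SHAPES, ROLLOUT_STEPS)
print(f"{per_obs / 2**10:.0f} KiB per observation")
print(f"{per_batch / 2**30:.2f} GiB per {ROLLOUT_STEPS}-step buffer")
```

The settings behind the 800MB figure are not shown in this thread, so the numbers need not match exactly. The point of the sketch is only that a single copy of the observations is already large, and that a few coexisting copies (for example, current and next observations, or worker-side and learner-side batches) could plausibly add up to several times that.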
Your post says you are OOMing and the title includes the words "memory leak", so I assumed that you were facing an OOM.
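To tell a leak apart from memory usage that is high but steady, one option is to sample memory after every training iteration and check whether it keeps rising once warm-up is over. The sketch below does this with the standard-library `tracemalloc` module. The names `sample_memory`, `looks_like_leak`, `leaky_step`, and `steady_step` are hypothetical, and the two step functions are stand-ins for a real training iteration. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory held by native code or a shared object store would need a process-level measurement instead.

```python
import tracemalloc


def sample_memory(step_fn, iterations):
    """Run step_fn repeatedly and record traced memory after each call."""
    tracemalloc.start()
    samples = []
    for _ in range(iterations):
        step_fn()
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples


def looks_like_leak(samples, warmup=2, threshold_bytes=1_000_000):
    """Heuristic: did usage grow by more than threshold_bytes after warm-up?"""
    steady = samples[warmup:]
    if len(steady) < 2:
        return False
    return steady[-1] - steady[0] > threshold_bytes


# Stand-in for an iteration that keeps references alive (a leak).
leaked = []


def leaky_step():
    leaked.append(bytearray(1_000_000))


# Stand-in for an iteration that frees its scratch memory.
def steady_step():
    scratch = bytearray(1_000_000)
    del scratch


print("leaky: ", looks_like_leak(sample_memory(leaky_step, 10)))
print("steady:", looks_like_leak(sample_memory(steady_step, 10)))
```

A run that plateaus at a high value, like the 14 GB reported here, would come out as steady under this heuristic, whereas a run that eventually hits OOM would be expected to show continued growth.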