How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I have actually been stuck on this memory leak for a while. I was originally using Ray Tune to set up and run my simple experiment. After hundreds of thousands of training iterations, the program would crash with an out-of-memory error. I tried configuring the program to use different numbers of workers, running with and without my GPU, and testing the environment itself for memory leaks (outside of Ray).
Because I couldn't prevent the program from running out of memory, and I wasn't at the point where I wanted to tune my algorithm, I changed my program to configure the algorithm directly with PPOConfig. I configured my PPO algorithm to use the MemoryTrackingCallbacks callback. I started off testing only tens of iterations with a GPU and 6 workers. The memory leak persisted, so I tried using only a single worker without a GPU. With a single worker I wasn't running out of memory on a short test of a couple hundred iterations. However, after that test succeeded, I tried 2 workers with no GPU and the RAM usage percentage started to tick up pretty quickly.
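For context, the original Tune-driven setup looked roughly like this (a minimal sketch, not my exact script; the env name, worker count, and stop criterion are placeholders):

```python
from ray import tune

# Rough sketch of the original Ray Tune-driven PPO run (Ray 2.0.x style);
# "MyCustomEnv", num_workers, num_gpus, and the stop criterion are placeholders
# that were varied between runs while debugging.
tune.run(
    "PPO",
    config={
        "env": "MyCustomEnv",   # placeholder name for the custom environment
        "num_workers": 6,       # also tried other worker counts
        "num_gpus": 1,          # also tried 0 (CPU-only)
        "framework": "torch",
    },
    stop={"training_iteration": 500_000},
)
```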
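The direct PPOConfig setup was roughly along these lines (a minimal sketch, assuming the Ray 2.0.x builder API; the env name and worker/GPU counts are placeholders that I varied between tests):

```python
from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
from ray.rllib.algorithms.ppo import PPOConfig

# Minimal sketch: configure PPO directly and attach MemoryTrackingCallbacks
# so per-episode memory stats land in the custom metrics.
config = (
    PPOConfig()
    .environment(env="MyCustomEnv")        # placeholder env name
    .rollouts(num_rollout_workers=2)       # also tested 1 and 6 workers
    .resources(num_gpus=0)                 # also tested with 1 GPU
    .callbacks(MemoryTrackingCallbacks)
    .framework("torch")
)
algo = config.build()

for i in range(200):
    result = algo.train()
    print(i, result["episode_reward_mean"])
```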
I went through the data from the custom metrics of the MemoryTrackingCallbacks callback, and here are some graphs that show which metrics were ticking up along with the RAM usage percentage.
I am not really sure what to do next. I have been stuck with this memory leak for a while now. I have read multiple threads here and on Stack Overflow. I have read through the documentation and the source code. I'm stuck.
I think it is important to note that I am running this training program in a Docker container, as I read somewhere that Linux cgroups could be causing this problem.
I should also note that I am using Python 3.7 and ray[rllib]==2.0.1.
I agree that it doesn't really clear much up. I have a custom model and environment that I am working with, and I think it would be hard to get you a reproduction script without sending in the whole repository. Like I said before, I am working inside of a Docker container. However, I am currently running a test outside of a container. If that doesn't clear things up, I will test with a random environment and get you a reproduction script if the memory leak still persists. I'll post back shortly with info about training outside of Docker.
I went back and did a longer (relative to earlier) run with a single worker, both inside and outside of Docker, and there doesn't seem to be any difference in the memory leak, so please ignore what I said earlier about it being related to Docker.
What are your thoughts on the MemoryTrackingCallbacks callback [How To Contribute to RLlib — Ray 3.0.0.dev0] returning worker/data_mean, worker/rss_mean, and vms_mean as some of the top 20 memory users that also tick up with the RAM usage?
I am going to try running the training with the MemoryTrackingCallbacks callback for longer… I'll just wait until overnight to do it, because the callback seems to slow down the training quite a bit.
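For what it's worth, here is a minimal sketch of how those per-iteration metrics could be pulled out of the result dict and logged; the tracked keys and output file are placeholders, and I'm assuming the callback's metrics show up under result["custom_metrics"]:

```python
import csv

# Sketch: log a few MemoryTrackingCallbacks metrics each training iteration
# to a CSV, assuming they appear under result["custom_metrics"].
# `algo` is a PPO algorithm built with MemoryTrackingCallbacks (see earlier sketch).
tracked = ["worker/rss_mean", "worker/data_mean", "worker/vms_mean"]

with open("memory_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["iteration"] + tracked)
    for i in range(1000):
        result = algo.train()
        custom = result.get("custom_metrics", {})
        writer.writerow([i] + [custom.get(key) for key in tracked])
```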
I think the first thing you should do is what @mannyv suggested: replace your env with a random env. It's very little work.
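Something like this should be enough (a rough sketch using RLlib's built-in RandomEnv from ray.rllib.examples.env.random_env; the spaces and episode length below are placeholders, so match them to your real env to keep the comparison fair):

```python
import gym
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.random_env import RandomEnv

# Swap the custom env for RandomEnv, keeping the spaces and episode length
# roughly comparable to the real env (the values here are placeholders).
config = (
    PPOConfig()
    .environment(
        env=RandomEnv,
        env_config={
            "observation_space": gym.spaces.Box(-1.0, 1.0, (8,)),
            "action_space": gym.spaces.Discrete(4),
            "max_episode_len": 200,
        },
    )
    .rollouts(num_rollout_workers=2)
    .resources(num_gpus=0)
)
algo = config.build()

for _ in range(200):
    algo.train()  # if RSS stays flat here, the leak is likely in the custom env/model
```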
What counts as a crazy long rollout depends on your settings. If your episodes are 10k steps long and you set rollout_fragment_length to 10k, RLlib's sample collection code will buffer those 10k samples for each env you are evaluating on. So for 100 workers, that would be 1M samples. That'd be a lot of memory, even for simple Atari envs.
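To make that arithmetic concrete (purely illustrative numbers, not measurements from this thread):

```python
# Back-of-the-envelope estimate of buffered sample memory for the scenario
# above; the per-observation size is an illustrative guess
# (an 84x84x4 uint8 Atari-style frame stack).
rollout_fragment_length = 10_000      # samples buffered per env
num_workers = 100
bytes_per_obs = 84 * 84 * 4           # ~28 KB per observation

buffered_samples = rollout_fragment_length * num_workers    # 1,000,000 samples
approx_gb = buffered_samples * bytes_per_obs / 1e9          # ~28 GB for observations alone
print(f"{buffered_samples:,} samples ≈ {approx_gb:.1f} GB of raw observations")
```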
I used the RandomEnv class for the environment, and the memory leak does not seem to persist. I guess I'll have to check my environment again… I don't believe my episodes are anywhere close to 10k steps long.