Memory Leak when training PPO on a single agent environment

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I have been stuck on this memory leak for a while. I was originally using Ray Tune to set up and run my simple experiment. After hundreds of thousands of training iterations, the program would crash with an out-of-memory error. I tried configuring the program to use different numbers of workers, running it with and without my GPU, and testing the environment itself for memory leaks (outside of Ray).

Because I couldn’t prevent the program from running out of memory and I wasn’t at the point where I wanted to tune my algorithm, I changed my program to configure the algorithm directly with PPOConfig. I configured my PPO algorithm to use the MemoryTrackingCallbacks callback. I started off by testing only tens of iterations with a GPU and 6 workers. The memory leak persisted, so I tried using only a single worker without a GPU. With a single worker, I wasn’t running out of memory on a short test of a couple hundred iterations. However, after that test was successful, I tried 2 workers with no GPU and the RAM usage percentage started to tick up pretty quickly.

I went through the custom metrics reported by MemoryTrackingCallbacks, and here are some graphs showing which metrics were ticking up along with the RAM usage percentage.

ram_util_percent
tracemalloc_worker_data_mean
tracemalloc_worker_rss_mean
tracemalloc_worker_vms_mean

I am not really sure what to do next. I have been stuck with this memory leak for a while now. I have read multiple threads here and on Stack Overflow. I have read through the documentation and the source code. I’m stuck.

I think it is important to note that I am running this training program in a Docker container, as I read somewhere that Linux cgroups could be causing this problem.

I should also note that I am using Python 3.7 and ray[rllib]==2.0.1.

Hi @MrDracoG,

Here is a callback I use to help me track down memory leaks. I wrote and used this a while ago so you may need to update the method signature if it has changed.

You can tune how many objects to report on by changing the 50 in sorted_object_count[:50] to a suitable number.

 import gc
 from collections import defaultdict

 # Import path as of Ray 2.x; adjust it if your version differs.
 from ray.rllib.algorithms.callbacks import DefaultCallbacks


 class PythonObjectTrackingCallbacks(DefaultCallbacks):
     def __init__(self):
         super().__init__()

     def on_episode_end(
             self,
             *,
             worker,
             base_env,
             policies,
             episode,
             env_index=None,
             **kwargs):
         # Count every object the garbage collector currently tracks, grouped by type.
         object_count = defaultdict(int)
         for obj in gc.get_objects():
             object_count[str(type(obj))] += 1

         # Report the most numerous types as custom metrics.
         sorted_object_count = sorted(object_count.items(), key=lambda item: item[1], reverse=True)
         for obj_type, obj_count in sorted_object_count[:50]:
             episode.custom_metrics[f"gcobjects/{obj_type}/count"] = obj_count
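
The callback above reports absolute counts, so the types that matter are the ones whose counts keep growing. As a minimal standalone sketch of that idea (standard library only, no RLlib), the snippet below takes two snapshots of garbage-collector object counts and lists the types that grew in between. The names count_objects_by_type, growth_between, and Leaky are hypothetical and exist only for this illustration.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Return a Counter mapping type name -> number of gc-tracked objects."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth_between(before, after, top_n=10):
    """Return up to top_n (type_name, delta) pairs with the largest positive growth."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    growing = [(name, delta) for name, delta in deltas.items() if delta > 0]
    return sorted(growing, key=lambda item: item[1], reverse=True)[:top_n]


class Leaky:
    """Hypothetical stand-in for an object that is accidentally kept alive."""

    def __init__(self):
        self.payload = [0] * 10


if __name__ == "__main__":
    kept_alive = []  # simulates a container that is never cleared
    before = count_objects_by_type()
    for _ in range(1000):
        kept_alive.append(Leaky())
    after = count_objects_by_type()
    for name, delta in growth_between(before, after):
        print(f"{name}: +{delta}")
```

Note that grouping by type name merges types that share a name across modules; using str(type(obj)) as in the callback avoids that at the cost of longer keys.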

Thanks for this cool code @mannyv !

I ran a quick test with the PythonObjectTrackingCallbacks callback and here are the objects that seem to tick up with the memory usage.

ram_util_percent_0
gcobjects_<class 'cell'>_count_mean
gcobjects_<class 'function'>_count_mean
gcobjects_<class 'tuple'>_count_mean

@MrDracoG,

Well that really doesn’t clear much up does it? Can you share anything about your configuration or setup? A reproduction script?

One thing I would try next is to swap out the real environment for a Random Env, to disentangle whether the leak is in the environment or in the RL algorithm.

If you have a custom model I would also try switching to a built in model.

I agree that it doesn’t really clear much up. I have a custom model and environment that I am working with. I think it would be hard to get you a reproduction script without sending in the whole repository. Like I said before, I am working inside of a docker container. However, I am currently running a test outside of a container. If that doesn’t clear things up, I will be sure to test a random environment and get you a reproduction script if the memory leak still persists. I’ll post back shortly with info about training outside of docker.

Btw, thank you for taking the time to respond.

Here is the ram usage outside of docker ( 2 workers )…

ram_util_percent_1

The line seems to be fairly flat and, to me, this seems to signal that training inside of a Docker container may be causing the memory leak.

Here is the other thread that referenced docker containers and linux cgroups related to a memory leak: Help debugging a memory leak in rllib

I would also like to note that I don’t think there was a memory leak when using a single worker ( and no gpu ) inside of a docker container. I will go back and check that out.

Never mind, even with a single worker there is a memory leak inside of a Docker container.

ram_util_percent_3

We also run memory leak tests, specifically for PPO, so I’d say it’s not very likely that this stems from inside RLlib itself. This can also happen if your rollouts are crazy long, I guess.

I went back and did a longer (relative to earlier) run with a single worker both inside and outside of docker today and it doesn’t seem like there is a difference for the memory leak, so please ignore what I said earlier about it being in docker.

Inside docker:
ram_util_percent_4

Outside docker:
ram_util_percent_5

Yeah, I think it is possible that the leak is external to RLlib itself.

What would be an example of a crazy long rollout?

I appreciate your response.

What are your thoughts on the MemoryTrackingCallbacks callback [How To Contribute to RLlib — Ray 3.0.0.dev0] returning worker/data_mean, worker/rss_mean and worker/vms_mean as some of the top 20 memory users that also tick up with the RAM usage?

I am gonna try to run the training with the MemoryTrackingCallbacks callback longer… just gonna wait until overnight to do it because the callback seems to slow down the training quite a bit.

I think the first thing you should do would be what @mannyv suggested - replace your env with a random env. It’s very little work.

A crazy long rollout depends on your setting. If your episodes are 10k steps long and you set rollout_fragment_length to 10k, RLlib’s sample collection code will buffer these 10k samples for each env you are sampling from. So for 100 workers, that would be 1M samples. That’d be a lot of memory even for simple Atari envs.
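
The arithmetic in this reply can be checked with a short back-of-the-envelope sketch. The helper names below are hypothetical, and the observation size is an assumption (an 84x84x4 uint8 frame stack, a common Atari preprocessing choice), not something stated in this thread. The estimate covers observations only, so real usage would be higher once actions, rewards, and other per-step data are included.

```python
def buffered_samples(rollout_fragment_length, num_workers, envs_per_worker=1):
    """Samples held in collection buffers if every env fills a whole fragment."""
    return rollout_fragment_length * num_workers * envs_per_worker


def estimated_bytes(num_samples, bytes_per_observation):
    """Rough lower bound: observation storage only."""
    return num_samples * bytes_per_observation


# Numbers from the reply above: 10k-step fragments, 100 workers, one env each.
samples = buffered_samples(10_000, 100)

# Assumption for illustration: one observation is 84 * 84 * 4 uint8 values.
obs_bytes = 84 * 84 * 4

total_gib = estimated_bytes(samples, obs_bytes) / 2**30
print(f"{samples} samples, roughly {total_gib:.1f} GiB of observations")
```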

Please use @mannyv’s advice or post a reproduction script.

I used the RandomEnv class for the environment. It doesn’t seem like the memory leak persists. I guess I’ll have to check my environment again… I don’t believe my episodes are anywhere close to 10k steps long.

ram_util_percent_6_random_env
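
Once the RandomEnv run points at the environment, one way to narrow things down is to step the environment in a plain loop, outside of Ray, and compare tracemalloc snapshots. The following is a minimal sketch under stated assumptions: LeakyToyEnv is a hypothetical toy environment with a deliberate leak (it is not the environment from this thread), it uses the older 4-tuple step return, and run_episodes and top_growth are illustrative helper names.

```python
import tracemalloc


class LeakyToyEnv:
    """Hypothetical environment that leaks by never clearing its history."""

    def __init__(self):
        self.history = []  # grows forever: the deliberate leak
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # intentionally does not clear self.history

    def step(self, action):
        self.t += 1
        self.history.append([float(action)] * 100)
        done = self.t >= 50
        return self.t, 0.0, done, {}


def run_episodes(env, episodes):
    """Play full episodes with a constant action."""
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            _, _, done, _ = env.step(0)


def top_growth(env, warmup=5, episodes=20, limit=5):
    """Return the source lines whose allocated memory changed most after warmup."""
    tracemalloc.start()
    try:
        run_episodes(env, warmup)  # let one-time allocations settle first
        before = tracemalloc.take_snapshot()
        run_episodes(env, episodes)
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth(LeakyToyEnv()):
        print(stat)
```

In a healthy environment the size differences after warmup should stay close to zero no matter how many episodes are run; a line whose size keeps growing with the episode count is a good candidate for the leak.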

Thank you for your help.

Can confirm that I found a leak in my environment. Sorry for wasting your time. Thanks for the help!

ram_util_percent_7
