Memory Leak when training PPO on a single agent environment

MrDracoG · December 20, 2022, 4:05am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I have actually been stuck on this memory leak for a while. I was originally using Ray Tune to set up and run my simple experiment. After hundreds of thousands of training iterations, the program would fail with an out-of-memory error. I tried different numbers of workers, running with and without my GPU, and testing the environment itself for memory leaks (external to Ray).

Because I couldn’t prevent the program from running out of memory and I wasn’t at the point where I wanted to tune my algorithm, I changed my program to configure the algorithm directly with PPOConfig. I configured my PPO algorithm to use the MemoryTrackingCallbacks callback. I started off by testing only tens of iterations with a GPU and 6 workers. The memory leak persisted, so I tried using only a single worker without a GPU. With a single worker, I wasn’t running out of memory on a short test of a couple hundred iterations. However, after that test was successful, I tried 2 workers with no GPU and the RAM usage percentage started to tick up pretty quickly.

I went through the data from the custom metrics of the MemoryTrackingCallbacks callback, and here are some graphs that show which metrics were ticking up with the RAM usage percentage.

ram_util_percent
tracemalloc_worker_data_mean
tracemalloc_worker_rss_mean
tracemalloc_worker_vms_mean

I am not really sure what to do next. I have been stuck with this memory leak for a while now. I have read multiple threads here and on Stack Overflow. I have read through the documentation and the source code. I’m stuck.

I think it is important to note that I am running this training program in a Docker container, as I did read somewhere that Linux cgroups could be causing this problem.

I SHOULD ALSO NOTE THAT I AM USING python3.7 and ray[rllib]==2.0.1.

mannyv · December 20, 2022, 12:35pm

Hi @MrDracoG,

Here is a callback I use to help me track down memory leaks. I wrote and used this a while ago so you may need to update the method signature if it has changed.

You can tune how many objects to report on by changing the 50 to a suitable number.

sorted_object_count[:50]

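 import gc
 from collections import defaultdict

 # Import path as of Ray 2.0; it may differ in other versions.
 from ray.rllib.algorithms.callbacks import DefaultCallbacks

 class PythonObjectTrackingCallbacks(DefaultCallbacks):
     def __init__(self):
         super().__init__()

     def on_episode_end(
             self,
             *,
             worker,
             base_env,
             policies,
             episode,
             env_index=None,
             **kwargs):
         # Count every object the garbage collector tracks, keyed by type.
         object_count = defaultdict(int)
         for obj in gc.get_objects():
             object_count[str(type(obj))] += 1

         # Report the 50 most numerous types as custom metrics.
         sorted_object_count = sorted(object_count.items(), key=lambda item: item[1], reverse=True)
         for stat in sorted_object_count[:50]:
             obj_type = stat[0]
             obj_count = stat[1]
             episode.custom_metrics[f"gcobjects/{obj_type}/count"] = obj_count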

arturn · December 20, 2022, 2:35pm

Thanks for this cool code @mannyv !

MrDracoG · December 20, 2022, 4:08pm

I SHOULD ALSO NOTE THAT I AM USING python3.7 and ray[rllib]==2.0.1.

I ran a quick test with the PythonObjectTrackingCallbacks callback and here are the objects that seem to tick up with the memory usage.

ram_util_percent_0
gcobjects_<class 'cell'>_count_mean
gcobjects_<class 'function'>_count_mean
gcobjects_<class 'tuple'>_count_mean

mannyv · December 20, 2022, 4:32pm

@MrDracoG,

Well that really doesn’t clear much up does it? Can you share anything about your configuration or setup? A reproduction script?

One thing I would try next is to swap out the real environment for a Random Env to try to disentangle whether the leak is in the environment or the RL algorithm.

If you have a custom model I would also try switching to a built in model.

ray-project/ray/blob/master/rllib/examples/env/random_env.py

import copy
import gymnasium as gym
from gymnasium.spaces import Discrete, Tuple
import numpy as np

from ray.rllib.examples.env.multi_agent import make_multi_agent


class RandomEnv(gym.Env):
    """A randomly acting environment.

    Can be instantiated with arbitrary action-, observation-, and reward
    spaces. Observations and rewards are generated by simply sampling from the
    observation/reward spaces. The probability of a `terminated=True` after each
    action can be configured, as well as the max episode length.
    """

    def __init__(self, config=None):
        config = config or {}

MrDracoG · December 20, 2022, 4:59pm

I agree that it doesn’t really clear much up. I have a custom model and environment that I am working with. I think it would be hard to get you a reproduction script without sending in the whole repository. Like I said before, I am working inside of a docker container. However, I am currently running a test outside of a container. If that doesn’t clear things up, I will be sure to test a random environment and get you a reproduction script if the memory leak still persists. I’ll post back shortly with info about training outside of docker.

Btw, thank you for taking the time to respond.

MrDracoG · December 20, 2022, 5:03pm

Here is the ram usage outside of docker ( 2 workers )…

ram_util_percent_1

The line seems to be fairly flat and, to me, this suggests that training inside of a Docker container may be causing the memory leak.

Here is the other thread that referenced docker containers and linux cgroups related to a memory leak: Help debugging a memory leak in rllib

I would also like to note that I don’t think there was a memory leak when using a single worker ( and no gpu ) inside of a docker container. I will go back and check that out.

MrDracoG · December 20, 2022, 5:45pm

Never mind, even with a single worker there is a memory leak inside of a Docker container.

ram_util_percent_3

arturn · December 20, 2022, 7:11pm

We also run some memory leak tests. Specifically for PPO. So I’d say it’s not super likely that this stems from inside RLlib itself. This can also happen if your rollouts are crazy long I guess.

MrDracoG · December 20, 2022, 11:57pm

I went back and did a longer (relative to earlier) run with a single worker both inside and outside of docker today and it doesn’t seem like there is a difference for the memory leak, so please ignore what I said earlier about it being in docker.

Inside docker:
ram_util_percent_4

Outside docker:
ram_util_percent_5

MrDracoG · December 21, 2022, 12:03am

Yeah, I think that it is possible to be external to RLlib itself.

What would be an example of a crazy long rollout?

I appreciate your response.

MrDracoG · December 21, 2022, 12:09am

What are your thoughts on the MemoryTrackingCallbacks callback [How To Contribute to RLlib — Ray 3.0.0.dev0] reporting worker/data_mean, worker/rss_mean, and worker/vms_mean among the top 20 memory users, all of which also tick up with the RAM usage?

I am gonna try to run the training with the MemoryTrackingCallbacks callback longer… just gonna wait until overnight to do it because the callback seems to slow down the training quite a bit.

arturn · December 21, 2022, 9:06am

I think the first thing you should do would be what @mannyv suggested - replace your env with a random env. It’s very little work.

A crazy long rollout depends on your setting. If your episodes are 10k steps long and you set rollout_fragment_length to 10k, RLlib’s sample collection code will buffer these 10k samples for each env you are sampling from. So for 100 workers with one env each, that would be 1M samples. That’d be a lot of memory even for simple Atari envs.
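To put rough numbers on that, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the observation shape (84x84 frames stacked 4 deep, stored as uint8) is an assumption about a typical preprocessed Atari setup, not something taken from this thread. It also only counts observations, so treat the result as an order-of-magnitude estimate.

```python
def buffered_samples(rollout_fragment_length, num_workers, envs_per_worker=1):
    """Samples held in rollout buffers across all workers."""
    return rollout_fragment_length * num_workers * envs_per_worker


def observation_bytes(obs_shape, bytes_per_element=1):
    """Size of one observation; bytes_per_element=1 corresponds to uint8."""
    size = bytes_per_element
    for dim in obs_shape:
        size *= dim
    return size


if __name__ == "__main__":
    samples = buffered_samples(rollout_fragment_length=10_000, num_workers=100)
    # ASSUMPTION: 84x84 frames stacked 4 deep, stored as uint8.
    per_obs = observation_bytes((84, 84, 4))
    total_gb = samples * per_obs / 1e9
    print(f"{samples:,} samples, roughly {total_gb:.1f} GB of observations")
```

If the estimate for your own fragment length, worker count, and observation size is small compared to what the machine is losing, long rollouts are probably not the cause.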

arturn · December 21, 2022, 9:07am

Please use @mannyv’s advice or post a reproduction script.

MrDracoG · December 21, 2022, 4:52pm

I used the RandomEnv class for the environment. It doesn’t seem like the memory leak persists. I guess I’ll have to check my environment again… I don’t believe my episodes are anywhere close to 10k steps long.

ram_util_percent_6_random_env
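One way to check an environment for leaks with no RLlib involved is to step it in a plain loop and compare `tracemalloc` snapshots. The sketch below uses only the standard library. `LeakyEnv` is a hypothetical stand-in that leaks on purpose (it is not my actual environment), and its `step` returns a simplified 4-tuple rather than following any particular gym version.

```python
import tracemalloc


class LeakyEnv:
    """Hypothetical stand-in for a custom env; it leaks by design."""

    def __init__(self):
        self.history = []  # never cleared, so it grows across episodes

    def reset(self):
        return [0.0] * 64

    def step(self, action):
        obs = [float(action)] * 64
        self.history.append(obs)  # the leak: every observation is retained
        return obs, 0.0, False, {}


def allocation_growth(env, episodes=20, steps_per_episode=200, top=5):
    """Step `env` in a plain loop and report the lines where memory grew."""
    tracemalloc.start()
    env.reset()  # warm up so one-time allocations are not counted as growth
    before = tracemalloc.take_snapshot()
    for _ in range(episodes):
        env.reset()
        for _ in range(steps_per_episode):
            env.step(0)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Entries are StatisticDiff objects, largest absolute size change first.
    return after.compare_to(before, "lineno")[:top]


if __name__ == "__main__":
    for stat in allocation_growth(LeakyEnv()):
        print(stat)
```

With a real environment, a line whose `size_diff` keeps growing as `episodes` increases is a good place to start looking.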

Thank you for your help.

MrDracoG · December 24, 2022, 3:41am

Can confirm that I found a leak in my environment. Sorry for wasting the time. Thanks for the help!

ram_util_percent_7
