Help debugging a memory leak in rllib

Bam4d · May 8, 2021, 7:10am

I’m trying to debug a very slow memory leak in rllib that occurs when I am using IMPALA + multi-agent.

I cannot find any leak using tools like tracemalloc, so I don’t think the memory issue is in Python.
ray memory also does not show any obvious leakage at all.

All of the workers very slowly (over about 8-12 hours) accumulate memory (about 90GB) to the point where processes fail.

The interesting thing about this problem is that I can only reproduce it on a server that is set up with Linux cgroups limiting memory. If I run it on my machine at home (with only 16GB of RAM) the leak disappears and it happily runs at around 7-10GB.

I’m not looking for a “solution” to this problem, just help in how I can find this leak. Are there any tools or methods that can show me what ray workers are putting on the heap (or even off-heap)? As I said, tracemalloc does not show any leak (other than in function_manager.py… but this is only ~ 100K after several hours rather than 80GB).
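
For reference, one way to look for Python-level growth with tracemalloc is to diff two snapshots instead of reading a single one, since the diff ranks allocation sites by how much they grew. This is only a minimal sketch: `leaky_step` and `top_growth` are hypothetical names standing in for the real workload, and only standard-library `tracemalloc` calls are used. If a diff like this stays flat while the process RSS keeps rising, the growth is probably outside what tracemalloc can see (for example native allocations).

```python
import tracemalloc

_retained = []  # hypothetical stand-in for state that is never released


def leaky_step(n_items=1000):
    """Simulate one unit of work that accidentally keeps references alive."""
    _retained.append([bytes(64) for _ in range(n_items)])


def top_growth(before, after, limit=5):
    """Return the allocation sites that grew the most between two snapshots."""
    diffs = after.compare_to(before, "lineno")  # sorted by absolute size change
    return [d for d in diffs if d.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start(10)
    before = tracemalloc.take_snapshot()
    for _ in range(50):
        leaky_step()
    after = tracemalloc.take_snapshot()
    for diff in top_growth(before, after):
        print(diff)
```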

Does ray manage GC itself for worker processes? I’m thinking it’s not taking into account cgroup limits when it’s managing its own internal cleanups (so it thinks there is more memory available than there is).

Any help would be appreciated!

Bam4d · May 12, 2021, 5:01pm

I’ve taken the example from another ticket with the same problem: [rllib] Memory leak in environment worker in multi-agent setup · Issue #9964 · ray-project/ray · GitHub and expanded on it to give more logs using tracemalloc etc… These logs go into tensorflow (or wandb if you uncomment the wandb lines)

Here is a minimal reproduction:

import argparse
import os
import tracemalloc
from typing import Optional, Dict

import numpy as np
import psutil
import ray
from gym.spaces import Box
from ray import tune
from ray.rllib import BaseEnv
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.agents.impala import ImpalaTrainer
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.rllib.evaluation import MultiAgentEpisode
from ray.rllib.policy import Policy  # was mistakenly imported from email.policy
from ray.rllib.utils.typing import PolicyID
from ray.tune.integration.wandb import WandbLoggerCallback
from ray.tune.registry import register_env


class TraceMallocCallback(DefaultCallbacks):

    def __init__(self):
        super().__init__()

        tracemalloc.start(10)

    def on_episode_end(self, *, worker: "RolloutWorker", base_env: BaseEnv, policies: Dict[PolicyID, Policy],
                       episode: MultiAgentEpisode, env_index: Optional[int] = None, **kwargs) -> None:
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics('lineno')

        for stat in top_stats[:5]:
            count = stat.count
            size = stat.size

            trace = str(stat.traceback)

            episode.custom_metrics[f'tracemalloc/{trace}/size'] = size
            episode.custom_metrics[f'tracemalloc/{trace}/count'] = count

        process = psutil.Process(os.getpid())
        mem_info = process.memory_info()  # query once so all three values are consistent
        episode.custom_metrics['tracemalloc/worker/rss'] = mem_info.rss
        episode.custom_metrics['tracemalloc/worker/data'] = mem_info.data
        episode.custom_metrics['tracemalloc/worker/vms'] = mem_info.vms


def dim_to_gym_box(dim, val=np.inf):
    """Create gym.Box with specified dimension."""
    high = np.full((dim,), fill_value=val)
    return Box(low=-high, high=high)


class DummyMultiAgentEnv(MultiAgentEnv):
    """Return zero observations."""

    def __init__(self, config):
        del config  # Unused
        super(DummyMultiAgentEnv, self).__init__()
        self.config = dict(act_dim=17, obs_dim=380, n_players=2, n_steps=1000)
        self.players = ["player_%d" % p for p in range(self.config['n_players'])]
        self.current_step = 0

    def _obs(self):
        return np.zeros((self.config['obs_dim'],))

    def reset(self):
        self.current_step = 0
        return {p: self._obs() for p in self.players}

    def step(self, action_dict):
        done = self.current_step >= self.config['n_steps']
        self.current_step += 1

        obs = {p: self._obs() for p in self.players}
        rew = {p: np.random.random() for p in self.players}
        dones = {p: done for p in self.players + ["__all__"]}
        infos = {p: {'test_thing': 'wahoo'} for p in self.players}

        return obs, rew, dones, infos

    @property
    def observation_space(self):
        return dim_to_gym_box(self.config['obs_dim'])

    @property
    def action_space(self):
        return dim_to_gym_box(self.config['act_dim'])


def create_env(config):
    """Create the dummy environment."""
    return DummyMultiAgentEnv(config)


env_name = "DummyMultiAgentEnv"
register_env(env_name, create_env)


def get_trainer_config(env_config, train_policies, num_workers=5, framework="torch"):
    """Build configuration for 1 run."""

    # trainer config
    config = {
        "env": env_name, "env_config": env_config, "num_workers": num_workers,
        # "multiagent": {"policy_mapping_fn": lambda x: x, "policies": policies,
        #               "policies_to_train": train_policies},
        "framework": framework,
        "train_batch_size": 8192,

        'batch_mode': 'truncate_episodes',

        "callbacks": TraceMallocCallback,
        "lr": 0.0,
    }
    return config


def tune_run():
    parser = argparse.ArgumentParser(description='Run experiments')

    parser.add_argument('--debug', action='store_true', help='Debug mode')
    parser.add_argument('--yaml-file', help='YAML file containing GDY for the game')
    parser.add_argument('--root-directory', default=os.path.expanduser("~/ray_results"))

    args = parser.parse_args()

    #wandbLoggerCallback = WandbLoggerCallback(
    #    project='ma_mem_leak_exp',
    #    api_key_file='~/.wandb_rc',
    #    dir=args.root_directory
    #)

    ray.init(ignore_reinit_error=True, num_gpus=1, include_dashboard=False)
    config = get_trainer_config(train_policies=['player_1', 'player_2'], env_config={})
    return tune.run(ImpalaTrainer,
                    config=config,
                    name="dummy_run",
                    local_dir=args.root_directory)
                    #callbacks=[wandbLoggerCallback])


if __name__ == '__main__':
    tune_run()

You will see the memory usage climb in a ladder pattern indefinitely (until the process crashes).

There are no obviously leaking Python objects in the tracemalloc logging, but the memory usage keeps increasing.

I’ve tried doing gc.collect() every few episodes per worker, which doesn’t help.

sven1977 · May 12, 2021, 6:08pm

Hey @Bam4d, thanks for the reproduction script! I’ll take a look.

sven1977 · May 12, 2021, 6:52pm

I can’t reproduce this on my Mac. Memory consumption seems very stable. The only change I made was to take out the GPU (num_gpus=0).
I can try again on a GPU machine.

Bam4d · May 12, 2021, 7:18pm

Like I said, it’s very hard to reproduce. It only seems to happen on Linux when cgroups are enabled. You might have to spin up a Docker image (which I think uses cgroups) to reproduce. The issue is that most HPC services use cgroups for resource allocation, so running this on servers/Docker will be a problem for many people.

I’ve done some more testing and I can see that possibly the “simple list collector” might be problematic.

Bam4d · May 12, 2021, 7:22pm

These lines in simple_list_collector.py:488:

    # Make sure our mappings are up to date.
    agent_key = (episode.episode_id, agent_id)
    self.agent_key_to_policy_id[agent_key] = policy_id

Is the episode ID unique for every episode? That would mean that this “agent_key_to_policy_id” map would grow forever, right?
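
To illustrate the concern, here is a toy model of a mapping keyed by (episode_id, agent_id). It is not RLlib’s actual collector code and not the fix that was eventually merged: `MappingCollector`, `end_episode` and `run` are hypothetical names, and the only assumption is that every episode gets a fresh ID. Without a cleanup step the dict gains one entry per agent per episode; with one, it stays empty.

```python
class MappingCollector:
    """Toy collector that keys per-agent state by (episode_id, agent_id)."""

    def __init__(self):
        self.agent_key_to_policy_id = {}

    def add_agent(self, episode_id, agent_id, policy_id):
        self.agent_key_to_policy_id[(episode_id, agent_id)] = policy_id

    def end_episode(self, episode_id, agent_ids):
        """Hypothetical cleanup: drop every key that belongs to a finished episode."""
        for agent_id in agent_ids:
            self.agent_key_to_policy_id.pop((episode_id, agent_id), None)


def run(num_episodes, cleanup):
    """Return how many mapping entries remain after num_episodes episodes."""
    collector = MappingCollector()
    agents = ["player_0", "player_1"]
    for episode_id in range(num_episodes):  # assumption: every episode gets a fresh ID
        for agent_id in agents:
            collector.add_agent(episode_id, agent_id, policy_id="default_policy")
        if cleanup:
            collector.end_episode(episode_id, agents)
    return len(collector.agent_key_to_policy_id)


if __name__ == "__main__":
    print(run(10_000, cleanup=False))  # 20000: two entries per episode, never removed
    print(run(10_000, cleanup=True))   # 0: entries are dropped when the episode ends
```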

Bam4d · May 12, 2021, 7:42pm

Created a branch to test this theory here: GitHub - Bam4d/ray at ma_memory_leak

sven1977 · May 12, 2021, 7:49pm

Awesome, could you PR this when confirmed?

Bam4d · May 12, 2021, 7:51pm

Yeah, just in the process of trying to confirm if this is in fact the problem.

I’ve also learned that Python garbage collection does not actually “see” the limits set by cgroups, so it might just be that Python thinks there is a lot more memory than there is, and just keeps growing the objects. This might be why it’s not reproducible on Linux (with no cgroups) and Mac… either way… I’ll run some more tests and see if mr. leak is gone.
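
A quick way to check whether a process is running under a tighter limit than the machine’s physical RAM is to compare the two directly. The sketch below is Linux-only and makes an assumption worth stating: it reads the limit from the cgroup root mount (`/sys/fs/cgroup`), which is what a process typically sees inside a Docker container; on a host with nested cgroups the relevant file may live in a subdirectory instead. The function names are hypothetical helpers, not part of Ray or psutil.

```python
import os
from pathlib import Path

# Standard file names for the memory controller, relative to the cgroup mount.
CGROUP_LIMIT_FILES = (
    "memory.max",                    # cgroup v2
    "memory/memory.limit_in_bytes",  # cgroup v1
)


def cgroup_memory_limit(root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited or not found."""
    for rel in CGROUP_LIMIT_FILES:
        try:
            raw = (Path(root) / rel).read_text().strip()
        except OSError:
            continue  # file absent or unreadable: try the next layout
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        return int(raw)  # cgroup v1 reports a very large number when unlimited
    return None


def physical_memory():
    """Total physical RAM in bytes as reported by the OS (ignores cgroups)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def effective_memory_limit(root="/sys/fs/cgroup"):
    """Smaller of the cgroup limit and physical RAM."""
    limit = cgroup_memory_limit(root)
    total = physical_memory()
    return total if limit is None else min(limit, total)


if __name__ == "__main__":
    print("physical RAM:", physical_memory())
    print("cgroup limit:", cgroup_memory_limit())
    print("effective:   ", effective_memory_limit())
```

If the effective value is much smaller than physical RAM, anything that sizes itself from the machine total will overestimate what it can use.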

micahtyong · May 13, 2021, 12:50am

Hi @Bam4d, it seems like things are mostly wrapped up here, but in the future, you can try running ray memory --help to unlock some more helpful features. It seems like only object spilling is turned on by default. You can check out the corresponding docs here.

sven1977 · May 13, 2021, 4:35pm

Either way, we are not cleaning up that dict ever, so that’s definitely a great catch by you, even if it’s just a small leak (leaking strings)!

@Bam4d let me know what else you find. I prepped this PR here; it’s good to go, but feel free to do your own PR and ping me here for merging it.

github.com/ray-project/ray

[RLlib] Fix small memory leak in `SimpleListCollector` (ever growing `self.agent_key_to_policy_id` dict).

ray-project:master ← sven1977:fix_small_memory_leak_in_simple_list_collector

opened 04:32PM - 13 May 21 UTC

sven1977

+8 -2

Fix small memory leak in `SimpleListCollector` (ever growing `self.agent_key_t…o_policy_id` dict).
Shoutout to @Bam4d for catching this leak! Also see this discussion here: https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100/6

## Why are these changes needed?

## Related issue number

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Bam4d · May 13, 2021, 5:04pm

Small leaks over 1B timesteps end up being big leaks

sven1977 · May 13, 2021, 7:32pm

Haha, yeah, absolutely!

mannyv · September 27, 2021, 11:16pm

@Bam4d,

Thank you for taking the time to get your memory tracking callback into the RLlib repo. I had a huge memory leak in my environment that I would have spent forever tracking down without your callback.

Here is a before and after:

earneet · August 29, 2022, 4:17pm

More than a year has passed. Has this problem been solved? My colleague has also run into this trouble recently. Is there a feasible solution?

mannyv · August 31, 2022, 12:32pm

Hi @earneet,

Several memory leaks have been found and fixed since that post. Do you have a sense of whether the memory leak is coming from rllib or the environment? We have seen both kinds of memory leaks.

There is an rllib callback you can enable to help find memory leaks. More information about that here: How To Contribute to RLlib — Ray 3.0.0.dev0

earneet · September 1, 2022, 2:42am

Thank you for your reply. I found the problem after tracing: it is not a memory leak in ray. The learner is simply consuming data too slowly, which leads to the accumulation of sampled data.
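
The effect can be shown with a toy producer/consumer model. This is not how RLlib’s internals are implemented; `simulate_backlog` and its parameters are hypothetical, and “tick” is an arbitrary time unit. When samplers produce batches faster than the learner consumes them, the backlog grows linearly and looks like a leak in a memory graph, even though nothing is leaked. A cap on the queue, or a faster learner, keeps it flat.

```python
from collections import deque


def simulate_backlog(sample_rate, learn_rate, ticks, max_queue=None):
    """Return the queue depth after each tick.

    sample_rate: batches produced per tick by the rollout workers.
    learn_rate:  batches consumed per tick by the learner.
    max_queue:   optional cap; when full, new batches are dropped
                 (a crude stand-in for back-pressure).
    """
    queue = deque()
    depths = []
    for _ in range(ticks):
        for _ in range(sample_rate):
            if max_queue is None or len(queue) < max_queue:
                queue.append(object())  # placeholder for a sample batch
        for _ in range(min(learn_rate, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


if __name__ == "__main__":
    print(simulate_backlog(3, 1, 5))               # [2, 4, 6, 8, 10]: unbounded growth
    print(simulate_backlog(3, 1, 5, max_queue=4))  # [2, 3, 3, 3, 3]: capped
    print(simulate_backlog(1, 3, 5))               # [0, 0, 0, 0, 0]: learner keeps up
```

This matches the kind of remedy described in this thread: giving the learner more capacity changes the balance between the two rates.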

Blubberblub · September 15, 2022, 7:42am

@earneet What tracing did you do to figure out that the speed of the learner is the problem? I’m having a similar issue right now and am trying to check if this is the case as well.

hridayns · September 19, 2022, 8:38pm

Hello, how did you fix this? Please advise.

earneet · September 20, 2022, 9:49am

I am not very sure. I observed the log and the memory curve: whenever samples were reported, memory would skyrocket. Then I increased the number of GPUs and worker_num, and then it worked.
