Memory leak with PPO and GNN custom model

TheExGenesis · September 15, 2021, 9:54pm

Hi, it’s as the title says.

My environment is fast and lightweight, but training is not: the first call to trainer.train() takes about 5 minutes and barely completes, and the second call crashes the whole process.

I tested the model in a supervised training loop and it didn’t leak memory. The environment isn’t the cause either, because I generated the data for that test directly from it.
I also ran the same code with a dummy model and it still crashed, so I’m convinced the problem is somewhere in RLlib’s training loop.

I wonder if there’s a way to debug memory usage by object. ray memory shows 0 B in use, I assume because I’m in local mode, and the Ray dashboard is unusable.

Here is the code with no local dependencies; it should reproduce the problem.

https://gist.github.com/TheExGenesis/d95e5939da5f70bdc037b4d504f89959

egt_gnn_ppo.py

"""
Simple graph env where nodes play social dilemmas. The agent picks a node to turn into a cooperator. The goal is to make the graph converge towards cooperation.
based on simple_graph_env.py but actions choose policy instead of choosing node.

Debug run:
$ python simple_graph_heuristic_custom_ffn_env.py --framework torch --local-mode --stop-iters 5 --stop-timesteps 20000 --stop-reward 1
"""

# %%
from gym import spaces

(File truncated here; the full source is in the gist linked above.)

Here’s the error trace when it finally crashes:

https://gist.github.com/TheExGenesis/9c09a94b3f8814578ae8980af9e33a3c

egt_mem_traceback.txt

torch/complex_input_net.py in __init__(self, obs_space, action_space, num_outputs, model_config, name)
     58         for i, component in enumerate(self.flattened_input_space):
     59             # Image space.
---> 60             if len(component.shape) == 3:
     61                 config = {
     62                     "conv_filters": model_config["conv_filters"]

TypeError: object of type 'NoneType' has no len()
NameError: name 'flatten_space' is not defined
---------------------------------------------------------------------------

(Traceback truncated here; the full text is in the gist linked above.)

Finally, RLlib doesn’t seem to detect my GPU, even though torch.cuda.is_available() returns True.
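
A possible explanation for the GPU part, offered as a guess rather than a diagnosis: RLlib's trainer config has a num_gpus key that defaults to 0, so the trainer stays on the CPU unless a GPU is requested explicitly, regardless of what PyTorch reports. A minimal sketch of building such a config follows. The helper build_ppo_config and the env name "my_graph_env" are hypothetical, and how GPUs behave in local mode may differ between Ray versions.

```python
def build_ppo_config(base_config, cuda_available):
    """Return a copy of base_config that requests one GPU when CUDA is usable.

    Hypothetical helper: only the "framework" and "num_gpus" keys are
    RLlib config keys, everything else here is illustrative.
    """
    config = dict(base_config)
    config["framework"] = "torch"
    # num_gpus defaults to 0 in RLlib, so a GPU must be requested explicitly.
    config["num_gpus"] = 1 if cuda_available else 0
    return config


try:
    import torch
    cuda_ok = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed here, so fall back to a CPU-only config.
    cuda_ok = False

# "my_graph_env" is a placeholder for the registered environment name.
config = build_ppo_config({"env": "my_graph_env"}, cuda_ok)
print(config)
```

The resulting dict would then be passed as the trainer config in place of one that omits num_gpus.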

mannyv · September 16, 2021, 2:51pm

Hi @TheExGenesis,

Perhaps this will be helpful to you.
https://docs.ray.io/en/latest/rllib-dev.html?highlight=memory#finding-memory-leaks-in-workers
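
Independent of RLlib, Python's standard-library tracemalloc module can show which source lines allocate memory that is never freed, which is one way to look at memory "by object" when everything runs in a single process, as it does in local mode. The sketch below uses a deliberately leaky stand-in function; top_allocation_growth and leaky_step are hypothetical names. With RLlib, the workload would be the trainer.train call from the original post, and worker processes outside local mode would not be covered.

```python
import tracemalloc


def top_allocation_growth(workload, repeats=3, limit=5):
    """Run workload several times and return the source lines whose
    traced allocations grew the most between the two snapshots."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(repeats):
        workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts the differences by absolute size change, largest first.
    return after.compare_to(before, "lineno")[:limit]


_leak = []  # stand-in for state that keeps growing between iterations


def leaky_step():
    # Each call keeps about 1 MB alive, imitating a leak in a training step.
    _leak.append(bytearray(1_000_000))


for stat in top_allocation_growth(leaky_step):
    print(stat)
```

Lines that keep climbing across repeated calls are the ones worth inspecting. A batch that is simply large, rather than leaked, shows up as a one-time jump instead.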

TheExGenesis · September 17, 2021, 11:07am

I solved my problem. I was using the default hyperparameters, and the default batch sizes were too large for my setup. It worked fine with these:

"rollout_fragment_length": 10,
"train_batch_size": 100,
"sgd_minibatch_size": 10,
"num_sgd_iter": 3,
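
For anyone comparing their own settings, the arithmetic behind these four keys can be sketched as below. This is a rough estimate, not RLlib's exact bookkeeping: ppo_batch_breakdown is a hypothetical helper, minibatch counts are rounded up, and the comparison values (200, 4000, 128, 30) are my recollection of the RLlib 1.x PPO defaults and should be checked against the installed version.

```python
import math


def ppo_batch_breakdown(rollout_fragment_length, train_batch_size,
                        sgd_minibatch_size, num_sgd_iter):
    """Estimate how much work one train() call does for a given PPO config."""
    # Number of rollout fragments needed to fill one training batch.
    fragments = math.ceil(train_batch_size / rollout_fragment_length)
    # Number of minibatches in one pass (epoch) over the training batch.
    minibatches = math.ceil(train_batch_size / sgd_minibatch_size)
    return {
        "fragments_per_batch": fragments,
        "minibatches_per_epoch": minibatches,
        "sgd_steps_per_train_call": minibatches * num_sgd_iter,
    }


# Settings from this thread.
posted = ppo_batch_breakdown(10, 100, 10, 3)
# Assumed RLlib 1.x PPO defaults (unverified, see the note above).
assumed_defaults = ppo_batch_breakdown(200, 4000, 128, 30)

print("posted:", posted)
print("assumed defaults:", assumed_defaults)
```

With the posted values, one train() call holds 100 samples and runs 30 minibatch updates. If the assumed defaults are right, the default batch holds 40 times as many samples at once, which matters when each observation is a whole graph.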
