PPO trainer eating up memory

Hi there,

I’m trying to train a PPO agent via self-play in my multi-agent env. At the moment it manages about 320 training iterations before my system runs out of memory (16 GB RAM plus 16 GB of swap). If I restart the training from the last checkpoint it made, it can carry on for another 320 or so iterations. This is quite frustrating, as it makes it hard to see what my metrics are doing over a large number of iterations, and I cannot leave the training unattended.

Can you point me in the right direction for debugging this? I’m not really sure what would be causing it.

Here’s the output of `ray memory`:

$ ray memory
 Object ID                                                Reference Type       Object Size   Reference Creation Site
; driver pid=108986
9c0ef4b6eaa363ed69a6825d641b461327313d1c0100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:<listcomp>:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:base_iterator:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:apply_foreach:783
ffffffffffffffffee4e90da584ab0eb031f18d40100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:_make_worker:362 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:<listcomp>:141 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:add_workers:141
ffffffffffffffffa67dc375e60ddd1a23bd3bb90100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py:_setup_remote_runner:317 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py:_start_trial:380 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py:start_trial:446
55f0c19bd1b4b06b63964fa4841d4a2ecb4575180100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:<listcomp>:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:base_iterator:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:apply_foreach:783
ffffffffffffffff63964fa4841d4a2ecb4575180100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:_make_worker:362 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:<listcomp>:141 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:add_workers:141
ffffffffffffffff69a6825d641b461327313d1c0100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:_make_worker:362 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:<listcomp>:141 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/rllib/evaluation/worker_set.py:add_workers:141
7cca6899f4189233ee4e90da584ab0eb031f18d40100000001000000  LOCAL_REFERENCE                ?   (actor call)  | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:<listcomp>:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:base_iterator:472 | /home/jippo/.conda/envs/yaniv-rl/lib/python3.6/site-packages/ray/util/iter.py:apply_foreach:783

--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 0 objects, 0.0% full

Here is my tune config: yaniv-rl/rllib_selfplay_test.py at e4ac312e3cf05d68d80a3b93e5efdfa967712968 · york-rwsa/yaniv-rl · GitHub

Many thanks,


Hey @Rory, is the script you are linking to a self-sufficient repro script? Meaning, if I ran it as-is, would I see the same issue?

Hi @sven1977, yes, this script can reproduce the error, though you will need to clone the repo to get my game environment and install the extra deps (just rlcard). I just tried it on a fresh compute VM with 32 GB and it still maxes out.

One other thing: I noticed that it runs much slower on the VM (8 vCPUs and a K80), with about a third of the learn throughput and half the sample throughput. I was able to start another experiment (from a different checkpoint) and run the two side by side without any significant drop in either speed. This makes me wonder whether my script is running as efficiently as it could. Is there anything I can do to use more of the resources available? Or can you recommend what I should increase hardware-wise to get more performance on a VM (CPU speed, CPU count, or a different GPU)? I’m not sure what the stats mean in terms of what is holding back performance. I’ve pasted the timers below:

timers:
    learn_throughput: 197.303
    learn_time_ms: 24160.361
    sample_throughput: 329.84
    sample_time_ms: 14452.159
    update_time_ms: 31.04



edit: Just checked, and it turns out I get the same throughput whether I use 0 workers or 7. Does this mean I’ve done something wrong in terms of parallelisation?

@Rory Cool, sorry, didn’t notice the link to your script :slight_smile:

On the stats: your sample time seems small compared to the learning time. You could try increasing the number of workers or num_envs_per_worker?

I will run your script now and see whether I can reproduce the leak …

I’m not seeing any memory leaks so far. Running it w/o GPU and w/o the WandbLogger, though.
Also, I’m noticing you set local_mode=True, which means you would do everything sequentially (all workers) on a single CPU (no GPU, even if num_gpus=1).
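For reference, a minimal sketch of what that change looks like (assuming the env is registered under "yaniv"; the `num_workers`/`num_gpus` values here are just examples, not taken from the linked script):

```python
import ray
from ray import tune

# local_mode=True runs all workers serially inside the driver process.
# That is handy for debugging with breakpoints, but it disables real
# parallelism and can prevent GPU placement for the trainer.
# For actual training it should be off:
ray.init(local_mode=False)

tune.run(
    "PPO",
    config={
        "env": "yaniv",    # example: env registered under this name
        "num_workers": 7,  # rollout workers run as separate processes
        "num_gpus": 1,     # GPU reserved for the trainer process
    },
)
```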

Thanks Sven :slight_smile: The number of workers significantly contributes to the increase in usage: the growth is very steady, and steeper with more workers. It’s not nearly as much of an issue with 0 workers. I also recently installed the nightly build, which may have made a difference?

Thanks for the info on the stats. I’m still not entirely sure I understand what they mean. As far as I understand, learn_time_ms is how long it takes for the algorithm to update its weights based on the samples collected, and sample_time_ms is how long it takes to gather the samples. If the sample time is smaller than the learn time, doesn’t this mean that Ray is able to gather enough samples while the RL algo is still learning?

As for local_mode=True, I’d forgotten I had set that! I managed to get it to use my GPU by importing torch and calling torch.cuda.is_available() before starting training; before doing that it wouldn’t. I didn’t realise this was due to the local_mode setting.

I set local_mode=False and tried with 1 worker, but it just hangs (I left it for 5 mins to see what happens). 0 workers still works without local mode. Any idea how I can figure out what is going wrong? Ray still spawns a bunch of processes, though they’re at a very low CPU%.

This is the output:

$ python rllib_selfplay_test.py --num-workers 1
2021-03-31 11:21:44,919 INFO services.py:1264 -- View the Ray dashboard at
== Status ==
Memory usage on this node: 5.1/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/4 CPUs, 1.0/1 GPUs, 0.0/5.78 GiB heap, 0.0/2.89 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/jippo/ray_results/YanivTrainer_2021-03-31_11-21-46
Number of trials: 1/1 (1 RUNNING)
| Trial name                     | status   | loc   |
| YanivTrainer_yaniv_e5971_00000 | RUNNING  |       |

After a bit of debugging, it seems to hang somewhere during trainer setup:

self.trainer = PPOTrainer(env="yaniv", config=config)

More specifically it fails here: ray/worker_set.py at master · ray-project/ray · GitHub

I’m not sure how to debug the workers any further; it just hangs.

Can you try on the latest master?
Also, you have to make the following changes:

  1. Remove the to_json() call. tune can now take a PlacementGroupFactory object directly.
  2. make sure local_mode=False
  3. Remove the WandB logger entirely (I ran w/o the WandB logger and w/ three workers and didn’t see any memory issues):
== Status ==
Memory usage on this node: 8.4/16.0 GiB    <----- HERE: this is very constant throughout the run ----->
Using FIFO scheduling algorithm.
Resources requested: 5.0/16 CPUs, 0/0 GPUs, 0.0/4.85 GiB heap, 0.0/2.43 GiB objects
Result logdir: /Users/sven/ray_results/YanivTrainer_2021-03-31_12-07-43
Number of trials: 1/1 (1 RUNNING)
| Trial name                     | status   | loc                 |   iter |   total time (s) |      ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
| YanivTrainer_yaniv_ef0bf_00000 | RUNNING  | |    276 |          9408.38 | 1173852 |        0 |                    0 |                    0 |            37.3306 |
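Change (1) might look roughly like this (a sketch only; the bundle shapes and the use of `resources_per_trial` here are illustrative examples, not taken from the script):

```python
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory

# One bundle for the trainer process plus one per rollout worker
# (example shapes; match these to your actual config).
pgf = PlacementGroupFactory([{"CPU": 1, "GPU": 1}] + [{"CPU": 1}] * 7)

tune.run(
    "PPO",
    config=config,  # your existing trainer config, defined elsewhere
    # Previously this needed pgf.to_json(); on recent masters Tune
    # accepts the PlacementGroupFactory object directly:
    resources_per_trial=pgf,
)
```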

On the stats, yes, your interpretation is correct. But also note that for PPO, training and rollouts do not happen simultaneously, but sequentially (PPO is on-policy).
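To make that concrete with the timers posted above: since sampling and learning alternate rather than overlap, one training iteration costs roughly their sum, so the learn phase is the bigger share here (back-of-the-envelope arithmetic, not an RLlib API):

```python
# Timers from the run above (milliseconds). Because PPO is on-policy,
# sampling and learning alternate, so an iteration costs about their sum.
learn_time_ms = 24160.361
sample_time_ms = 14452.159
update_time_ms = 31.04

iteration_ms = learn_time_ms + sample_time_ms + update_time_ms
print(round(iteration_ms / 1000, 1))           # ≈ 38.6 s per iteration
print(round(learn_time_ms / iteration_ms, 2))  # learning is ≈ 0.63 of it
```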

I may have been unclear in my previous answer: yes, increasing num_workers or num_envs_per_worker may further decrease your sampling time, which I think is always a good thing.
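In config terms, that’s just these two keys (the values here are illustrative, not a recommendation):

```python
config = {
    # ... your existing PPO/env settings ...
    "num_workers": 7,          # parallel rollout worker processes
    "num_envs_per_worker": 4,  # vectorize several env copies per worker
}
```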
If you want to decrease your learning time, you may try multi-GPU (tf-only so far, but we are working on torch support right now).

Fab, thanks Sven! Seems to be working well now :slight_smile: Thanks for the clarity on the numbers too!
