How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I have actually been stuck on this memory leak for a while. I was originally using Ray Tune to set up and run my simple experiment. After hundreds of thousands of training iterations, the program would error out, telling me it had run out of memory. I tried configuring the program to use different numbers of workers, running with and without my GPU, and testing the environment itself for memory leaks (external to Ray).
Because I couldn't prevent the program from running out of memory and I wasn't at the point where I wanted to tune my algorithm, I changed my program to configure the algorithm directly with PPOConfig. I configured my PPO algorithm to use the MemoryTrackingCallbacks callback. I started off by testing only tens of iterations with a GPU and 6 workers. The memory leak persisted, so I tried using only a single worker without a GPU. I noticed that with a single worker I wasn't running out of memory on a short test of a couple hundred iterations. However, after that test succeeded, I tried 2 workers with no GPU and the RAM usage percentage started to tick up pretty quickly.
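For reference, here is roughly how I am building the algorithm now (a trimmed-down sketch; the env name and the exact worker/GPU numbers below are placeholders rather than my real settings):

from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: "MyEnv-v0" and the worker/GPU counts stand in for my actual setup.
config = (
    PPOConfig()
    .environment(env="MyEnv-v0")
    .rollouts(num_rollout_workers=2)
    .resources(num_gpus=0)
    .callbacks(MemoryTrackingCallbacks)
)
algo = config.build()

for i in range(200):
    results = algo.train()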
I went through the data from the custom metrics of the MemoryTrackingCallbacks callback, and here are some graphs that show which metrics were ticking up along with the RAM usage percentage.
I am not really sure what to do next. I have been stuck with this memory leak for a while now. I have read multiple threads here and on Stack Overflow, and I have read through the documentation and the source code. I'm stuck.
I think it is important to note that I am running this training program in a Docker container, as I did read somewhere that Linux cgroups could be causing this problem.
I should also note that I am using Python 3.7 and ray[rllib]==2.0.1.
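On the cgroups point, this is the kind of sanity check I have been doing to see what memory limit the container actually exposes (a sketch assuming cgroup v1 paths; under cgroup v2 the limit lives in a different file):

import psutil  # may need `pip install psutil` if it is not already present

# cgroup v1 path; under cgroup v2 the limit is elsewhere (e.g. /sys/fs/cgroup/memory.max),
# so treat this as a rough check rather than something universal.
try:
    with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
        limit_bytes = int(f.read().strip())
    print(f"cgroup memory limit: {limit_bytes / 1e9:.2f} GB")
except FileNotFoundError:
    print("cgroup v1 memory limit file not found")

print(f"total memory visible to psutil: {psutil.virtual_memory().total / 1e9:.2f} GB")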
Here is a callback I use to help track down memory leaks. I wrote and used this a while ago, so you may need to update the method signature if it has changed.
You can tune how many object types to report on by changing the 50 in sorted_object_count[:50] to a suitable number.
import gc
from collections import defaultdict

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class PythonObjectTrackingCallbacks(DefaultCallbacks):
    def __init__(self):
        super().__init__()

    def on_episode_end(
        self,
        *,
        worker,
        base_env,
        policies,
        episode,
        env_index=None,
        **kwargs,
    ):
        # Count all live Python objects by type at the end of each episode.
        object_count = defaultdict(int)
        for obj in gc.get_objects():
            object_count[str(type(obj))] += 1

        # Report the most common object types as custom metrics.
        sorted_object_count = sorted(
            object_count.items(), key=lambda item: item[1], reverse=True
        )
        for obj_type, obj_count in sorted_object_count[:50]:
            episode.custom_metrics[f"gcobjects/{obj_type}/count"] = obj_count
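If it helps, this is roughly how I hook it up (a sketch only; CartPole and PPO are just stand-ins for whatever you are actually running):

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch: CartPole-v1 is a placeholder for your real env.
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .callbacks(PythonObjectTrackingCallbacks)
)
algo = config.build()
result = algo.train()

# The per-type counts show up aggregated under result["custom_metrics"]
# with _mean/_min/_max suffixes.
print({k: v for k, v in result["custom_metrics"].items() if k.startswith("gcobjects/")})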
Well, that really doesn't clear much up, does it? Can you share anything about your configuration or setup? A reproduction script?
One thing I would try next is to swap out the real environment for a RandomEnv, to help disentangle whether the leak is in the environment or in the RL algorithm; see the sketch below.
If you have a custom model, I would also try switching to a built-in model.
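Roughly along these lines, using the RandomEnv that ships with RLlib (a sketch; set the spaces to match your real env so the rest of your pipeline, including a custom model, keeps working):

import gym
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.random_env import RandomEnv

# Sketch: configure RandomEnv to mimic your real env's spaces; the Box/Discrete
# shapes here are placeholders.
config = (
    PPOConfig()
    .environment(
        env=RandomEnv,
        env_config={
            "observation_space": gym.spaces.Box(-1.0, 1.0, (8,)),
            "action_space": gym.spaces.Discrete(4),
        },
    )
    .rollouts(num_rollout_workers=2)
    .resources(num_gpus=0)
)
algo = config.build()
for _ in range(100):
    algo.train()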
I agree that it doesn't really clear much up. I have a custom model and environment that I am working with, and I think it would be hard to get you a reproduction script without sending in the whole repository. Like I said before, I am working inside of a Docker container; however, I am currently running a test outside of a container. If that doesn't clear things up, I will be sure to test a random environment and get you a reproduction script if the memory leak still persists. I'll post back shortly with info about training outside of Docker.
I would also like to note that I don't think there was a memory leak when using a single worker (and no GPU) inside of a Docker container. I will go back and check that.
We also run some memory leak tests, specifically for PPO, so I'd say it's not super likely that this stems from inside RLlib itself. This can also happen if your rollouts are crazy long, I guess.
I went back and did a longer run (relative to earlier) with a single worker, both inside and outside of Docker, and it doesn't seem like there is any difference in the memory leak, so please ignore what I said earlier about it being Docker-related.
What are your thoughts on the MemoryTrackingCallbacks callback [How To Contribute to RLlib — Ray 3.0.0.dev0] returning worker/data_mean, worker/rss_mean, and vms_mean as some of the top 20 memory users that also tick up with the RAM usage?
I'm gonna try running the training with the MemoryTrackingCallbacks callback for longer… just gonna wait until overnight to do it, because the callback seems to slow down the training quite a bit.
I think the first thing you should do would be what @mannyv suggested - replace your env with a random env. It’s very little work.
What counts as a crazy long rollout depends on your settings. If your episodes are 10k steps long and you set rollout_fragment_length to 10k, RLlib's sample collection code will buffer those 10k samples for each env you are evaluating on. So for 100 workers, that would be 1M samples. That'd be a lot of memory, even for simple Atari envs.
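As a back-of-the-envelope illustration of that buffering (the numbers are the hypothetical ones from above, not anything measured):

# Rough illustration only; actual memory use also depends on obs/action sizes.
num_rollout_workers = 100
num_envs_per_worker = 1
rollout_fragment_length = 10_000  # set equal to the 10k-step episode length

samples_buffered = num_rollout_workers * num_envs_per_worker * rollout_fragment_length
print(samples_buffered)  # 1_000_000 samples held before a train batch is built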
I used the RandomEnv class for the environment, and it doesn't seem like the memory leak persists there. I guess I'll have to check my environment again… I don't believe my episodes are anywhere close to 10k steps long.
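For that environment check, my plan is to step the env in a plain Python loop outside of RLlib and watch the process RSS, something like this (a sketch; MyEnv is a hypothetical stand-in for my actual environment class):

import gc
import psutil

from my_project.envs import MyEnv  # hypothetical import; replace with your env

env = MyEnv()
proc = psutil.Process()

obs = env.reset()
for step in range(1_000_000):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
    if step % 10_000 == 0:
        gc.collect()
        print(f"step={step} rss={proc.memory_info().rss / 1e6:.1f} MB")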