Ray consumes all my RAM

Hi!

I am new to Ray and currently experimenting with it. I had already written some functions similar to what Ray provides, and I am now trying to migrate my framework to Ray. Basically, I would like to copy the same environment several times and run the copies in parallel, which Ray seems well suited for according to the documentation. I managed to initialize Ray and create one actor per environment. But when I start iterating over the episode, calling a function in each environment with .remote() and retrieving the values with ray.get(), Ray fills up all my RAM and my Linux machine freezes. I tried to limit the maximum RAM Ray can use in its config, but that does not seem to work.

So, after some days of reading the documentation and debugging, my questions are:

  • can anybody give a hint why the memory is eaten up? (It should be freed after each step, but it seems Ray might not free it. Without Ray I had no issues.)
  • am I misunderstanding the concept of .remote()?
  • how can I limit Ray’s maximum RAM usage? (By RAM I mean the host memory accessible to the CPU, not GPU memory.)

I also tried to experiment with the RLlib examples (copy-pasting the ones on Ray’s page), but the example I tried does not work and throws an error. Can anybody recommend a working example to experiment with Ray/RLlib?

Thanks!

I recommend you to check Ray design patterns — Ray v2.0.0.dev0 and see if you are following any anti-pattern.

Thank you, Sangcho!

I’ve checked it but in the meantime I narrowed down the issue and I think I’ve found the root cause of it.

It goes as follows (I did not find it among the anti-patterns):

  1. I create 20 environments that I would like to run in parallel with Ray. Each env is an instance of my custom environment class.

This environment contains a function that is called regularly and returns PyTorch tensors.
If I add .clone() to the return value of the function, Ray does not mess up my RAM. If I do not add .clone(), Ray causes RAM issues (my Linux machine even stops responding and I need to restart it).

So something like this

import ray

@ray.remote
class MyEnv:
    def __init__(self):
        # handler_instance is my own data-handler class
        self.my_data_handler_instance = handler_instance()

    def get_item(self, idx: int):
        data_on_cpu = self.my_data_handler_instance.get_the_data(idx)
        # return data_on_cpu        # causes the RAM issue
        return data_on_cpu.clone()  # works fine

And this is the example code I use to call the function in the class of the environments:

ray.get([envs[idx].get_item.remote(idx) for idx in range(nr_of_envs)])

Do you think this is expected behavior for Ray?

Hmm, interesting. I am not exactly sure why this happens. I have a couple of questions:

  1. What’s the type of data_on_cpu? PyTorch tensors? cc @Clark_Zinzow, is this zero-copyable?
  2. Could you show me the output of ray memory both when you hit the RAM issue and when you don’t?
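
One thing that might be worth ruling out, purely as a guess on my side: the PyTorch serialization notes describe that saving a tensor which is a view saves the whole storage it shares, not just the viewed elements, and they suggest clone() to avoid that. If get_the_data returns a slice of a large dataset tensor, every return value could drag the full dataset along when it is serialized. I have not confirmed that this is what happens in your case, or that Ray takes that code path. The sketch below only illustrates the mechanism in plain Python, so it runs without torch or Ray. make_view and clone are hypothetical stand-ins, with a dict playing the role of the tensor view.

```python
import pickle


def make_view(base: bytes, start: int, length: int) -> dict:
    """Hypothetical stand-in for a tensor view: a window onto a shared buffer."""
    return {"base": base, "start": start, "length": length}


def clone(view: dict) -> dict:
    """Copy only the bytes the view covers into a fresh, small buffer."""
    start, length = view["start"], view["length"]
    own = view["base"][start:start + length]
    return make_view(own, 0, length)


dataset = bytes(10_000_000)        # ~10 MB backing buffer (stand-in for a dataset)
view = make_view(dataset, 0, 100)  # a 100-byte window onto that buffer

# Pickling the view serializes the whole backing buffer it references.
view_size = len(pickle.dumps(view))
# Pickling the clone serializes only the 100 copied bytes plus a little overhead.
clone_size = len(pickle.dumps(clone(view)))

print(f"view: {view_size} bytes, clone: {clone_size} bytes")
```

If something like this is going on, 20 actors each returning a view on every step would push 20 full copies of the backing data through serialization per step, which would match the symptom that .clone() makes the problem go away.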

Hi! I will have a look. It is not 100% clear to me what you mean by the output of ray memory. Is it the dashboard overview where the memory consumption is shown, or something else? If you could add a screenshot and highlight what you are interested in, I could show it to you.

Ah, sorry for the confusion. I meant the CLI ray memory. Memory Management — Ray v2.0.0.dev0
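
If it helps to compare the two cases side by side, here is a small sketch that runs the ray memory CLI and saves what it prints to a file. The function name save_ray_memory_snapshot and the file names are made up for illustration. It assumes the ray executable is on your PATH and that your Ray cluster is running when you call it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def save_ray_memory_snapshot(path: str, timeout_s: float = 60.0) -> Optional[str]:
    """Run `ray memory` and save its output to `path`.

    Returns the captured text, or None if the `ray` executable is not
    on PATH or the command does not finish within `timeout_s` seconds.
    """
    ray_cli = shutil.which("ray")
    if ray_cli is None:
        return None
    try:
        result = subprocess.run(
            [ray_cli, "memory"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    # Keep stderr too, so connection errors end up in the snapshot.
    report = result.stdout + result.stderr
    Path(path).write_text(report)
    return report
```

You could call it once from a second terminal or script while the version without .clone() is running, for example save_ray_memory_snapshot("without_clone.txt"), and once for the version with .clone(), then post both files here.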